
UPTEC F 14032

Degree project 30 hp, June 2014

High Performance Optimization on Cloud for a Metal Process Model

Adam Saxén



Abstract

High Performance Optimization on Cloud for a Metal Process Model

Adam Saxén

The Amazon Elastic Compute Cloud (EC2) is a service providing on-demand compute capacity to the public. In this thesis, scientific software performing global optimization on a metal process model is implemented in parallel using MATLAB and provisioned as a service from Amazon EC2.

The thesis is divided into two parts. The first part concerns improving the serial software, analyzing different optimization methods, and implementing a parallel version; the second part is about evaluating the parallel performance of the software, both on different computer resources in Amazon EC2 and on a local cluster. It is shown that the parallel performance of the software in Amazon EC2 is similar to, and for some provisioned resources even surpasses, that of the local cluster. Factors affecting the performance of the global optimization methods are found and related to network communication and virtualization of hardware, where the method MultiStart has the best parallel performance. Finally, the runtime for a large optimization problem was successfully reduced from 5 hours (serial) to a few minutes (parallel) when run on Amazon EC2, at a total cost of just $25-30.

ISSN: 1401-5757, UPTEC F 14032
Examiner: Tomas Nyberg
Subject reader: Andreas Hellander
Supervisor: Kateryna Mishchenko


Contents

1 Introduction
  1.1 Purpose
  1.2 Scope
  1.3 Background
  1.4 Definitions

2 Problem description
  2.1 Hot Rolling

3 Optimization theory
  3.1 Definition
  3.2 Optimality conditions
  3.3 Global optimization
  3.4 MATLAB Optimization
    3.4.1 Local optimization method fmincon
    3.4.2 Global Optimization method MultiStart
    3.4.3 Global Optimization method GlobalSearch
    3.4.4 Global Optimization method Patternsearch

4 High Performance Computing theory
  4.1 Parallelization
    4.1.1 Parallel performance metrics
    4.1.2 Performance limitations
      4.1.2.1 Parallel delay
    4.1.3 MATLAB Parallelism
      4.1.3.1 fmincon and GlobalSearch parallelism
      4.1.3.2 MultiStart parallelism
      4.1.3.3 Patternsearch Parallelism
  4.2 Cloud Computing
    4.2.1 Infrastructure

5 The Software - Analysis
  5.1 The model and its performance
  5.2 Code improvements
    5.2.1 Memory efficiency
    5.2.2 Redundant computations
    5.2.3 MEX-Files
  5.3 Parameter study
    5.3.1 Discretization of the temperature field
    5.3.2 Convergence study
    5.3.3 Conclusions
  5.4 Software Parallelism
    5.4.1 The workload of the software
    5.4.2 Implementation of parallelism
      5.4.2.1 fmincon and GlobalSearch
      5.4.2.2 MultiStart
      5.4.2.3 Patternsearch
      5.4.2.4 Performance analysis
    5.4.3 Parallel limitations
      5.4.3.1 Parallel Overhead
      5.4.3.2 Granularity
      5.4.3.3 Load balancing
  5.5 Summary of analysis

6 Computer resources

7 Cloud Assessment - Method and Results
  7.1 Cloud Computing through Mathwork's CloudCenter
    7.1.1 Optimization method comparison
    7.1.2 Maximized number of workers
  7.2 Comparison Cloud instances
  7.3 Cloud bursting

8 Cloud Assessment - Discussion
  8.1 Cloud Performance
  8.2 The ADM as a Service
    8.2.1 Cloud considerations
    8.2.2 Hybrid solutions
  8.3 Cost analysis

9 Conclusions
  9.1 Future research
    9.1.1 Heterogeneity within Amazon EC2
    9.1.2 Replacing Mathwork's CloudCenter

A Study of data storage and security on cloud
  A.1 Introduction
    A.1.1 Environment
    A.1.2 Placement group
  A.2 Data Management
    A.2.1 Storage types
      A.2.1.1 Amazon Elastic Block Storage (EBS)
      A.2.1.2 Amazon Instance Store
      A.2.1.3 Amazon Simple Storage Service (S3)
  A.3 Security
    A.3.1 Introduction
    A.3.2 Categories
      A.3.2.1 Network Security
      A.3.2.2 Interfaces
      A.3.2.3 Data Security
      A.3.2.4 Virtualization
      A.3.2.5 Governance
      A.3.2.6 Compliance
      A.3.2.7 Legal issues
    A.3.3 Amazon Virtual Private Cloud (Amazon VPC)
      A.3.3.1 The infrastructure
      A.3.3.2 Amazon Direct Connect
      A.3.3.3 Dedicated Instances

B Parameter study
  B.1 Discretization - method and results
  B.2 Convergence study - method and results
    B.2.1 Parameter study - Additional plots


1 Introduction

1.1 Purpose

The purpose of this report is to investigate how a scientific program can be parallelized and operated through a cloud service, in particular by extending and analysing global optimization software developed by ABB to perform parallel optimization on the Amazon Elastic Compute Cloud (EC2). The software in question optimizes a metal working process called Hot Rolling.

The general method will be as follows:

• Analyze the software in preparation for implementing an efficient parallel version of it.

• Implement and test the parallel software on both a traditional cluster at ABB and a public Amazon EC2 virtual cluster.

• Evaluate the parallel performance of the software and assess its use as a cloud service.

1.2 Scope

The report covers a short introduction to Hot Rolling; some optimization theory; high performance computing theory; the definition and infrastructure of cloud computing; a thorough analysis of the software; benchmarks of parallel implementations of the software; and an assessment of the software as a service provisioned through the cloud.

The software is implemented in MATLAB, utilizing MATLAB's Global Optimization Toolbox (GADS), Parallel Computing Toolbox (PCT) and Distributed Computing Server (MDCS). The optimization problem constitutes a multi-physical model in 33 dimensions with a set of non-linear inequality and equality constraints.

This report is limited to the public cloud Amazon EC2 (https://aws.amazon.com/ec2/), which is one of many commercial clouds today. The choice of programming language is limited to MATLAB, mainly because the initial software was developed in MATLAB, but also because of the ease of adding parallelism through existing toolboxes.

1.3 Background

Many scientific applications require immense computing resources, well beyond the capability of regular PCs; one example is MAPAS [1], a tool for prediction of membrane-contacting protein surfaces. Users constantly strive for better computing alternatives to further shrink the computational time-frame. Today, the solutions include parallel computing, cluster computing and the more elusive cloud computing. The concept of parallelism is simple: by dividing and distributing computations over many cores and computers, computation times can be reduced from weeks to hours.



The implementation of these solutions is, however, not straightforward; knowledge of programming and experience in basic system administration are required to fully utilize the power of parallel computing [5].

This report concerns cloud computing for three reasons, the first of which was partly addressed above. Scientists and companies often lack the knowledge to fully take advantage of the parallel capabilities of a distributed system. Cloud computing offers service models called Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) [13] that could lower the threshold for utilizing these capabilities. Hence the cloud environment is a promising computing resource for scientists and companies.

The second reason is a business perspective. A customer investing in e.g. an optimization method for their industrial process would need to acquire computing resources. For parallel applications this could require a distributed system of on-site computers, like a cluster, which is associated with high upfront costs, fluctuating utilization and maintenance costs. The cloud, as an alternative to an on-site cluster, offers on-demand, remote computing resources. The customer only pays for the computing power when it is used, saving both time and up-front costs [2].

The last reason concerns acceptable HPC performance in the cloud. Previous research [3] assessed High Performance Optimization in the cloud as a feasible and comparable choice to traditional clusters. This report is partly a continuation of [3], investigating another optimization problem and broadening the perspective of a cloud application.

Cloud Computing infrastructures have been investigated for their appropriateness as scientific computing platforms. A study [4], released in 2011, is one such investigation and its key findings are noteworthy. One key finding shows that applications with low communication and I/O are well suited for the cloud; for example, there were severe slowdowns (7x) for a tightly coupled code (PARATEC) on Amazon EC2 instances when compared to a Magellan bare-metal (non-virtualized) computing grid. Further findings state that Cloud Computing still requires significant programming and system administration support, and that the cloud environment introduces new security issues; all important issues that will be considered in this report.


1.4 Definitions

HPC High Performance Computing

AWS Amazon Web Services

Amazon EC2 Amazon Elastic Compute Cloud

AMI Amazon Machine Image

Amazon S3 Amazon Simple Storage Service

IaaS Infrastructure as a service

PaaS Platform as a service

SaaS Software as a service

GADS Global Optimization Toolbox

VM Virtual Machine

OS Operating System

std standard deviation

Gbps Gigabit per second

KKT Karush Kuhn Tucker

MS MultiStart

GS GlobalSearch

PS Patternsearch

ADM Adaptive Dimension Model

Weak Scaling Increasing number of cores while keeping the work per core constant.

Strong Scaling Increasing number of cores while keeping total work constant.


2 Problem description

This report consists of two parts. The first is about parallelizing the software, and the second part concerns the use of cloud services for running the parallel software.

The first part: The ABB Hot Rolling software is by itself a time consuming process, requiring the computation of costly objective functions and constraint functions. Depending on the purpose of the optimization, the execution time can increase radically, for example when global optimization is required or when multiple objective functions are computed. Parallelism reduces the execution time to a reasonable time-frame by dividing the computational work into smaller parts that run simultaneously. Implementing a parallel version of the software requires a performance analysis and a basic understanding of the software. From this information, parallelism can be implemented effectively, where effectiveness is measured in performance metrics like execution time, speedup and efficiency.

The second part: Moving the parallelized software to the Amazon Elastic Compute Cloud (EC2) offers desired features like scalability and on-demand resources, but also raises questions regarding computing performance, storage and security. The problem consists of investigating how the parallel software performs on a virtual cluster in the EC2 cloud compared to a physical cluster at ABB. Lastly, cloud services offer a variety of computer resources tailored for specific software and cost requirements; finding a suitable resource specification for the software is also part of the problem.

2.1 Hot Rolling

The industrial process called rolling reshapes metal into desired proportions. It is done by feeding metal through a rolling mill consisting of stands that press the metal using a set of rolls (Fig. 1).

Figure 1: Illustration of the hot rolling process. A hot rod and wire rolling mill, with vertical and horizontal stands, reshapes the metal.


The rolling process can be divided into two categories, namely hot rolling and cold rolling. When metal is rolled at a temperature above the recrystallization temperature the process is referred to as hot rolling; when rolled below the recrystallization temperature it is called cold rolling. This is important since metal properties such as strength and ductility are affected by the microstructure of the metal, which in turn is affected by the rolling temperature [6].

There are many types of rolling, such as flat, ring, wire and rod rolling. The variant determines what kind of mill is required (roll types, number of stands, etc.). The model considered in this report simulates a 10-stand hot rod & wire rolling mill. The metal, referred to as a billet, passes through three stages:

1. Furnace: increases the billet’s temperature, enabling plastic deformation

2. Passes through stands which reshape the billet to requirements, in terms of geometry and metal properties

3. Cooling: decreases the temperature to fixate the new metal profile and metal properties

The billet's cross-section area is carefully reduced with every pass of a stand containing rolls with a specific groove. Usually a combination of vertical stands and horizontal stands is used to obtain the desired geometry, grooving the billet in two dimensions with both oval and circular grooves (Fig. 1).

The simulation model is described by a non-linear multi-physics problem. The reshaping of the metal depends on many factors such as tension, rolling speed, initial temperature, friction, the type of metal, etc. Naturally, simplifications and reasonable discretizations of the process are used to give a sensible computation time. One simplification is to approximate the oval cross section of the billet with a rectangular cross section of the same area.

An important assumption when simulating these types of rolling processes is that the mass flow and volume stay constant during deformation, i.e. the mass flow into the stand equals the mass flow out of the stand [7, p.27]. The simplified rectangular cross section is calculated through the spread b_i and thickness h_i of the metal (Fig. 2).

Considering that the density of the metal is constant, the conservation of mass flow can be stated as

$$m_i = v_i b_i h_i \qquad (1)$$

i.e. the velocity multiplied by the width and the height of a billet section i.

There are various challenges when rolling metal, associated with the properties of the final product or with the whole rolling process. Hence, optimization methods for e.g. minimizing the grain size of the finished product, or minimizing the power per production speed of the mill, are of interest.


Figure 2: Simplified rolling, where sub-indices describe properties before (1) and after (2) the stand. Index i indicates the disk.

3 Optimization theory

3.1 Definition

In a mathematical context optimization is about finding the maximum or minimum of a function. Given a vector of variables x, an objective function f(x) and constraints g_i(x), an optimization problem can be defined as

$$\begin{aligned}
\min_{x} \quad & f(x) \\
\text{subject to} \quad & g_i(x) = 0, \quad i \in \varepsilon \qquad (2)\\
& g_i(x) \leq 0, \quad i \in I
\end{aligned}$$

where I and ε are the index sets for the inequality and equality constraints, respectively. Bounds on x are a special type of linear inequality constraint and are included in g_i, i ∈ I [8].

Some further definitions are needed when discussing optimization problems.

Definition 1: A point x is said to be a feasible point if it fulfills all constraints g_i.

Definition 2: The set of all feasible points forms the feasible region, denoted D.

Definition 3: A feasible point, denoted x*, is called a local optimizer if f(x*) ≤ f(x) holds for all x in a feasible region confined by |x − x*| ≤ δ, δ > 0. The pair (x*, f(x*)), of the optimizer and the locally optimal objective function value, is referred to as the local optimum.


A simple example of two inequality constraints g_1, g_2 limiting the feasible region is presented in figure 3. The contour illustrates the value of the objective function f(x). In this example the minimum is the point in the feasible region closest to the contour center.

Figure 3: Representation of the feasible region restricted by the constraints.

When solving an optimization problem there exists no general solution strategy. Instead there are methods tailored to specific optimization problems. Common to all methods is that they are iterative, meaning that they begin with an initial guess and create improved estimates (iterates) until certain termination criteria are met.

3.2 Optimality conditions

An optimal point x is defined according to def. 3. To determine whether a point fulfills these criteria, intuitively, all points in a surrounding neighborhood of x should be examined, to see if all of them have a higher function value. In this section optimality conditions are presented for smooth functions which, when fulfilled, state that a point is a local optimizer x* of the objective function.

Definition 4: A function f is smooth if it belongs to the differentiability class C², meaning that the function is twice continuously differentiable.

For unconstrained optimization problems the two optimality conditions are:

First Order Necessary Condition: If x* is a local minimizer and f is continuously differentiable in an open neighborhood of x*, then ∇f(x*) = 0.

Second Order Necessary Condition: If x* is a local minimizer, and f and ∇²f are continuous in an open neighborhood of x*, then ∇f(x*) = 0 and ∇²f(x*) is positive semi-definite.


Constrained optimization problems require more sophisticated optimality conditions, defined through Lagrange multipliers. Lagrange multipliers are variables λ_i that combine the function f and the constraints g_i into a Lagrangian function L:

$$L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) \qquad (3)$$

The concept of Lagrange multipliers is based on the intuition that, at an optimum, f(x) cannot decrease when stepping within the neighborhood of points where g_i = 0; this is true when ∇f and ∇g are parallel.

To define general optimality conditions the Karush-Kuhn-Tucker (KKT) conditions are used [9]:

First Order Necessary Condition:

$$\nabla f(x) + \sum_{i \in \varepsilon \cup I} \lambda_i \nabla g_i(x) = 0 \qquad (4)$$

$$g_i(x) = 0, \; i \in \varepsilon, \qquad g_i(x) \leq 0, \; i \in I \qquad (5)$$

$$\lambda_i \geq 0, \; i \in I \qquad (6)$$

$$\lambda_i g_i(x) = 0, \; i \in I \qquad (7)$$

The KKT conditions state that the derivative of the Lagrangian function L (eq. 3), with respect to x, is equal to 0. The Lagrange multipliers for the inequality constraints need to be ≥ 0, since the gradients of f(x*) and g_i(x*) have opposite directions at x*. The complementary slackness condition (eq. 7) is required since the constraint terms in eq. 5 are zero in the set of possible solutions.

Second Order Necessary Condition: Let Z(x*) be a basis for the null space of ∇g(x*)^T. Then the second order necessary condition states that

$$Z(x^*)^T \, \nabla^2_{xx} L(x^*, \lambda^*) \, Z(x^*)$$

is positive semidefinite.

3.3 Global optimization

Finding the global minimum is equivalent to finding the smallest minimum among all local minima (Fig. 4).

A global optimum is defined as follows:

Definition 5: A global optimizer is defined analogously to a local optimizer (def. 3), but with the feasible region equal to the entire feasible set D.

The process of finding the global optimum is simple if the optimization problem is convex; then the local optimum will be equal to the global optimum.


However, if the problem is complex (non-linear and non-convex), the process of finding the global optimum will be more difficult, and proving that the found optimum is the global optimum is not as straightforward.

Definition convexity: A set of points S is said to be convex if the line segment between any two points (x, y) lies entirely within S:

$$\forall x, y \in S, \; \forall \alpha \in [0, 1]: \quad \alpha x + (1 - \alpha) y \in S$$

A function f is convex if its domain is a convex set and the following property holds:

$$f(\alpha x + (1 - \alpha) y) \leq \alpha f(x) + (1 - \alpha) f(y), \quad \forall \alpha \in [0, 1]$$

Figure 4: Representation of local vs global minimum.

3.4 MATLAB Optimization

In MATLAB there are two toolboxes that handle local and global optimization: the Optimization Toolbox and the Global Optimization Toolbox (GADS), respectively. They contain a variety of methods for solving a range of different optimization problems. For the non-linear, non-convex and constrained hot rolling problem, a selection of methods is presented in this section.

3.4.1 Local optimization method fmincon

The fmincon method finds a local minimum of a non-linear constrained optimization problem, starting from a specified initial point from which the method iteratively converges to the closest local optimum. The initial point is the starting guess from which the method steps (iterates) to better estimates. This is the only method from the Optimization Toolbox suitable for the Hot Rolling problem; for more information regarding the other methods available through the toolbox, see [10].

fmincon contains a collection of algorithms that govern how a local minimum is reached. The algorithms are Active-set (AS), SQP (sequential quadratic programming), Interior-point (IP) and Trust-Region-Reflective.


Common to all four is the use of gradients and Hessians as part of the optimization. Trust-Region-Reflective, as opposed to the other three algorithms, requires an analytic gradient ∇f; this is not available for the Hot Rolling problem, so the algorithm is not considered.

Active-set and SQP are similar in the sense that they both solve Quadratic Programming (QP) sub-problems. Quadratic Programming concerns optimizing a quadratic function:

$$q(d) = \frac{1}{2} d^T H d + c^T d$$

The function q(d) is a quadratic approximation of the change of the Lagrangian function L (eq. 3), where H is the Hessian of L and c the gradient. Minimizing the quadratic function yields the change d of the optimization variable x. Both algorithms iteratively solve a series of sub-problems, where every iteration consists of computing Hessians and gradients, and of applying line-search methods in order to find appropriate search directions towards the optimum.

The Interior-point algorithm attempts to solve the constrained minimization problem by adding a logarithmic term, called the barrier function, to the objective function. Slack variables are introduced to handle the inequality constraints and, together with the barrier function, they keep the points in every iteration within the feasible region.
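To make the use of fmincon concrete, the call below is a minimal sketch of a single local optimization run. It is illustrative rather than the ADM code: objFun, nonlcon, the bounds lb/ub and the initial point x0 are placeholder names, and option syntax can differ between MATLAB releases.

% Minimal sketch (illustrative, not the ADM code): one local optimization run
% with fmincon. objFun, nonlcon, lb, ub and x0 are placeholders.
options = optimoptions(@fmincon,'Algorithm','active-set');   % or 'sqp', 'interior-point'
[xOpt,fOpt] = fmincon(@objFun, x0, [], [], [], [], lb, ub, @nonlcon, options);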

3.4.2 Global Optimization method MultiStart

MultiStart is a method in the Global Optimization Toolbox. As the name implies, MultiStart generates multiple initial points that are then evaluated by a local optimization method. This is done by generating initial points stochastically, running the local method for each initial point, and selecting the smallest minimum among the found local minima. The choice of local method will be fmincon, but there are several documented alternatives [11].

3.4.3 Global Optimization method GlobalSearch

GlobalSearch, like MultiStart, utilizes a local optimization method in the process of finding the global optimum. However, GlobalSearch does not run the local method for all initial points. Instead it uses a more sophisticated technique where initial points are filtered before running the local method fmincon. This is done by scoring candidate points based on their objective function value and on the degree to which they violate constraints. Points with a score lower than a certain threshold, and that are not in the vicinity of points already assessed by fmincon, will be run by fmincon. GlobalSearch will therefore cover a larger search space than MultiStart when run in serial. For a detailed description of how GlobalSearch finds the global minimum, see [11].

3.4.4 Global Optimization method Patternsearch

Patternsearch is the name of a direct search method in MATLAB that attempts to find the global minimum without the use of gradients or Hessians.


It does this by directly computing the objective function and constraint values of points in a pattern.

The process is iterative, where every iteration considers a pattern surrounding the current smallest point x*. There are several points in the pattern, which are computed and compared to x*. If a point with a lower objective function value is found, it becomes the new x* in the next iteration and the pattern expands. If no point with a smaller value is found, the pattern contracts, with the same x* passed to the next iteration. The contraction and expansion properties are the reason why a global minimum can be found. Patternsearch supports several different poll algorithms that determine how the pattern should be created. For example, a pattern can be chosen to evaluate points in every component direction.

4 High Performance Computing theory

4.1 Parallelization

Parallelizing serial software allows the user to obtain the same results in less time, or to run larger problems in the same amount of time. Achieving this requires an analysis of the program to find portions of the work that can be done simultaneously.

The degree to which a program can be parallelized sets an upper bound on the obtainable execution time. For example, say a program takes 100 s to run in serial and 90 s of those 100 s can be done in parallel. Then, regardless of how efficient the parallelism is, it will always take 10 s to e.g. initialize the program. To understand the benefits and limits of parallelization, a quantified measurement of parallel performance is required.

4.1.1 Parallel performance metrics

Speedup S_p is a relative metric defined as the ratio between the execution time when the program is run in serial, T_s, and the execution time when run in parallel, T_p, where p is the number of cores:

$$S_p = \frac{T_s}{T_p} \qquad (8)$$

For example, a program takes 153 s to run on a single-core computer and 14 s on a 12-core system. The speedup is then 153/14 = 10.9, i.e. 10.9 times faster than the serial program.

Efficiency λ is another metric, closely linked to speedup. Instead of measuring how much faster a parallel version is compared to a serial version, efficiency measures the utilization degree of the computer resources (cores):

$$\lambda_p = \frac{S_p}{p} \qquad (9)$$


The metric is expressed as the speedup divided by the number of cores, where 1 means that 100% of the computer resources are utilized.
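As a small illustration (not taken from the thesis), the two metrics can be computed directly from measured runtimes; the numbers below reuse the example values above.

% Minimal sketch: speedup (eq. 8) and efficiency (eq. 9) from measured times.
p  = 12;           % number of cores used in the parallel run
Ts = 153;          % measured serial execution time [s]
Tp = 14;           % measured parallel execution time [s]

Sp     = Ts / Tp;  % speedup, eq. (8)
lambda = Sp / p;   % efficiency, eq. (9)
fprintf('Speedup: %.1f, efficiency: %.2f\n', Sp, lambda);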

4.1.2 Performance limitations

There always exists an upper limit on parallel performance. Using the speedup metric, this limit is described by Amdahl's law [12].

Theorem 1. Amdahl's law: The maximum speedup S_p, using p cores, is bounded above by the serial fraction of the program, denoted f:

$$S_p \leq \frac{T_s}{T_p} = \frac{T_s}{T_s \left( f + \frac{1-f}{p} \right)} = \frac{1}{f + \frac{1-f}{p}} \qquad (10)$$

The theorem states that the serial program can be divided into a serial fraction f and a parallel fraction (1 − f)/p. This implies that the serial fraction of an application determines the maximum obtainable speedup.

For example, imagine a program where the work is divided roughly 25% serial / 75% parallel (Fig. 5).

Figure 5: The serial fraction of the software imposes an upper limit on the parallel performance.

Then, according to Amdahl's law, the maximum speedup S_max will be equal to 4 when p grows large, since

$$S_{\max} = \lim_{p \to \infty} \frac{1}{f + \frac{1-f}{p}} = \frac{1}{f} = 4, \qquad f = 25\% = \frac{1}{4}$$

This could be a program that uses 25% of its execution time to initialize data before running parallel computations.
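A short, illustrative sketch of how the bound in eq. (10) behaves for this example as the number of cores grows:

% Minimal sketch: Amdahl's law (eq. 10) for a serial fraction f = 0.25.
f  = 0.25;                         % serial fraction of the program
p  = [1 2 4 8 16 32 64 128];       % number of cores
Sp = 1 ./ (f + (1 - f) ./ p);      % maximum attainable speedup
disp([p; Sp]);                     % Sp approaches 1/f = 4 as p grows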

4.1.2.1 Parallel delay

Many parallel applications tend to fall short of the analytical speedup limit. Delays are introduced when dividing and managing the work in parallel. These can be communication delays between the cores and the job scheduler, or delays due to uneven workload distribution among the cores. This makes it hard to predict how well a program will perform when parallelized. The following list presents a grouping of possible reasons for a delay:


• Communication: Parallel cores need to receive work and return results to the client. The communication delay grows with the size of the sent data and shrinks with the network bandwidth. Decreasing communication delays, by removing unnecessary data traffic from the software, will improve parallel performance.

• Load balancing: If a set of cores each receive one task, and the tasks take different amounts of time to compute, evidently the core with the largest task will limit the total execution time of the software. Improving load balancing, by introducing more tasks per worker, evens out the workload and keeps a higher number of cores utilized while running the software. Higher utilization of the computing resources equates to better parallel efficiency.

• General overhead: Dividing the serial work into parallel work on a single core, and various other initialization procedures, add delays not present in a serial program. In addition, running an application on a shared cluster can result in a race for computer resources, where delays occur while waiting for other jobs to finish.

4.1.3 MATLAB Parallelism

The Parallel Computing Toolbox (PCT) contains all necessary functions for implementing parallelism in MATLAB, and the Distributed Computing Server (MDCS) extends the parallelism to several computers in a cluster or in a computing cloud. MATLAB uses the concept of a worker, which is a separate MATLAB instance running on a single core. The worker has its own workspace and variables. There are many ways to make use of multiple workers, for example by calling parfor, batch or spmd within a matlabpool environment, or by launching jobs to a job scheduler through the createJob and createTask commands.

Every parallel section of a script needs access to a pool of workers, called a matlabpool. The pool is created by calling matlabpool:

matlabpool('open',4);
% Execution of parallel code
matlabpool('close');

This opens a pool with 4 workers, available for parallel computations. After the computations are done, the pool resource is closed. The PCT alone allows MATLAB developers to run up to 12 workers; if more workers are required, MDCS is necessary.

The most straightforward way to parallelize code in MATLAB is to replace for-loops with parfor-loops. parfor is a parallel for-loop that automatically slices the iterations and distributes them dynamically to the available workers.


matlabpool('open',8);
parfor i = 1:10
    result(i) = fmincon(@objectiveFunction, x_initPt(i), ...);
end
matlabpool('close');

The above code example distributes 10 runs of the local optimization method fmincon to 8 available workers. The workers are assigned iterations dynamically through the built-in job scheduler, improving load balancing between the workers.

SPMD (single program multiple data) is another option for performing parallel computations. Like parfor it requires a matlabpool, but it does not divide the work within the SPMD statement dynamically between the workers.

matlabpool('open',2);
spmd
    % labindex is the worker id
    if labindex == 1
        result = fmincon(@objectiveFunction1, x_init, ...);
    elseif labindex == 2
        result = fmincon(@objectiveFunction2, x_init, ...);
    end
end
matlabpool('close');
% Composite variable that contains result(1) and result(2)
result(:);

Instead, all code within an SPMD statement is executed on every worker. The labindex variable can be used to manually divide data sets or tasks between workers using their indices (labindex). The above script shows how the workers evaluate one objective function each and store the result of the computation in the variable result.

Batch is a command for offloading scripts with parallel computations from the client MATLAB session. The command does not require a matlabpool to be opened in advance; however, batch can start up a pool on the worker receiving the script.

%clientScript.m
job = batch('parallelScript','Pool',8); % Non-blocking
% Do other work...
% Require the parallel results before continuing.
wait(job);        % Blocking command
load(job,'res');  % Fetch variable computed in the script
delete(job);

%parallelScript.m
parfor i = 1:12
    res(i,:) = fmincon(@objectiveFunction, x_init(i,:), ...);
end


This approach reduces the number of available workers by 1, since one worker will execute the script sent by the client session.

Noteworthy is that MATLAB does not support nested parallelism, which means that fmincon, when run within e.g. a parfor-loop, cannot parallelize its internal gradient computations. For information on how to implement parallelization with fmincon, see [10].

4.1.3.1 fmincon and GlobalSearch parallelism

The local optimization method fmincon supports parallelization of the gradient computations:

$$\nabla f(x) = \left[ \underbrace{\frac{f(x + \Delta_1 e_1) - f(x)}{\Delta_1}}_{\text{Worker}}, \; \underbrace{\frac{f(x + \Delta_2 e_2) - f(x)}{\Delta_2}}_{\text{Worker}}, \; \cdots, \; \underbrace{\frac{f(x + \Delta_n e_n) - f(x)}{\Delta_n}}_{\text{Worker}} \right]$$

where f is the objective function, e_i is the unit vector for component i, and Δ_i is the step size in direction e_i.

The components of the gradient are distributed to a set of workers that perform the computations simultaneously. This type of parallelization is effective when the time to evaluate the objective and constraint functions is considerably larger than the inherent distribution time, where distribution refers to communicating input data and returning results from the workers in every iteration of the main loop of fmincon.

GlobalSearch has no native parallelism due to its iteration dependency: the solver evaluates new candidate points based on the previous iteration's fmincon results (Sec. 3.4.3). However, there is still the possibility of using fmincon's parallelism to improve the execution time of GlobalSearch. This parallelizes the parts of GlobalSearch containing fmincon, but keeps the serial evaluation of candidate points.
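A minimal sketch of how this can be set up is shown below; it is illustrative rather than the ADM implementation. The function handles (objFun, nonlcon), the bounds and x0 are placeholders, and the exact option values (e.g. UseParallel taking true or 'always') depend on the MATLAB release.

% Minimal sketch (illustrative): GlobalSearch reusing fmincon's parallel
% finite-difference gradients. objFun, nonlcon, x0, lb and ub are placeholders.
matlabpool('open',8);                          % newer releases: parpool(8)

opts = optimoptions(@fmincon, ...
    'Algorithm','active-set', ...
    'UseParallel',true);                       % gradient components computed by workers

problem = createOptimProblem('fmincon', ...
    'objective',@objFun, 'x0',x0, ...
    'lb',lb, 'ub',ub, 'nonlcon',@nonlcon, ...
    'options',opts);

gs = GlobalSearch;                             % candidate points are still filtered serially
[xBest,fBest] = run(gs,problem);

matlabpool('close');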


4.1.3.2 MultiStart parallelism

For MultiStart, the number of initial points constitutes the majority of the application's workload. Several fmincon runs, with different initial points, can be computed independently, which makes MultiStart well suited for parallelization. MultiStart uses a parfor-loop that contains a call to fmincon. The iteration count is equal to the number of initial points to be evaluated. When workers become available, parfor dynamically distributes new initial points to the free workers.

Distributing points dynamically helps to load balance the work between the workers. Computing local optima from different initial points results in varying execution times; without dynamic work balancing, the workers that finish early would simply wait until the rest are done. How many points, and which points, a worker will receive varies between runs and is not known in advance.
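A corresponding sketch for MultiStart, again illustrative and with placeholder problem functions; older releases use 'always' instead of true for UseParallel.

% Minimal sketch (illustrative): MultiStart distributing fmincon runs over workers.
matlabpool('open',8);                          % newer releases: parpool(8)

problem = createOptimProblem('fmincon', ...
    'objective',@objFun, 'x0',x0, ...
    'lb',lb, 'ub',ub, 'nonlcon',@nonlcon, ...
    'options',optimoptions(@fmincon,'Algorithm','active-set'));

ms = MultiStart('UseParallel',true);           % initial points handed out via parfor
[xBest,fBest] = run(ms,problem,100);           % 100 stochastic initial points

matlabpool('close');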

4.1.3.3 Patternsearch Parallelism

Direct search methods, like Patternsearch in MATLAB, can utilize parallelism for computing the objective and constraint functions for every point in the pattern. There are several patterns available; the pattern used in this report is called GPSPositiveBasis2N: the points in the pattern are located in the positive and negative component directions of the vector x.

Patternsearch also contains a parfor-loop. All points within the mesh are dynamically distributed to the workers in every iteration (Fig. 6).

Figure 6: Patternsearch evaluates points in the current estimate’s vicinity.
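A sketch of a parallel patternsearch call, under the same placeholder assumptions as above (objFun, nonlcon, bounds and x0); newer releases set these options with optimoptions and logical values instead of psoptimset.

% Minimal sketch (illustrative): patternsearch evaluating the pattern points in parallel.
matlabpool('open',8);                            % newer releases: parpool(8)

psOpts = psoptimset('UseParallel','always', ...  % evaluate poll points on the workers
                    'CompletePoll','on', ...     % poll all points so they can be distributed
                    'PollMethod','GPSPositiveBasis2N');

[xBest,fBest] = patternsearch(@objFun, x0, [],[],[],[], lb, ub, @nonlcon, psOpts);

matlabpool('close');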


4.2 Cloud Computing

Cloud Computing is a term widely used in the media and is often associated with services available to the public, like Dropbox, Google Drive or Microsoft 365. A proper definition of Cloud Computing is given by the National Institute of Standards and Technology (NIST) [13].

Cloud definition: Cloud computing is a model for highly accessible, on-demand computer resources (e.g. networks, storage, servers and services) that can be provisioned and released by a customer with minimal effort. The definition contains 5 essential characteristics:

1. On-demand self-service. A user can single-handedly provision computer resources without requiring human interaction with the service provider.

2. Broad network access. Services are available over the network and are supported by a wide range of platforms (e.g. laptops, mobile phones, thin clients and workstations).

3. Resource pooling. Computing resources are viewed as a pool serving several users in a multi-tenant model. Physical and virtual resources are dynamically provisioned according to the user's demand, without the user necessarily knowing the exact location of the resources.

4. Rapid elasticity. Services can be elastically provisioned and released, sometimes automatically, to meet increasing or decreasing capacity demands.

5. Measured service. Used resources can be monitored, providing transparency for both customer and provider.

The model revolves around the concept of services, where the customer pays for computer resources when they are utilized (a service), and not for the actual hardware (goods); this differentiates a cloud infrastructure from e.g. a local cluster [14, p.4-5]. There are three service models associated with cloud computing: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).

Infrastructure as a Service (IaaS): IaaS enables the customer to provision virtual machines (VMs), storage and networks to configure as they wish. For example, a cluster could be created in the cloud by grouping virtual machines, where the service would be the utilization of the cluster. Providers are e.g. Amazon EC2 and Rackspace.

Platform as a Service (PaaS): PaaS offers the customer a platform for running their own applications. Here the service consists of delivering a software environment (operating system, libraries, etc.) in which the customers can modify and run their own applications. An example would be Google App Engine.

Software as a Service (SaaS): SaaS is the highest abstraction between hardware and customer. Here the service consists of delivering an application.


A typical SaaS application would be a mail client like Gmail, providing computer resources indirectly through the software.

4.2.1 Infrastructure

Cloud infrastructure revolves around distributing computer resources without dedicating physical hardware to a single tenant, a technique called virtualization. This is made possible by creating a virtual layer above the physical resources. The customer is provisioned Virtual Machines (VMs) that have access to a specified amount of computer resources [14, p.7]. VMs can be configured to be grouped or simply run stand-alone, as seen in figure 7.

Figure 7: Cloud services offer computer resources through Virtual Machines that have access to physical resources like storage (HD), computing power, etc.

The infrastructure gives rise to the characteristics of Cloud Computing addressed earlier, but there are also inherent problems. Heterogeneous resources, meaning that the underlying hardware will most likely vary between VMs, can have an impact on performance. Providers initially starting with a homogeneous infrastructure will eventually end up with a heterogeneous system, caused by gradual upgrades of hardware, as was the case for instances (VMs) in Amazon EC2 in 2012 [15]. This could potentially affect the performance of parallel applications, especially those utilizing several VMs in a cluster.

There are further issues concerning the use of cloud services that fall into the data security and data management categories. Moving a scientific application outside the closed environment of a company and into the cloud potentially exposes confidential models and data to the public. There is also the issue of losing data due to hardware failures, legislation, hacking attempts, etc. Appendix A contains a thorough study, conducted as a part of this work, that assesses storage and security concerns both for Cloud Computing in general and for Amazon EC2 in particular.

In summary, Cloud Computing aims to offer computing resources as a utility available at any time, anywhere, removing the need for a customer to set up and manage hardware.


Providers offer highly scalable, provisioned resources that can quickly cope with varying workloads. The different service models allow a wider utilization of the resources, ranging from individual PC users storing data on Dropbox, to companies running grid-dependent scientific HPC applications.

5 The Software - Analysis

The hot rolling software in this report is called the Adaptive Dimension Model (ADM) and is developed at ABB. Prior to this report the application could perform global optimization on a set of objective functions using the global optimization solvers MultiStart, GlobalSearch and Patternsearch, as described in [16].

In this section a thorough analysis of the ADM is performed. This includes identifying and understanding performance bottlenecks and analyzing the optimization problem (convergence and discretization). Fully understanding the computational nature of the software is vital before implementing parallelism (Sect. 4.1).

5.1 The model and its performance

The ADM is categorized as a constrained, non-linear and non-convex optimization problem, with a 33-dimensional vector x and a set of objective functions f(x). The optimization vector x governs the tensions on the billet (between stands), the gap sizes (between rolls), etc. Table 1 specifies all 33 components of x.

Table 1: The physical meaning of the components in x.

  Component   Description
  1           Entry speed of billet
  2-12        Inter-pass tension on metal
  13-22       Gap size between rolls
  23-32       Velocity of rolls
  33          Billet temperature

There are several objective functions that describe different quantities of the model. Two objective functions will be of special interest in this report, namely objective functions 71 and 63, from here on denoted obj. 71 and obj. 63. For certain parallel implementations of the ADM more objective functions will be used. Table 2 describes the objective functions.


Table 2: List of objective functions used.

  Objective Fun.   Description
  Obj. 71          Grain size of metal in the end product
  Obj. 67          Exit temperature of metal
  Obj. 64          Rolling power targets
  Obj. 63          Specific power (power/production speed)

At its core the ADM computes temperature distributions and deformations of a billet passing through the stands of the rolling process. Analyzing which functions contribute the most to the software's execution time is important when dividing tasks in parallel. MATLAB has a built-in profiling tool tailored for this task; profiling tools measure the execution time of all functions and child functions of a program [17].

The profiling results for the ADM indicate that the majority of the execution time is spent in functions computing the temperature distribution. Figure 8 is a screenshot from the profiler, clearly identifying three computationally heavy functions: JE extract NEW, JE termo NEW and linspace.

Figure 8: Profiling results of the ADM. Three functions constitute the majority of the total execution time.

The function JE extract NEW() prepares data structures for JE termo NEW(), which then performs the temperature computations. linspace is a part of both functions and is used to create linearly spaced vectors and matrices. As seen in figure 8, linspace is called 1.4 million times, stressing the memory of the system.
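The profiling itself can be reproduced with a few commands; the sketch below is illustrative, with runADM as a placeholder for the call that starts a serial optimization run.

% Minimal sketch (illustrative): profiling one serial run of the software.
profile on
runADM();                  % placeholder for a serial ADM/fmincon run
profile off
profile viewer             % interactive report, as in figure 8
stats = profile('info');   % programmatic access to per-function timings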


5.2 Code improvements

Improving the code will naturally result in shorter execution times, but it will also determine the efficiency of the parallel program. The serial part of the program limits the obtainable parallel performance (Theor. 1), so making the serial part as fast as possible is preferred. In this section three significant improvements are applied to the ADM: improving memory efficiency, removing redundant computations, and MEXing critical functions.

5.2.1 Memory efficiency

Memory can roughly be divided into cache (on-chip memory) and RAM (physical memory). The cache is the faster of the two and requests data from the RAM. The cache performs best when contiguous blocks of data are loaded from the RAM, completely filling it. However, when the data does not fit into the cache, or if the data is spread out in RAM, the utilization of the cache drops and limits the performance.

Code suffering from memory bottlenecks can be improved in several ways. MATLAB is designed to store data in column-major order, meaning that the elements of a column are stored contiguously in memory, while the elements of a row of a matrix are spread out in memory. There are also vectorized operations that efficiently load and compute entire vectors at once [18]. A simple demonstration of non-vectorized code versus vectorized code:

% Non-vectorized
for i = 1:1e06
    data(i) = sin(x(i))*B(i);
end
% Vectorized - the .* flags component-wise (.) multiplication (*).
data = sin(x).*B;

The execution time of the above code was 1 second for the vectorized version and 50 seconds for the non-vectorized version; the vectorized code was thus 50x faster. This clearly illustrates how important efficient memory utilization is.

5.2.2 Redundant computations

When solving the ADM optimization problem in MATLAB, the computationally heavy objective function and constraint function are called in succession multiple times. Both functions compute the rolling model, which constitutes a significant part of their execution time. Since both functions are called for every new input vector x, the rolling model is computed twice for the same x. Using global variables in MATLAB, the redundant computations can be avoided by sharing the model data between the functions.

A variable in MATLAB defined as global is available in every function. This is useful when the amount of data is large and constant between calls, or if the data is needed later in the code. Applying this concept to the ADM effectively reduces the execution time by roughly 50% regardless of the optimization method (Fig. 9).



Figure 9: Removing redundant computations with global variables cuts the execution time of the ADM by 50%. Here the local method fmincon confirms this.

Another benefit of global variables is that less time is spent on communicating data. This is important when working with parallel computations, where workers require data supplied by the host. Severe communication delays quickly develop if a lot of data is sent back and forth. Sending the initial data once, and storing it in global variables, reduces the communication time.
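A minimal sketch of this caching pattern is given below. It is illustrative rather than the ADM code: computeRollingModel and the field names are placeholders for the real model functions.

% Minimal sketch (illustrative, not the ADM code): sharing the expensive model
% computation between the objective and constraint functions via a global variable.
function model = getModel(x)
    global modelCache
    if isempty(modelCache) || ~isequal(modelCache.x, x)
        modelCache.x    = x;
        modelCache.data = computeRollingModel(x);   % heavy call, now done once per x
    end
    model = modelCache.data;
end

function f = objFun(x)
    model = getModel(x);     % reuses the cached model if already computed for this x
    f = model.grainSize;     % e.g. obj. 71
end

function [c,ceq] = nonlcon(x)
    model = getModel(x);
    c   = model.cineq;       % non-linear inequality constraints
    ceq = model.ceq;         % non-linear equality constraints
end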

Improving the memory efficiency of the ADM and removing redundant computations resulted in an execution time nearly 4 times faster than the original code. The profiling results shown in figure 10 are greatly reduced compared to figure 8.

Figure 10: Profiling results of the improved ADM. The top 5 most time-consuming functions, where the self-time is listed in the last column.

5.2.3 MEX-Files

Finally, there is the possibility to create MEX (MATLAB Executable) files for computationally demanding functions of a program. To MEX a function means replacing a MATLAB function with a compiled C/C++ or Fortran version of the same function [19]. Other languages like C can offer a speedup of the software, especially when parts of the code cannot be vectorized.

The ADM contains AD termo prep NEW, which calls both JE termo NEW and JE extract NEW. Parts of these functions cannot be vectorized, for example loops where computations require data from previous iterations. This makes AD termo prep NEW a good candidate to be MEXed. The C language and the C compiler will improve the non-vectorized parts of the code. The compiler's optimization flags can do automatic inlining of functions and loop unrolling to improve the utilization of memory and CPU. MEXing is done through MATLAB and uses the compilation flags by default. For more information regarding C compiler optimization, see [20].
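A sketch of the MEX workflow is shown below; the file name is a placeholder for a C port of the function discussed above, and only the default compiler flags are assumed.

% Minimal sketch (illustrative): compiling a C implementation of a hot-spot
% function into a MEX file and calling it like a normal MATLAB function.
mex -O AD_termo_prep_NEW_mex.c        % -O keeps the compiler's optimization flags enabled

% After compilation the MEX function replaces the original call, e.g.:
% out = AD_termo_prep_NEW_mex(modelData);   % modelData is a placeholder input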

The final profiling of the improved ADM, now also with MEX-files, is seen in figure 11. Compared with the original code the execution time is reduced roughly 13 times.

Figure 11: Final profiling of the ADM with MEX, showing the three most time-consuming functions.

5.3 Parameter study

This study contains two parts. The first part concerns the discretization of the temperature field and how it affects the found global minimum and the execution time of the ADM. The second part consists of investigating how different initial points converge towards a single minimum or multiple minima. This gives insight into how different initial points affect the execution time of the model and what to expect when evaluating many points.

5.3.1 Discretization of the temperature field

Choosing the appropriate degree of discretization is important for the accuracy and the execution time of the software.

28

Page 29: High Performance Optimization on Cloud for a Metal Process ...uu.diva-portal.org/smash/get/diva2:732246/FULLTEXT03.pdf · High Performance Optimization on Cloud for a Metal Process

dividing the billet into two zones called the deformation and interpass zones.Every zone consists of a set of disks that represent a part of the complete billet.A disk is modelled as a circle with polar coordinates. FdJ is a discretizationparameter that divides the radial distance into fdJ computing points, as seenin figure 12.

Figure 12: The discretization of the temperature field is constructed by dividingthe billet into two zones, with a number of disks each.

The second important discretization parameter is called fdM and sets how manytime steps that are performed. So in essence fdJ and fdM constitute a compu-tation grid of the size (fdJ,fdM ) for computing the temperature field.

Performing parameter sweeps of fdJ and fdM when running the local optimiza-tion method fmincon, for the same feasible inital point x, will give informationabout the accuracy of the solution and the execution time. Results are presentedin Appendix B.1.
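A minimal sketch of such a sweep, assuming a hypothetical wrapper runADM(x0, fdJ, fdM) that sets the discretization, runs fmincon from the feasible point x0 and returns the obtained objective value; the sweep ranges are placeholders:

fdJvals = 5:2:15;                      % assumed range for the radial discretization
fdMvals = 2:2:10;                      % assumed range for the number of time steps
fopt  = zeros(numel(fdJvals), numel(fdMvals));
tExec = zeros(numel(fdJvals), numel(fdMvals));

for i = 1:numel(fdJvals)
    for j = 1:numel(fdMvals)
        tic;
        fopt(i,j)  = runADM(x0, fdJvals(i), fdMvals(j));  % hypothetical wrapper
        tExec(i,j) = toc;
    end
end
% fopt shows how the found minimum changes with the grid; tExec the cost of refining it.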

5.3.2 Convergence study

Stressing the ADM by running several random initial points gives interesting information about how the execution time will behave for MultiStart, GlobalSearch or Patternsearch. This can be investigated by creating a large set of initial points that are randomly distributed between the bounds of x.

The input variable x is a 33-dimensional vector limited by an upper bound and a lower bound. The vector converges toward the minimum in a finite number of iterations. Figure 13 illustrates the bounds on x and the iterative process for a single run of fmincon with obj. 63.

The green line in figure 13 illustrates the initial point sent to fmincon. By generating new initial points at random within the defined bounds, the model is stressed. The results show how many iterations are required on average to reach a local minimum, which local minima exist, and the distribution of which points reach which value. The summarized conclusions are stated below. For detailed results, see Appendix B.2.


Figure 13: The components of x are plotted for every iteration of fmincon. The upper and lower bounds are dotted and dashed, respectively.
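A minimal sketch of how such a stress test could be set up, assuming 33-component bound vectors lb and ub and a hypothetical helper runSingleFmincon that wraps one fmincon run on the ADM and reports the result:

nPoints = 200;                                   % assumed size of the test set
nDim    = 33;
% Random initial points, uniformly distributed between the bounds.
X0 = repmat(lb(:)', nPoints, 1) + ...
     rand(nPoints, nDim) .* repmat(ub(:)' - lb(:)', nPoints, 1);

fvals  = nan(nPoints, 1);
iters  = nan(nPoints, 1);
failed = false(nPoints, 1);
for k = 1:nPoints
    try
        [fvals(k), iters(k)] = runSingleFmincon(X0(k,:));  % hypothetical wrapper
    catch
        failed(k) = true;    % counts the points that hit non-physical results
    end
end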

5.3.3 Conclusions

The following lists conclude the study of discretization and convergence, describing highlights and discoveries that are important for the parallelization of the ADM. The conclusions are based on the results in Appendices B.1 and B.2.

Conclusions from the discretization study:

• The discretization characteristics of the ADM are known and satisfactory values for fdJ, fdM and the number of disks have been chosen for the parallel tests. Values: fdJ = 11, fdM = 6, 2 disks in the deformation zone, and 60 disks in the interpass zone. Increasing the discretization parameters beyond these values results in a small increase in accuracy and a large increase in execution time.

• Optimization algorithms (SQP, Active-set, Interior-point) result in different execution times of fmincon, where Active-set is the fastest. Interior-point is removed from further studies because of poor accuracy in finding the minimum and long execution times.

• Objective functions take different times to compute. Obj. 71 requires fewer iterations to find the minimum than obj. 63; the temperature field calculations are equally hard to compute, but obj. 63 requires more evaluations.

Conclusions from the convergence study:

• An initial point fails roughly 15-20 % of the time. Randomly distributed initial points, within bounds, fail when non-physical results are encountered while optimizing.

• The choice of algorithm affects the obtained solutions. SQP has a higher probability than Active-set of finding the global minimum; this is true for both obj. 71 and obj. 63. The difference in the probability of finding the global minimum is significant for obj. 63 (AS: 29 % compared to SQP: 85 %). SQP is also more accurate than Active-set in finding the global minimum.


• Different initial points result in different solution times, mainly due to the number of iterations required for every initial point. For certain initial points, obj. 71 can spike in the number of iterations required to halt fmincon, and the solution obtained is then usually far from the global minimum. Obj. 63 has no 'spikes' but has a higher average number of iterations.

• Only one local minimum was found for each objective function. This could question the use of global optimization. However, global optimization is still important since there exist multiple 'problem points', i.e. points that partly converge to another objective value or that fail altogether. Also, this covers only two objective functions out of several available within the ADM; another objective function could have several local minima.

• The occurrence of 'problem points' greatly affects the workload of all global solvers. These points significantly extend the execution time of the software, making it harder to load-balance when parallelizing.

5.4 Software Parallelism

In this section the parallelism of the software is implemented and tested on ABB's cluster Leo (Sect. 6). The results help analyze the obtained parallel performance of the software and serve as an important reference when comparing results obtained on the Amazon EC2 cloud (Sect. 6).

5.4.1 The workload of the software

The workload of a software is an important quantity when discussing parallelism. It is correlated to execution time in the sense that a larger workload equals a longer execution time. Depending on which optimization method is chosen, the composition of the workload can vary greatly (Fig. 14).

Figure 14: The workload for the complete ADM software with respect to different optimization methods.

Regardless of which optimization method is used, there will be a constant initialization and evaluation part in the software. These parts are included in the serial fraction of the software (Fig. 5). The size of the workload depends on the choice of optimization method and on what parameters the method uses. For example, for MultiStart the number of initial points affects the workload.


Workload composition

• fmincon: Finds a local minimum for a single initial point. The major part of the work is spent on computing the objective function and constraint function (Fig. 11) many times until convergence.

• MultiStart: Every initial point is run by fmincon, hence the workload is roughly the number of initial points multiplied by the execution time of one fmincon run. However, the convergence study (Sect. 5.3.2) shows that the execution time of a single fmincon run can vary significantly.

• GlobalSearch: The workload is a combination of searching for candidate points and running fmincon. The search strategy is stochastic, which makes it hard to predict the number of fmincon runs, and hence the workload. The size of the workload largely depends on the number of fmincon runs.

• Patternsearch: The workload consists of the number of objective and constraint function evaluations required for convergence. The number of evaluations is related to the choice of pattern and the search strategy.

5.4.2 Implementation of parallelism

Where to introduce parallelism in the software depends on the optimization method and the number of objective functions that are of interest. In this section every method's parallelism is illustrated and tested on ABB's cluster Leo (Sect. 6).

A set of tests is constructed to measure how well the optimization methods scale with increasing parallel resources. Since the parallel performance is affected by the composition of the workload, and the workload depends on optimization characteristics, all tests include different objective functions and algorithms (Sect. 5.3.3). All tests are performed 5 times, averaged, and presented in this section using parallel performance metrics (Sect. 4.1.1).

5.4.2.1 fmincon and GlobalSearch

fmincon, and hence also GlobalSearch, supports the parallelization of gradient computations (Sect. 4.1.3.1). This means parts of fmincon's workload are divided among several Workers (Fig. 15).

Figure 15: fmincon and globalsearch can distribute the computations of gradient components to several workers, while converging toward a minimum.
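A minimal sketch of how this one-level parallelism could be switched on, assuming the admObjective/admConstraints handles and bound vectors lb, ub, x0 from the earlier sketches; exact option names and values differ somewhat between MATLAB releases:

matlabpool open 12        % thesis-era command; later releases use parpool(12)

opts = optimoptions('fmincon', ...
    'Algorithm',   'sqp', ...
    'UseParallel', true);                 % gradient components estimated by the workers

[xOpt, fval] = fmincon(@admObjective, x0, [], [], [], [], lb, ub, ...
                       @admConstraints, opts);

% GlobalSearch reuses the same fmincon options through a problem structure:
problem = createOptimProblem('fmincon', 'objective', @admObjective, ...
    'x0', x0, 'lb', lb, 'ub', ub, 'nonlcon', @admConstraints, 'options', opts);
gs = GlobalSearch;
[xGS, fGS] = run(gs, problem);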


Figure 16: Results from running fmincon for the same feasible initial point x, while varying test parameters: algorithm, obj. fun. and number of workers.

Figure 17: Results for GlobalSearch running 500 candidate points while varying test parameters: algorithm, obj. fun. and number of workers.

5.4.2.2 MultiStart

MultiStart distributes the initial points, and hence a large part of the software workload, dynamically to workers (Sect. 4.1.3.2). This allows several fmincon runs to execute simultaneously (Fig. 18).


Figure 18: Parallelism by running several fmincon at once.

Figure 19: Results when running MultiStart with 24 initial points.
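A corresponding sketch for MultiStart, reusing the hypothetical problem structure from the previous sketch; 24 start points match the test in figure 19:

ms = MultiStart('UseParallel', true);                   % initial points distributed to workers

startPts = RandomStartPointSet('NumStartPoints', 24);   % 24 random points within the bounds

[xBest, fBest, flag, output, solutions] = run(ms, problem, startPts);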

5.4.2.3 Patternsearch

Patternsearch iterates towards a global minimum by evaluating patterns with points in every positive and negative component direction. In every point the computation of the objective and constraint functions is an independent task, hence the mesh evaluation can be divided among workers (Fig. 20). For more information on patternsearch parallelism, see section 4.1.3.3.
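A minimal sketch of a parallel patternsearch call under the same assumptions; in releases of that era the options were typically set with psoptimset, and complete polling is required for the poll points to be evaluated in parallel:

psOpts = psoptimset('UseParallel', true, ...   % 'always' in some older releases
                    'CompletePoll', 'on', ...  % evaluate all 2*33 poll points each iteration
                    'MaxIter',      100);      % matches the 100-iteration test

[xPS, fPS] = patternsearch(@admObjective, x0, [], [], [], [], lb, ub, ...
                           @admConstraints, psOpts);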

5.4.2.4 Performance analysis

The parallel performance of the software varies significantly for different optimization methods and algorithms, where the amount of parallel work in relation to the serial work is important. By measuring the serial execution time of the ADM (Fig. 14) and the execution time of the optimization method, a naive upper bound for the speedup can be calculated (Eq. 10).

In table 3 the obtained speedup for each method is compared with its theoretical maximum speedup. The listed percentages represent the fraction of the total execution time executed by the optimization methods.


Figure 20: One level parallelism of Patternsearch.

Figure 21: Results when running Patternsearch, limited to 100 iterations.

Table 3: Actual speedup compared with the theoretical speedup limit when using 11 workers. Data is based on Figures 16, 17, 19, 21 (obj. 71 and SQP).

Method          % of runtime   Theor. Speedup   Act. Speedup   Efficiency
fmincon         69.6 %         2.7x             1.9x           17.3 %
multistart      98.4 %         9.5x             6.1x           55.5 %
globalsearch    98.9 %         9.9x             3.9x           35.5 %
patternsearch   99.4 %         10.4x            5.8x           52.7 %

Since the methods implement parallelism differently, and contain serial parts, the theoretical speedup is called naive and reflects the maximum speedup if the method could be completely parallelized.
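Assuming Eq. 10 is the standard Amdahl bound, the naive limits in Table 3 follow directly from the listed runtime fractions. With p the fraction of the runtime spent in the method and N = 11 workers,

S_naive = 1 / ((1 - p) + p/N),      Efficiency = S_actual / N.

For fmincon, p = 0.696 gives S_naive = 1 / (0.304 + 0.696/11) ≈ 2.7, and the measured speedup of 1.9 on 11 workers corresponds to an efficiency of 1.9/11 ≈ 17.3 %; the other rows of the table follow in the same way.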

The methods MultiStart and Patternsearch scale better than the other methods with an increasing number of workers. Both methods constitute the major part of the total execution time of the software and therefore have large theoretical speedups. Also, these methods are parallelizable to a greater degree than fmincon and globalsearch, since their efficiency is higher. The combination of large workloads and efficient parallelism explains the obtained results.

The optimization characteristics affect the parallel performance of the methods. It was concluded from the parameter study (Sect. 5.3.3) that obj. 63 requires more iterations than obj. 71 when running fmincon. This affects the execution time of the methods, but not the execution time of the remaining software. Hence, the serial fraction is smaller for obj. 63 than for obj. 71 when running fmincon. The results in figures 16 and 19 show that the speedup is consistently larger for obj. 63 than for obj. 71, regardless of algorithm. Globalsearch shows the same tendencies, but the speedup is influenced by the algorithm to a large degree (Fig. 17); fmincon constitutes a smaller part of globalsearch's total workload.

The choice between SQP and Active-set (Sect. 3.4.1) affects the execution time of a single fmincon run. It has been shown in [3] that the parallel fraction of fmincon depends on the choice of algorithm: fmincon with SQP has a parallel fraction of 48.7 % and with Active-set 87.2 %. This means that Active-set should have a larger speedup than SQP when run in parallel. This holds true for multistart (Fig. 19), but not for a single run of fmincon (Fig. 16) or globalsearch (Fig. 17). The reason could be that SQP has, on average, longer execution times than Active-set.

5.4.3 Parallel limitations

The theoretical limits presented in the previous subsection are, as already stated, naive. There are several causes lowering the obtained speedup, such as communication delays and load balancing issues. This section addresses some of them by stressing the ABB cluster's resources (network, CPUs, hard drives) through an increasing number of MATLAB Workers.

5.4.3.1 Parallel Overhead

Overhead caused by increased network communication, allocation of more nodes and the initialization of workers can drastically degrade parallel performance (Sect. 4.1.2.1). A test demonstrating the sharing of cluster resources among several users on the ABB cluster is constructed. The results show that the start-up time for the matlabpool (allocation of parallel resources) worsens with an increasing number of provisioned workers (Fig. 22). Changing to a dedicated node removes the competition for resources (Fig. 23).


Figure 22: Shared node with a high number of users. Stacked bars illustrating the execution time for different parts of the ADM.

Figure 23: Dedicated node. Stacked bars illustrating the execution time for different parts of the ADM.

Network communication adds extra time to the overall execution time. The ratio between the time to send data and the worker's runtime needs to be low for efficient parallelization. Considering how the available methods use parallelism (Sect. 4.1.3.1), multistart is the method with the lowest ratio. Here the time spent sending data is insignificant compared to running a complete fmincon run on the worker. For the other methods, every communication results in only a couple of evaluations of the objective and constraint functions. When increasing the number of available workers to 40, network communication becomes dominant for fmincon, globalsearch and patternsearch (Fig. 24), but not for multistart (Fig. 25). This manifests as a decrease in speedup, which occurs in the range of 20-40 Workers.


Figure 24: Methods fmincon, globalsearch and patternsearch for a larger number of workers.

Figure 25: Multistart for different numbers of initial points (pts) and for a larger number of workers.

5.4.3.2 Granularity

The number of smaller, independent parts that a method's workload can be divided into is considered the granularity of the method. Parallel performance does not improve when using more workers than the method's granularity. For example, running multistart with 24 initial points limits the useful number of workers to 24. Ignoring the granularity of the method results in a speedup plateau where idle workers exist but do not contribute to the optimization. Figure 25 illustrates this situation when using 24 initial points. The methods have different granularities, presented in table 4.


Table 4: Granularity of the different methods. The optimization vector x contains 33 components.

Method          Granularity          Description
fmincon         33                   1 component of x per worker (gradient comp.)
multistart      num. of init. pts    1 initial point per worker
globalsearch    33                   Same as fmincon
patternsearch   66 (2*33)            Pattern consists of 33*2 points (GPSPositiveBasis2N)

A plateau occurs for all methods when the number of workers exceeds the granularity; tendencies of this are seen for fmincon and globalsearch in figure 24 (33 workers and above).

5.4.3.3 Load balancing

In all methods a parfor-loop attempts to load balance work to the workers through dynamic distribution. The aim is to keep workers utilized as much as possible during the total execution of the software. Proper load balancing equates to parallel efficiency.

Increasing the number of workers for a method, while keeping the number of initial points per worker constant, is called weak scaling; dividing the parallel time by the serial time illustrates how efficiently the method scales. For multistart, a weak scaling test assigning 2, 4 or 6 initial points per worker shows the impact of load balancing (Fig. 26).

Figure 26: Running Multistart where the number of initial points per worker is constant. An increase on the y-axis represents a % drop in efficiency.
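Written out (a standard weak-scaling measure, assumed here to be the convention used in figure 26), the plotted quantity is

E_weak = T_N(N · w) / T_1(w),

where w is the number of initial points per worker and T_N is the runtime on N workers; a value of 1 means perfect weak scaling, and every increase above 1 corresponds to the stated drop in efficiency.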

It is known from the parameter study (Sect. 5.3.3) that obj. 71 for certain points can exhibit fmincon runs with high iteration counts. This, intuitively, should affect the load balancing, since certain workers will receive considerably higher workloads than the rest. This is confirmed in figure 26, where several troublesome points worsen the load balancing with obj. 71. Obj. 63 has a more even average iteration count per fmincon run and hence better efficiency.

Increasing the ratio between the number of initial points to be solved and the total number of workers allows the parfor-loop to utilize the workers for a larger fraction of the parallel execution time.

5.5 Summary of analysis

The optimization methods exhibit different parallel performance, where MultiStart is the recommended method. Its good performance is due to the small ratio of serial to parallel parts, and to its granularity being equal to the number of initial points. The obtained performance is also affected by which objective function is optimized and which algorithm is used. For example, obj. 71 contains 'spikes' (App. B.2), while obj. 63 does not. Weak scaling (Fig. 26) illustrates how efficiency is worse for obj. 71 than for obj. 63, but is improved if more initial points are used.

For all methods only one local minimum was found per objective function. This could question the use of global methods. However, the occurrence of initial points causing 'spikes' and other 'problem points' could prevent a local method from finding the local minimum, and a global method would mitigate that risk.

6 Computer resources

All computer resources used in this report are presented in table 5. The Passmark2 score is a linear grading system for the performance of different CPUs and is included to relate performance between hardware.

Table 5: Specifications for all computer resources used throughout this report.

Name             CPU                                                          RAM       Network   Cost
Laptop           Intel Core i5-2450M, 2 CPUs @ 2.50GHz (Passmark: 3,443)      6 GB      —         —
ABB Cluster      Intel Xeon E5-2670, 16 CPUs @ 2.6GHz (Passmark: 12,867)      32 GB     1 Gbps    —
Cloud (cc2.8x)   Intel Xeon E5-2670, 16 CPUs @ 2.6GHz (Passmark: 12,867)      60 GB     10 Gbps   2.3 $/h
Cloud (c3.8x)    Intel Xeon E5-2680 v2, 16 CPUs @ 2.8GHz (Passmark: 16,799)   60.5 GB   10 Gbps   1.9 $/h
Cloud (r3.8x)    Intel Xeon E5-2670 v2, 16 CPUs @ 2.5GHz (Passmark: 14,638)   244 GB    10 Gbps   3.1 $/h

2 http://www.cpubenchmark.net/cpu_test_info.html


7 Cloud Assessment - Method and Results

In this section the parallel performance of the software will be evaluated on the Amazon Elastic Compute Cloud (EC2). The different optimization methods will be stressed and analysed. The aim is to compare the parallel performance of the software on the cloud cluster and on the ABB cluster.

The Amazon cloud infrastructure is built on virtualization techniques, where virtual machines, called Instances, are provisioned and interconnected into a virtual cluster. There are several instance types designed for different hardware requirements, of which a couple will be tested and analyzed from a cost-performance perspective. The pre-study of the Amazon EC2 infrastructure is available in Appendix A, and the complete specifications of the used computer resources can be found in section 6.

7.1 Cloud Computing through Mathwork’s CloudCenter

In this report, access to the Amazon EC2 infrastructure is primarily done using Mathwork's CloudCenter3, a web service that automatically provisions instances, configures virtual clusters, and installs MATLAB on EC2. The service is limited to the instance types cc2.8xlarge and cg1.4xlarge, and a maximum of 256 Workers. The cc2.8xlarge is one of Amazon's compute instance types with hardware specifications similar to the ABB cluster. The same range of tests done for the ABB cluster is also performed through Mathwork's CloudCenter.

7.1.1 Optimization method comparison

Figure 27: cc2 instances compared to the ABB cluster Leo for all optimization methods. Left plot: speedup; right plot: difference in runtime.

3http://www.mathworks.se/discovery/matlab-ec2.html


The methods scale similarly on the cloud and on the ABB cluster, but a significant decrease in speedup develops for Leo when using 24 workers or more (Fig. 27). The execution times of all methods are in general higher when run on the cloud, regardless of algorithm, objective function or problem size (Figs. 27, 28).

Figure 28: Radar plot comparing the execution time (s) (on the radial axis) of the cloud (blue) and Leo (red) for 1 worker.

7.1.2 Maximized number of workers

Increasing the number of workers eventually mitigates the change in speedup for all methods (Figs. 29, 30).

Figure 29: Scaling from 1 to 256 workers for the methods: fmincon, GS, PS.


Figure 30: Scaling from 1 to 256 workers for Multistart (24 pts, 48 pts, 96 pts).

7.2 Comparison Cloud instances

There are two additional instance types suited for running the ADM. The first instance type, c3.8xlarge, has a faster processor and costs roughly 20 % less per hour compared to cc2.8xlarge. The second instance type, r3.8xlarge, has four times the memory and costs 35 % more than the cc2.8xlarge. The two instance types are chosen since they are designed for applications with different hardware requirements: computing power and memory, respectively. Since Mathwork's CloudCenter is limited to the cc2.8xlarge, custom scripts were created to replace the functionality of CloudCenter.

Running fmincon and MultiStart for different numbers of workers, the execution time is compared between the instance types and the ABB cluster. The results show that the c3.8xlarge is comparable to, and sometimes even faster than, the ABB cluster, while the r3.8xlarge gains no significant improvement from its increased memory capacity (Fig. 31).

Figure 31: Fmincon and MS. Different instance types and number of workers.


7.3 Cloud bursting

The ability to quickly offload large workloads from a private computer to a cluster is called bursting. Cloud bursting is a deployment model where a hybrid cloud infrastructure, i.e. a private network with access to public cloud services, is designed to cope with short peaks in computing demand by offloading workload to the public cloud4.

The ADM, with its many objective functions, can rapidly spike in required computing capacity, for example if the global optimum of four objective functions is to be found using MultiStart with 64 initial points each. Bursting this type of heavy task would benefit a laptop user. Test results from bursting to different clusters show significant time improvements compared to running on a regular laptop (Tab. 6).

Figure 32: Two level parallelism of the ADM software.

From a parallel perspective, the most efficient approach when solving a problem is to strive for as high granularity as possible. Dividing the ADM into two parallel levels, where the objective functions are run simultaneously and every MultiStart optimization has access to a pool of workers, achieves this (Fig. 32).

Table 6: Running the ADM using 128 workers. Workload: 4 obj. functions with 4 multistart runs each. Every multistart has 64 initial points. A total of 256 fmincon runs.

Runtime           ABB Cluster   cc2.8xlarge   c3.8xlarge   Laptop (serial)
Submit job        5 s           7 s           6 s          -
Wait              130 s         170 s         140 s        5 h
Receive results   1 s           12 s          12 s         -
Total             136 s         189 s         158 s        5 h

4http://searchcloudcomputing.techtarget.com/definition/cloud-bursting


8 Cloud Assessment - Discussion

8.1 Cloud Performance

The ADM has the lowest execution times when run on the ABB cluster, regardless of method or optimization characteristics (Figs. 27, 28). The processor specifications of the cloud cluster (cc2) and the ABB cluster are identical (Sect. 6). This indicates that virtualization5 could be the cause, since this is the largest difference between the architectures. Communicating through the virtual layer accumulates latencies that eventually become significant. Expressed in relative terms, all methods show that running the ADM on the ABB cluster is 18 % faster (std 3 %) than running it on the cloud (cc2.8xlarge). Note that virtualization is suggested as a probable cause and not as proven fact.

Considering the performance of the other instance types (c3, r3), the loss in speed due to virtualization can be compensated for through better hardware. The c3.8xlarge instance has a more powerful processor than the ABB cluster, reducing the ABB cluster's advantage in execution time from 18 % to 3 % (std 3 %) (Fig. 31). For the r3.8xlarge instance the corresponding number is 11 % (std 3 %). This indicates that the ADM is not limited by the cluster's memory capacity, but rather by the computing power of the CPU.

The scaling characteristics of the ADM are similar for all cluster architectures when using 24 workers and below. For more workers, all methods except MultiStart transition from having the shortest execution times on the ABB cluster to having shorter execution times on the cloud (Fig. 27). This is due to increased network communication when using many workers, which especially affects methods where communication is continuous throughout the optimization process (fmincon, globalsearch and patternsearch).

The ABB cluster operates on a 1 Gbps (gigabit per second) network, while all tested cloud instances use a 10 Gbps network. Measuring the bandwidth allocation of parallel fmincon and patternsearch shows that every worker takes roughly 48 Mbps of bandwidth.6 Hence, saturating the network bandwidth requires ∼20 or more workers on the ABB cluster and ∼200 workers for the AWS cloud instances. A saturated network significantly affects the parallel performance of the software, as seen in figure 27, since workers wait until data is received before computing.
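The worker counts follow directly from the measured per-worker bandwidth:

1 Gbps / 48 Mbps ≈ 1000/48 ≈ 21 workers (ABB cluster),      10 Gbps / 48 Mbps ≈ 10000/48 ≈ 208 workers (cloud instances).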

Increasing the number of workers to the maximum of 256 clearly identifies where the speedup decreases (network saturation) and eventually flattens (Figs. 29, 30). The change in speedup levels off due to the granularity of the method, resulting in idle workers that do not allocate network bandwidth. A tentative observation of the graphs shows that the solvers' granularity limits (Sect. 5.4.3.2) correlate with the number of workers where the speedup flattens. Finer measurements are needed to fully confirm this.

5 On the cloud: adding a virtual layer between hardware and software.
6 Measured through NetHogs version 0.8.0 while running the ADM with the chosen method.


Bursting large workloads to the cloud is a viable and time efficient option. Running the ADM for 4 objective functions, with Multistart and 64 initial points per objective function, took 5 hours of computation time on a typical laptop. Bursting to the cloud reduced this time to ∼3 minutes. The total time consists of sending the work (submit job), waiting for the job to complete (computations), and returning the results. Comparing the cloud and the ABB cluster, the times differ in the wait time and the receive time (Tab. 6). The difference in wait time is expected, considering the virtualization layer. The time to receive data also seems reasonable: sending 1-2 MB of data over the private network (ABB cluster) is many times faster than from the public cloud.

8.2 The ADM as a Service

The parallel software has been created according to the service model Software as a Service (SaaS, Sect. 4.2), in which the user provisions the software from the cloud when needed. The ADM software is accessed from two places in Amazon: Amazon S3 when the cloud cluster is offline, and an EBS storage device when it is online. This choice of data management adds redundancy and security by storing the model encrypted and redundantly in Amazon S3 while offline.

In order to provision the model as a service, a MATLAB installation is required on the client computer. The client software is a single MATLAB function that creates jobs with parameters instructing the ADM software which objective function to optimize and which method to use. Through Mathwork's CloudCenter, a seamless link supplies the client with the service from Amazon EC2. The complete concept is illustrated in figure 33.
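A minimal sketch of what such a client function could look like, using the Parallel Computing Toolbox batch interface; the cluster profile name, the entry point admService and its parameters are assumptions rather than the actual implementation:

function job = submitADMJob(objFunId, method, nStartPoints)
% Submit one ADM optimization job to a configured cluster profile
% (e.g. the CloudCenter cluster or the local ABB cluster).
    c   = parcluster('CloudCenterCluster');            % assumed profile name
    job = batch(c, @admService, 1, ...
                {objFunId, method, nStartPoints}, ...   % interpreted by the ADM on the cluster
                'Pool', 31);                            % pool of 31 workers plus 1 task worker
end

% Client usage:
%   job = submitADMJob(71, 'multistart', 64);
%   wait(job); results = fetchOutputs(job);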

Figure 33: Conceptual illustration of the ADM as a service.


8.2.1 Cloud considerations

Moving the ADM from the private network to the cloud is necessary in order to provide the software as a service. The relocation can potentially expose company secrets to the general public. This is always a concern with cloud computing and should be carefully considered before choosing cloud services over local computer resources.

From the literature study on cloud security (App. A.3) made for this report, several security concerns related to the cloud were found. These are important to assess when considering a cloud solution and are briefly listed below.

Security concerns:

• Network Security (spoofing, sniffing, firewalls, security config.)

• Interfaces (API, User interfaces, administration)

• Data Security (Encryption, redundancy, disposal)

• Virtualization (Isolation, hypervisor, data leakage)

• Governance (Lack of user data control, lock-in)

• Compliance (SLA, Loss of service, Audit)

• Legal issues (Subpoena laws, provider privilege)

Amazon is one of the largest cloud service providers at present and constantly improves the security of its services. Many of the listed security concerns have direct solutions within Amazon Web Services (AWS). For example, network security is handled through sophisticated firewalls called Security Groups, and access control is regulated through IAM (the Identity and Access Management console). A thorough walk-through of common security concerns and how to mitigate them in AWS can be found in Appendix A.3.

Storage on AWS offers sufficiently redundant storage of the software. The overall storage services are perceived as comparable to, and even more secure than, company solutions. For example, redundant storage in multiple locations in Amazon S3 guarantees 99.999999999 % data durability with 99.99 % availability7, whereas a company's private network most likely is a single isolated location, vulnerable to e.g. power outages.

Data management for the ADM, through CloudCenter, is elegantly constructed. When the cloud cluster is offline, all data is moved and stored in an encrypted image in Amazon S3 (a snapshot), while the data in Amazon EC2 is destroyed together with the cluster. This improves redundancy by using S3 and improves security by only decrypting the software when it is provisioned by the user. Note that this solution is specific to Mathwork's CloudCenter; Amazon supports several storage solutions. For more information regarding data management in AWS, see Appendix A.2.

7https://aws.amazon.com/s3/faqs/


8.2.2 Hybrid solutions

The ADM service model allows for the use of a hybrid cluster infrastructure, i.e. a composition of two or more clusters available to the user. The client software can effortlessly and simultaneously deploy jobs to the Amazon EC2 cloud, through the ABB firewall, and to the ABB cluster on the private network. The solution is possible through the design of the client software, which is not specific to the ADM software. Hence, this solution works for arbitrary parallel MATLAB software.

The hybrid infrastructure can be used to burst workloads that require many workers to the cloud. This is useful considering that the ABB cluster is shared among many employees at ABB, and the desired computing power may not always be available. For example, the bursting example presented in section 7.3 (Tab. 6) requires 128 workers. The ABB cluster contains 512 cores, so running the example would allocate 25 % of the available computing resources, which is not acceptable.

8.3 Cost analysis

The cost of provisioning computer resources from Amazon EC2 depends on several choices related to e.g. storage type, computing power and network traffic. As for other cloud services, the computer resources are billed by the hour or by the amount occupied ($/GB of storage).

The largest cost is associated with the provisioning of instances, where the choice of instance type is important. The three instance types tested for performance (Fig. 31) vary in their cost per hour (Fig. 34). It is clear that the c3.8xlarge is the best choice in terms of both cost and computing performance for the ADM.

Figure 34: Cost for different instance types compared to the obtained execution time when running the ADM with Multistart, 32 workers and 30 initial points.


9 Conclusions

This report demonstrates the use of the ADM as a service, supplying global optimization in parallel from Amazon EC2. The parallel performance on Amazon EC2 is comparable to that of a similar on-site cluster. Two performance-related factors were identified that set the two cluster architectures apart: network bandwidth, to the disadvantage of the on-site cluster (ABB), and virtualization, to the disadvantage of the cloud cluster. In particular, the on-site cluster executed the ADM 18 % (± 3 %) faster than the cloud cluster (cc2.8xlarge), regardless of method, for fewer than 20 workers.

From a cost-performance perspective, the c3.8xlarge instance was found to have the best performance and the lowest cost per hour for running the ADM software.8

Multistart is the most efficient method when run in parallel. The analysis and parallelization of the ADM revealed hard limits on the speedup for all optimization methods. The most important limit is related to the ratio between serial and parallel workload, where MultiStart has the lowest ratio. Another limitation is based on the granularity of the methods: the increase in speedup becomes zero when the number of workers surpasses the granularity of the method. MultiStart is the only method with dynamic granularity, where the granularity equals the number of initial points; the other methods are limited by the dimensionality of x. Finally, latencies from network communication become significant and cause a decrease in speedup for an increasing number of workers. The methods fmincon, globalsearch and patternsearch require continuous network communication throughout the optimization process and thus suffer from significant decreases in speedup. Multistart only sends initial points and returns fmincon results, and is hence not affected by network latency to the same degree.

Finally, the use of cloud services offers a great supply of on-demand computer resources, effectively channelled through the ADM software. Bursting large optimization jobs with multiple objective functions shows how time frames have been shrunk from 5 hours to a few minutes, for a total cost of 30 $. With this said, it is concluded that cloud computing is a promising service for many customers, large and small, with no start-up costs and high availability.

8 Shortly before the final submission date of this report, Mathwork's CloudCenter implemented support for the c3.8xlarge instance; a reasonable step considering that it offers more computing power at a lower cost.


9.1 Future research

9.1.1 Heterogeneity within Amazon EC2

Heterogeneity of computer resources within Amazon EC2 could be problematic, since it reduces the parallel performance of the software. For example, consider provisioning several instances linked into a cluster. The developer designs the software to evenly distribute workloads to all instances. Some instances will fall short in performance, due to older hardware, causing a negative impact on the overall execution time.

A study performed in 2012, called "Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2" [15], found that "...Amazon EC2 uses diversified hardware to host the same type of instance. The hardware diversity results in performance variation. In general, the variation between the fast instances and slow instances can reach 40%" (Sect. Conclusions, [15]).

The investigated instances did not include the compute family (c3, cc2). Hence, the subject "Analysis of heterogeneity within the Amazon EC2 compute family" is suggested as a future research area.

9.1.2 Replacing Mathwork’s CloudCenter

Mathwork's CloudCenter provisions instances, creates the virtual cluster and installs MATLAB, all through a single click of a button. This greatly broadens the target audience. However, CloudCenter limits the use and customization of the provisioned computer resources.

It is possible to replace the functionality of CloudCenter using the available APIs (application programming interfaces) from Amazon. This was done in order to run more instance types (c3, r3) and to test other Amazon features. One noteworthy feature is called Spot Instances, which allows the customer to provision instances at a considerably lower cost. By stating a maximum bid, the instance is provisioned until the demand surpasses the set bid. The use of Spot Instances reduced the costs for some tests in this report by 90%.

Considering that a lot of useful features are not accessible through Mathwork's CloudCenter, a future research area is to gather information on how to construct more general web services that can replace Mathwork's CloudCenter.


References

[1] Y. Sharikov et al. (2008) MAPAS: A tool for predicting membrane-contacting protein surfaces. Nature Methods 5(2): 119.

[2] N. Antonopoulos, L. Gillam (2008) Cloud Computing: Principles, Systems and Applications. Springer, Introduction p. 8.

[3] Sunde, M. (2013) Cloud Optimization for Hot Rolling. MSc thesis, Uppsala University, ABB CRC Sweden.

[4] Yelick, K. et al. (2011) The Magellan Report on Cloud Computing for Science. Report, U.S. Department of Energy.

[5] Voss et al. (2013) An elastic infrastructure for research applications (ELVIRA). Journal of Cloud Computing, Springer.

[6] IRTC (2003) 22nd International Rolling Technology Course, documentation, Industrial Automation Services, lecture 2, p. 7.

[7] Daneryd, A. (2013) Models and optimization for rod and wire rolling and ROT cooling. ABB CRC Sweden.

[8] Nocedal, J. and Wright, S. (2006) Numerical Optimization, Second Edition. Springer.

[9] Kuhn, H.; Tucker, A. W. (1951) Nonlinear programming. Proceedings of the 2nd Berkeley Symposium. Berkeley: University of California Press, pp. 481-492.

[10] MathWorks (2014a) Optimization Toolbox User's Guide. http://www.mathworks.co.uk/help/pdf_doc/optim/optim_tb.pdf, accessed 2014-06-01.

[11] MathWorks (2014a) Global Optimization Toolbox User's Guide. http://www.mathworks.com/help/pdf_doc/gads/gads_tb.pdf, accessed 2014-06-01.

[12] Amdahl, G. M. (1967) Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, vol. 30 (Atlantic City, N.J., Apr. 18-20). AFIPS Press, Reston, Va., pp. 483-485.

[13] Mell, P.; Grance, T. (2011) The NIST Definition of Cloud Computing. National Institute of Standards and Technology, USA.

[14] Cafaro, M. and Aloisio, G. (2010) Grids, Clouds and Virtualization. Springer.

[15] Ou, Z. et al. (2012) Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2. Aalto University, Finland; Deutsche Telekom Laboratories, Germany.

[16] Saxen, A. and Bernander, K. (2014) Parallel Global Optimization of ABB's metal process models using Matlab. Uppsala University.

[17] MathWorks (2014a) MATLAB profiler. http://www.mathworks.se/help/matlab/ref/profile.html, accessed 2014-06-01.

[18] MathWorks (2014a) MATLAB memory optimization. http://www.mathworks.se/company/newsletters/articles/programming-patterns-maximizing-code-performance-by-optimizing-memory-access.html, accessed 2014-06-01.

[19] MathWorks (2014a) Introducing MEX-Files. http://www.mathworks.se/help/matlab/matlab_external/introducing-mex-files.html, accessed 2014-06-01.

[20] GNU GCC Optimization flags. http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, accessed 2014-06-01.


A Study of data storage and security on cloud

A.1 Introduction

The following section contains information concerning data management and security on the cloud in general and for Amazon EC2 in particular. Amazon Elastic Compute Cloud (EC2) is a service offered as part of Amazon Web Services (AWS), providing on-demand and resizeable computer resources from the cloud. A range of subjects is presented in order to give the reader an overview of various important topics.

A.1.1 Environment

The Amazon Web Services (AWS) environment offers simple and complete control of computing resources, user management, storage management and monitoring tools. The customer can obtain computing resources and begin computations in a matter of minutes.

The infrastructure of Amazon EC2 builds on a few key concepts that are important to understand. Hardware available to a customer is commonly shared with other customers. Logical isolation is obtained by creating Virtual Machines (VMs) that act as separate servers while still on the same hardware; Amazon's Virtual Machines are called Instances. When starting an Instance, the customer can configure the Instance type (hardware requirements), what kind of storage to use and what software to launch. The software is packaged into an Amazon Machine Image (AMI). The customer can easily launch any number of instances, linking them together to form clusters. An illustration of the Amazon Web Services can be seen in figure 35.

Figure 35: Illustration of Amazon Web Services.

Amazon EC2 is a service available to customers all over the world. Amazon's servers are distributed over a number of geographically strategic locations, called Regions. Every Region has a number of isolated locations known as Availability Zones (AZs), as seen in figure 36. Every Availability Zone provisions computing power, storage and network capabilities. The customer can choose in which Region and AZ to launch Instances and store data. Amazon recommends storing data and running Instances in several AZs, since hardware failures are isolated to single AZs.

Figure 36: Illustration of Amazon EC2 Regions and Availability Zones.

A.1.2 Placement group

A Placement Group is an important feature that enables a logical grouping of instances within a single Availability Zone.9 Instances within the same Placement Group benefit from full-bisection bandwidth and low-latency network performance. Amazon does not specify whether Instances in a single Placement Group are located on the same physical host. Amazon recommends this type of grouping for tightly coupled applications, i.e. application modules that highly depend on each other, and node-to-node communication, typical of HPC applications.

Placement Groups are limited by the following:

• A placement group can only be created and used within one Availability Zone. Hence, spanning one placement group to include instances from different Availability Zones is not possible.

• Placement Groups cannot be merged. To regroup instances, they first need to be terminated.

• Only a few of the available instance types support launch into a Placement Group. These are: c3.large, cr.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge, cc2.8xlarge, cg1.4xlarge, g2.2xlarge, hi1.4xlarge, hs1.8xlarge, i2.xlarge, i2.2xlarge, i2.4xlarge.

A.2 Data Management

Data management refers to how data is stored, processed and disposed of. For Amazon EC2 customers there exist several solutions tailored to their requirements on performance, redundancy, security, etc.

9http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html


A.2.1 Storage types

There are three types of data storage available for Amazon EC2, called Amazon Elastic Block Storage (EBS), Amazon Instance Store, and Amazon Simple Storage Service (S3). In this section their main characteristics are presented, and a guideline is given on when to use which storage type. The depiction in figure 37 visualizes the three storage types.

Figure 37: Illustration of Amazon storage types. Arrows indicate data communication.

A.2.1.1 Amazon Elastic Block Storage (EBS)

Amazon EBS provides highly available and reliable storage, suited for a wide range of applications. EBS volumes can be viewed as stand-alone block devices (like hard drives) that are mounted on, i.e. associated with, Instances. Each Instance can have several volumes mounted, but a single volume can only belong to one Instance. An EBS volume can also be used as the root device for an Instance, a so-called EBS-backed Instance, meaning that it hosts the AMI.

The main characteristics are:

• Persistent storage: Data is safely stored before, during and after a launch of an Instance.

• Storage capability: From 1 GB up to 1 TB of storage.

• Locality: Only accessible within an Availability Zone and can be attached to any running Instance.

• Redundancy: Amazon EBS volumes are redundantly stored in multiple physical locations without additional charge. However, copies are stored within a single Availability Zone, and Amazon recommends backing up using the Snapshot functionality. Backups are stored redundantly on Amazon S3.

• Stop Instances: EBS storage volumes can be used for booting the AMI; the Instance is then called an EBS-backed Instance. Instances booted with EBS can be stopped. Stopping the Instance allows the customer to quickly restart the Instance when needed.

Amazon EBS provides two storage types, called Standard Volumes and Provisioned IOPS Volumes. The main difference lies in the provisioned Input/Output Operations Per Second (IOPS), where more IOPS are needed for applications with intensive changes in stored data. The Standard Volume is pitched by Amazon as "ideal for applications with light or bursty I/O requirements", promising data delivery rates of 100 IOPS on average, with burst capabilities of up to hundreds of IOPS. Examples of common applications are file servers and low-traffic websites.

Provisioned IOPS Volumes are available at a small additional cost, offering up to 4000 IOPS per volume. The volumes are designed for I/O intensive workloads that are sensitive to consistency in throughput (data transfer rate). The ratio between provisioned IOPS and volume size is at most 30; for example, a volume provisioned with 3000 IOPS needs a volume size of at least 100 GB. Amazon guarantees delivery within 10% of the provisioned IOPS 99.9% of the time over a year. Examples of applications are large databases and business applications.

Amazon EBS volume performance is affected by several factors. Some of the most important are listed here:

• Workload demand: There is a relationship between the performance of the EBS volume, the number of I/O requests, and the latency of each request. Balancing this relationship is crucial in order to fully utilize the provisioned IOPS.

• Workloads that require minimal variability and dedicated traffic should use EBS-optimized Instances or an instance type supporting 10 Gigabit network connections.

• Pre-warming EBS volumes, i.e. writing data to the volumes before running the application, prevents a 5-50% reduction in IOPS during the initial run of the application.

• Instance types that support greater IOPS than provisioned by a single EBS volume should use multiple EBS volumes, linked together using RAID 0, for higher IOPS.

A.2.1.2 Amazon Instance Store

Amazon Instance Store volumes are preferred when dealing with temporary storage or data that changes frequently. The characteristics of the volume depend on the choice of Instance type, but common to all Amazon Instance Store volumes are:

• Temporary storage: Data on the volume exists only during the lifetime of the associated Instance.

• Storage capability: From 150 GB up to 48 TB of storage.


• Locality: Physically connected to the underlying hardware of the Instance.

• Shared resource: Instances started on the same hardware share the Instance Store disk subsystem.

Amazon Instance Store volumes can host the AMI when launching an Instance, which is then called an Instance Store-backed Instance. This type of configuration should be used when data preservation is not a concern, when high performance is critical, or if the same data should be replicated across a set of Instances sharing the same hardware. Pre-warming Instance Store volumes is only important for Instance types lacking SSD hard drive configurations.

A.2.1.3 Amazon Simple Storage Service (S3)

Amazon S3 provides reliable and cheap data storage. Amazon EC2 can use Amazon S3 for storing AMIs and Snapshots with high redundancy at a different location than the Availability Zone, enabling quick recovery of data in case of system failures. All data in Amazon S3 is stored in so-called Buckets, linked to the customer's AWS account.

• Storage capability: Any amount of data

• Slow data retrieval: Data is considerably slower to retrieve, compared to the other storage types, due to the off-location properties. However, Amazon states that the retrieval latency is insignificant relative to Internet latency; the storage is suited for Internet applications that operate with similar latencies. Amazon S3 should primarily be used for secure data storage over a longer time period.

• Redundancy: Amazon guarantees storage with 99.999999999% durability and 99.99% availability.

Objects in Amazon S3 storage can easily be accessed through the bucket they are contained in. For example, the key value abbData.zip maps to an object stored in myawsbucket. The data is addressable using: http://myawsbucket.s3.amazonaws.com/abbData.zip

A.3 Security

A.3.1 Introduction

Security in the cloud is an important aspect when considering the adoption of a cloud service; it means handing over data and manageability control to a third party and, in the worst case (unintentionally), to the whole Internet. There are many concerns among scientists and companies regarding cloud security today. To properly grasp the spread of the concerns, a categorization suggested by Gonzalez et al.10 is used. The Amazon Web Services security is evaluated based on these categories to give insight into the security of using Amazon as a cloud service provider.

10 A quantitative analysis of current security concerns and solutions for cloud computing, Gonzalez et al., 2012.


A.3.2 Categories

A.3.2.1 Network Security

refers to protecting the network communications and configurations of the cloud environment. A selection of known threats to transfer security are: sniffing, spoofing, man-in-the-middle, and side-channel attacks, all aimed at acquiring login credentials or other sensitive data by breaching the network communication between the client and the cloud service, as seen in figure 38.

Figure 38: Acquiring sensitive data by sniffing, spoofing, etc.

Regulating incoming traffic is important and is achieved using a firewall. Fire-walls govern what IP-addresses that are granted access and through what ports(SSH,HTTP, FTP) data can be transfered. Properly configured firewalls can preventDenial-of-Service(DoS) attacks, i.e. attempts to bring down services by over-loading server capacities, and detect hack attempts.

Amazon Web Services protects network communication through a rigorous security system. The system consists of Virtual Private Networks (VPNs) transferring encrypted data between the customer and the cloud service using Key-Pairs. Firewalls called Security Groups exist to limit user access to Instances. Key-pairs consist of a public key stored on Amazon EC2 and a private key stored by the customer. Together they authenticate secure VPN connections, preventing threats like sniffing or spoofing from succeeding. The keys used by Amazon EC2 are 1024-bit SSH-2 RSA keys. The customer can assign up to five thousand key pairs per region if needed. Usually the number of key pairs required equals the number of users the customer wants to grant access. After a user is authenticated, he/she needs to belong to a Security Group that grants access to a running Instance.

A.3.2.2 Interfaces

Interfaces to user administration, programming (APIs), or end-user applications need to be especially protected against attacks. Non-authorized control over Interfaces enables destruction of data, management of users, and provisioning of computer resources. Authentication and access control are key when preventing these kinds of threats.


Amazon EC2 offers complete control over authorized users and resources through AWS Identity and Access Management (IAM).11 The service also enables different Roles within the cloud environment; for example, special credentials can be assigned to developer users and used when their programs authenticate with the Amazon EC2 API.

Another layer of protection offered by Amazon is AWS Multi-Factor Authentication (MFA),12 which authenticates users with a two-factor approach. The first factor is what the user knows (username and password). The second factor constitutes the additional security of MFA and is what the user has (an authentication code from an MFA device). This service is optional and offered at no additional charge.

A.3.2.3 Data Security

Data Security is the protection of data in terms of integrity, confidentiality, and availability. Preserving data integrity refers to preventing data loss through redundant backups and fixing corrupt data. To uphold data confidentiality, user management control is vital, as is encrypting vital data in case of a security breach. The availability of data refers to users being able to access the requested data; guaranteed service uptime and protection against DoS attacks are examples of what should be considered. Amazon EC2 provides services to ease concerns related to lack of data security. The storage types Amazon EBS and Amazon S3 both offer automatic redundant storage, see the Storage section, to keep data integrity. Amazon S3 also offers data integrity checks by validating data against a checksum.

The guarantees of uptime provided by Amazon are regulated in their Service Level Agreements (SLA). Noteworthy is that they market their Amazon S3 storage as being an exceptionally durable and dependable service.13 Another important aspect of data security concerns data disposal. A customer should be certain that data is properly destroyed when demanded. How, and if, Amazon completely disposes of the customer's data when demanded (including hidden backups, logs, etc.) is not clear. Amazon mentions that all EBS volumes are wiped before they are reused, but there is no way for the customer to be certain.14

Encryption is a vital step in ensuring data confidentiality; the process makes data available only to users with a decryption key. There are two cases where encryption is especially important: when authenticating users and when managing data. Amazon authenticates users in a number of ways. The first is through Access keys; these are assigned to a user, allowing them to access various APIs, tools, and consoles at AWS. The second authentication is required when accessing an Instance, and is done through key-pairs that are 1024-bit encrypted. The root user needs to assign access keys to users and assign them privileges. A set of key-pairs can be created, where the private key is managed by the user and used when accessing Instances. For more information, see the Network Security section.

11 http://aws.amazon.com/iam/
12 http://aws.amazon.com/iam/details/mfa/
13 http://aws.amazon.com/s3/
14 Amazon Web Services: Overview of Security Processes, p. 21


Encrypting data is important, and Amazon offers a server-side encryption solution for the Amazon S3 storage. This means that all data on S3 will automatically be encrypted and decrypted without the need to change the customer's applications. However, this is the only encryption Amazon offers, although they do recommend further encrypting data hosted on, for example, an EBS volume. This could be done using third-party software that is manually installed by the customer on the host OS loaded by the AMI.

A.3.2.4 Virtualization

Virtualization on the Cloud refers to concerns regarding isolation of the customer's Virtual Machines (VMs) and applications. The nature of the concern lies in the sharing of hardware between different customers' VMs. Threats like Cross-VM manipulation, meaning that a VM tries to acquire data of another machine through manipulation of memory and storage, are examples of malicious activity. The Hypervisor is the piece of software that manages all VMs. Keeping the Hypervisor protected through regular updates and monitoring is important. This responsibility lies with the provider, and the customer should pick a provider with well-documented update and monitoring procedures.

Instances in the same Availability Zone on Amazon EC2 can share underlying hardware. The Hypervisor software is Xen, which is actively developed in the Xen community, where Amazon participates. In addition to the Hypervisor, which logically separates Instances, Amazon uses the AWS firewall. All data packets must pass this extra layer, further isolating Instances. Amazon states the following: 'Instances can be treated as if they are on separate physical hosts'.15

A.3.2.5 Governance

Governance concerns are related to losing administrative and security control in the Cloud environment. Migrating from a private cluster, where the customer has total control, to the Cloud naturally gives rise to some concerns. The customer needs to fully rely on the provider's administration and user interfaces, which amounts to a huge responsibility put on the cloud provider. Creating an application in the cloud environment involves a great deal of customization to function properly, and the customer needs to be careful about the amount of customization required. In the worst case, the application will rely heavily on the cloud service and migrating the application to another provider could prove too costly. This phenomenon is called Vendor Lock-in and is a widespread concern.

Amazon, like many other providers, requires customization of the customer's software. For example, Amazon uses AMIs that are tailored to AWS; the software needs to be made compatible with an AMI before it can be launched on an Instance. Administration and data control interfaces are user-friendly and rigorously documented,16 reassuring customers that choose Amazon as their provider.

15 Amazon Web Services: Overview of Security Processes, Nov. 2013, Amazon
16 http://aws.amazon.com/documentation/


A.3.2.6 Compliance

Compliance concerns relate to the provider failing to comply with agreements and contracts established with the customer. The customer should review the Service Level Agreements (SLA) regulating required service availability and the adoption of basic security procedures. On top of the SLA, the customer should have audit capabilities, assessing security or availability concerns themselves or through a third party.

Amazon offers Service Level Agreements and provides information on the third-party audits that AWS annually undergoes.17 The customer can perform their own audits but needs to inform Amazon through the AWS Vulnerability / Penetration Testing Request Form.18 There also exists a policy regulating permitted and prohibited behaviour on AWS.19

A.3.2.7 Legal issues

Legal issues regard the jurisdiction affecting the customer's services and applications; the nature of the issue can differ depending on the country in which the service is hosted. For example, subpoena-based law-enforcement measures can result in computer hardware being seized for evidence, potentially shutting down services belonging to several unaware customers. Data disclosure is also an important aspect: the customer could be forced to provide sensitive data if a judge deems it important in court, and failing to provide the information could lead to serious ramifications in many countries. A potential customer is recommended to thoroughly read the service terms and customer agreements to understand what duties and responsibilities the provider and the customer have.

Amazon AWS Customer Agreement: http://aws.amazon.com/agreement/
Amazon AWS Service Terms: http://aws.amazon.com/serviceterms/
AWS Acceptable Use Policy: http://aws.amazon.com/aup/
AWS Privacy: http://aws.amazon.com/privacy/

A.3.3 Amazon Virtual Private Cloud (Amazon VPC)

Amazon VPC gives further security by provisioning a logically separated section of Amazon Web Services. In Amazon EC2, Instances are launched within a selected Availability Zone and are assigned a public IP address from Amazon's public IP address pool. Amazon VPC enables the user to isolate Instances by grouping them into networks with private (IPv4) addresses. Within the private cloud, the user can configure sub-networks and set up routing to control the flow of traffic. Amazon VPC should be considered when the customer requires high control of traffic flow and user access, which is often the case when connecting a cloud service to a company with many employees.

17 http://aws.amazon.com/ec2/sla/
18 http://aws.amazon.com/security/penetration-testing/
19 http://aws.amazon.com/aup/


A.3.3.1 The infrastructure

The Amazon VPC comes with several templates that provide varying levels of public access. For example, the templates determine the number of Internet Gateways and the number of Virtual Private Gateways to the VPC. An illustration of how a VPC can be configured is seen in figure 39.

Figure 39: The VPC infrastructure with two Gateways, one private and one public. The VPC's internal traffic is shielded from the rest of EC2.

The best way to understand the VPC infrastructure is to follow the data. Information is passed to and from the VPC using either a Virtual Private Gateway or an Internet Gateway. The former enables dedicated access, using VPN services, from e.g. a company's internal network to the VPC. Internet Gateways allow public access to chosen parts of the VPC, like a web page. The data needs to pass several filters before entering or leaving an Instance. The first stop is the routing table, which routes the data to the correct subnet and the correct Instance IP address. Then come the Access Control Lists (ACLs), which can be viewed as an optional firewall securing a subnet. Like Security Groups, the ACL can be configured to only allow certain protocols and IP addresses to and from a subnet. When data has entered a subnet it travels on private IP addresses. Security Groups in a VPC are similar to the Security Groups in EC2, but need to be configured separately. To set up a VPC on Amazon, the customer needs to access the Amazon VPC console. Amazon provides a step-by-step tutorial on how to do this.20

A.3.3.2 Amazon Direct Connect

Amazon Direct Connect offers a dedicated connection between a customer's internal network and an AWS region. With this connection in place, the traffic can bypass Internet Service Providers in the network path. The main benefits are consistent network performance, isolated communication to VPCs, and lower transfer costs.21

A.3.3.3 Dedicated Instances

When running a VPC, Amazon EC2 Instances can be launched with dedicated tenancy. This means that the Instance is physically isolated at the host hardware level, adding another layer of protection against the threats discussed under Virtualization. This option is also available for an entire VPC. Amazon charges an additional fee for these services.

20 http://docs.aws.amazon.com/AmazonVPC/latest/GettingStartedGuide/Wizard.html
21 http://aws.amazon.com/directconnect/


B Parameter study

B.1 Discretization - method and results

Obj. 63 is run with three of fmincon's algorithms on the ADM; the results are shown in figure 40.

Figure 40: The minimum converges to a single value for increasing fdJ and fdM. There is a difference in the execution time of the ADM that depends on the algorithm.

The results show that, regardless of algorithm, the solution converges towards a single value as the parameters fdJ or fdM increase; the execution time grows sub-linearly with the parameters. It can be concluded from the graphs that fdJ = 11 and fdM = 6 are sufficient when considering the trade-off between accuracy and execution time. Noteworthy is that there exists a difference in execution time between the three algorithms, where active-set is the fastest. This is also mentioned in the previous ADM report [16].
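A minimal sketch of how such a sweep could be arranged in MATLAB is given below. The function handles adm_objective and adm_constraints, the initial point x0, and the bounds lb and ub are hypothetical placeholders for the ADM model interface, not the actual implementation; adm_constraints is assumed to return the nonlinear constraints [c, ceq] expected by fmincon.

% Sketch: sweep the discretization parameters fdJ and fdM and record the
% obtained minimum and the execution time for three fmincon algorithms.
% adm_objective, adm_constraints, x0, lb, ub are hypothetical placeholders.
algs = {'active-set', 'sqp', 'interior-point'};
fdJs = 3:2:15;                      % candidate values for fdJ
fdMs = 2:2:10;                      % candidate values for fdM
fmin  = zeros(numel(fdJs), numel(fdMs), numel(algs));
ttime = zeros(size(fmin));

for a = 1:numel(algs)
    opts = optimset('Algorithm', algs{a}, 'Display', 'off');
    for i = 1:numel(fdJs)
        for j = 1:numel(fdMs)
            obj = @(x) adm_objective(x, fdJs(i), fdMs(j));     % obj. 63
            con = @(x) adm_constraints(x, fdJs(i), fdMs(j));
            tic;
            [~, fval] = fmincon(obj, x0, [], [], [], [], lb, ub, con, opts);
            ttime(i, j, a) = toc;
            fmin(i, j, a)  = fval;
        end
    end
end
% Plotting fmin and ttime against fdJ and fdM shows where the minimum
% stabilizes and how the execution time grows.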

These results are assumed to be the same regardless of the objective function, since the parameters only affect the computation of the temperature field. However, the same test was performed for objective function 71. It confirms the assumption, but also shows that the objective functions take different amounts of time to optimize. Comparing the results from figure 40 with the results for objective function 71, graphed in appendix B.2.1, the execution time is nearly 50% lower than for obj. 63.

Finally, a variation of the number of disks in the deformation zone and the interpass zone was conducted. As for the discretization parameters, an increasing number of disks led to convergence to a single value. The final number of disks is set to 2 in the deformation zone and 60 in the interpass zone.


B.2 Convergence study - method and results

The first test is done by supplying 500 random initial points and observing the distribution of found local minima. The initial points are with high probability not feasible, due to the complex non-linear optimization constraints. The results, found in figure 41, show that about 15-20% of the initial points fail to converge. This means that the model computes non-physical results and hence stops fmincon. Failed points terminate at the first iteration of fmincon and are hence not time-consuming.
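The procedure can be sketched in MATLAB as follows: random initial points are drawn uniformly between the bounds, each point is passed to fmincon, and the outcome is classified from the exit flag and the final objective value. The names adm_objective, adm_constraints, lb, ub and the reference value fGlobal are hypothetical placeholders for the ADM interface, and the tolerance used to accept a point as the global minimum is an assumption.

% Sketch: classify fmincon outcomes for N random initial points drawn
% uniformly between the lower and upper bounds (lb, ub).
% adm_objective, adm_constraints, lb, ub, fGlobal are hypothetical placeholders.
N    = 500;
nvar = numel(lb);
X0   = repmat(lb(:)', N, 1) + rand(N, nvar) .* repmat((ub(:) - lb(:))', N, 1);
opts = optimset('Algorithm', 'sqp', 'Display', 'off');

result = zeros(N, 1);              % 0 = failed, 1 = partly converged, 2 = global
iters  = zeros(N, 1);              % iterations used by each run
tol    = 1e-3;                     % assumed tolerance for matching the global minimum

for k = 1:N
    [~, fval, exitflag, output] = fmincon(@adm_objective, X0(k, :)', ...
        [], [], [], [], lb, ub, @adm_constraints, opts);
    iters(k) = output.iterations;
    if exitflag < 0
        result(k) = 0;             % model failure or infeasible point (assumed classification)
    elseif abs(fval - fGlobal) < tol
        result(k) = 2;             % converged to the global minimum
    else
        result(k) = 1;             % converged, but to another point
    end
end
disp(histc(result, 0:2)')          % counts of failed / partial / global outcomes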

Figure 41: Convergence study of 500 initial points run for two algorithms, Active-set and SQP, and for two objective functions, 71 and 63. Points that failed (blue) are roughly 20%, points that partly converged (red) 2-52%, and points that converged to the global minimum (green) 29-85%.

Then there are points that converge to the global minimum and points that converge but terminate before a possible local minimum is found. Here the algorithm largely affects the distribution. It can be concluded that SQP in general has a higher probability of finding the global minimum (85%, 79%) than AS (29%, 75%); this is especially true when comparing objective functions.

In an attempt to understand why certain initial points fail, a heat map over the x-component values was created, see figure 42. The heat maps illustrate frequent x-component values for failed points and for points that converge to the global minimum. The shape of the heat maps is due to the boundaries, as seen in figure 13.

Figure 42: Heat map of all the initial points x between the upper and lower bounds. The color correlates to the number of x-vectors with the same x-component value. A comparison between initial points that converged to the global optimum (left) and points that did not converge (right) shows no clear relation between x-component value and convergence.

The x-component values are scattered throughout the bounds, and no apparent relationship between the initial point and failed or global convergence can be seen. However, there is a slight tendency that initial points in the x-component range 13-22 (gap size), with values close to one of the boundaries, will fail. This can be seen as the light blue/cyan boxes in the right heat map in figure 42.
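Such a heat map could, for example, be built by binning each x-component of the initial points between its bounds and displaying the counts with imagesc. The sketch below assumes Xconv holds the converged (or, analogously, the failed) initial points as rows; Xconv, lb and ub are hypothetical placeholder names.

% Sketch: heat map of x-component values for a set of initial points.
% Xconv (rows = points, columns = x-components), lb, ub are hypothetical placeholders.
nbins  = 20;
nvar   = numel(lb);
counts = zeros(nbins, nvar);
for c = 1:nvar
    centers      = linspace(lb(c), ub(c), nbins);     % bin centers between the bounds
    counts(:, c) = hist(Xconv(:, c), centers)';       % frequency of values per bin
end
imagesc(counts);
axis xy;                                % put the lower bound at the bottom
colorbar;
xlabel('x-component index');
ylabel('value bin (lower to upper bound)');
title('Frequency of x-component values');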

The number of iterations required to reach a minimum may vary between initial points, and the choice of algorithm may be important. By analysing the data from figure 41, results were found indicating that obj. 71 and obj. 63 respond differently to a random set of initial points. For obj. 71 the average number of iterations is about 25. However, for certain initial points the required number of iterations could drastically increase, up to 100 iterations, as seen in figure 43. Obj. 63 has an average of 50 to 60 iterations, depending on the algorithm used. The function does not exhibit spikes in the number of iterations, but has a higher spread in the number of iterations required. Figures illustrating the results for obj. 63 are found in appendix B.2.1.
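The iteration counts recorded in the convergence-study sketch above (the placeholder vector iters) can be inspected directly to expose such 'problem points', for example:

% Sketch: visualize per-run iteration counts to spot 'problem points'
% where fmincon needs many more iterations than average.
plot(iters, 'o-');
xlabel('run index');
ylabel('fmincon iterations');
title('Iterations per initial point');
fprintf('mean iterations: %.1f, max: %d\n', mean(iters), max(iters));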

B.2.1 Parameter study - Additional plots


Figure 43: Illustration of 'problem points', where the number of iterations required (y-axis) by fmincon 'spikes' for certain runs (x-axis).

Figure 44: Varying the discretization parameters fdJ and fdM. The obtained minimum converges with increasing discretization.


Figure 45: The number of iterations required (y-axis, left plot) for 400 fmincon runs with Active-set. Corresponding objective function (y-axis, right plot).

Figure 46: The number of iterations required (y-axis, left plot) for 400 fmincon runs with SQP. Corresponding objective function (y-axis, right plot).
