Data Center Workload Measurement and Analysis


Page 1: Data Center Workload Measurement and Analysis

Presented by:

-Ankita Duggal

-Gurkamal Deep Singh Rakhra

-Keerthana Muniraj

-Preeti Sawant

Data Center Workload Measurement and Analysis

Page 2: Data Center Workload Measurement and Analysis

What is a Data Center?

• A large group of networked computer servers typically used by organizations for the remote storage, processing, or distribution of large amounts of data.

• It doesn’t house only servers; it also contains backup power supplies, communication connections, air conditioning, fire suppression, etc.

•“A data center is a factory that transforms and stores bits”

Page 3: Data Center Workload Measurement and Analysis

A few glimpses of data centers from a few organizations…

Rackspace – Richardson, TX          Facebook – Lulea, Sweden

Google – Douglas County, Georgia          Amazon – Virginia, outside Washington, D.C.

Page 4: Data Center Workload Measurement and Analysis

Google’s floating data center          Aliyun (Alibaba) – Hangzhou, China

Page 5: Data Center Workload Measurement and Analysis

Data Center workload

• The amount of processing that a computer has been given to do at a given time.

• Workload, in the form of web requests, data analysis, multimedia rendering, or other applications, is placed in the data center.

Ref: http://searchdatacenter.techtarget.com/definition/workload

Page 6: Data Center Workload Measurement and Analysis

Classification of workloads based on time criticality

Critical workloads: “cannot tolerate even a few minutes of downtime.”

Non-critical workloads: can tolerate a wide range of outage times.

Page 7: Data Center Workload Measurement and Analysis

Ways to improve data protection

• Prevent downtime by reducing resource contention: managers accommodate drastically changing workload demands by allowing easy creation of additional workloads without changing or customizing applications.

• Replicate workloads into the cloud to create asymmetric “hot backups”: clone the complete workload stack and import it into a public/private cloud.

• Use dissimilar infrastructure for off-premises redundancy: workloads are replicated off-site to different cloud providers.

• Reserve failover/failback for critical workloads: automate the switching of users or processes from production to recovery instances.

Page 8: Data Center Workload Measurement and Analysis

Characterizing Data Analysis workloads in Data Centers

• Data analysis is important for improving the future performance of data centers.

• Data center workloads fall into two groups: services workloads (web search, media streaming) and data analysis workloads (business intelligence, machine learning).

• We concentrate on internet services workloads here.

• Data analysis workloads are diverse in speedup performance and micro-architectural characteristics. Therefore, there is a need to analyze many applications

• Three important application domains in internet services are: 1) search engines, 2) social networks, 3) electronic commerce.

Page 9: Data Center Workload Measurement and Analysis

Workload requirements:

1) Cover the most important application domains
2) Data is distributed; it cannot be processed on a single node
3) Consider recently used data

Page 10: Data Center Workload Measurement and Analysis

Breakdown of Executed Instructions

Page 11: Data Center Workload Measurement and Analysis

DCBench :

• Benchmarks are used to evaluate the benefit of new designs and systems.

• DCBench is a benchmark suite for data center computing, with an open source license.

• Includes online and offline workloads.

• Includes different programming models, e.g., MPI versus MapReduce.

• Helpful for performing architecture research and small- to medium-scale system research for data center computing.

Page 12: Data Center Workload Measurement and Analysis

Methodologies

Page 13: Data Center Workload Measurement and Analysis

Workflow Phases

Extract: looks for raw data and generates a stream of data.

Partition: divides the stream into buckets.

Aggregate: combines/reduces the bucketed data.
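These three phases can be sketched as a toy pipeline; the word-count job below is invented purely for illustration:

```python
# Toy sketch of the extract / partition / aggregate workflow phases,
# using an invented word-count job over raw log lines.

def extract(raw_lines):
    """Extract: scan the raw data and generate a stream of records."""
    for line in raw_lines:
        for word in line.split():
            yield word.lower()

def partition(stream, n_buckets):
    """Partition: divide the stream into buckets by hashing each record."""
    buckets = [[] for _ in range(n_buckets)]
    for record in stream:
        buckets[hash(record) % n_buckets].append(record)
    return buckets

def aggregate(bucket):
    """Aggregate: combine/reduce the records in one bucket."""
    counts = {}
    for record in bucket:
        counts[record] = counts.get(record, 0) + 1
    return counts

raw = ["the quick fox", "the slow fox"]
buckets = partition(extract(raw), 4)
merged = {}
for b in buckets:
    for k, v in aggregate(b).items():
        merged[k] = merged.get(k, 0) + v
```

Each bucket can be aggregated by a different server; merging the per-bucket results gives the final answer regardless of how the hash spread the records.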

Page 14: Data Center Workload Measurement and Analysis

Patterns comprising traffic in Data Center

Work-seeks-bandwidth

Scatter gather pattern

Page 15: Data Center Workload Measurement and Analysis

Work-seeks-bandwidth

• Chip designers prefer placing components that interact often (e.g., CPU and L1 cache, multiple CPU cores) close together to get high-bandwidth interconnections on the cheap.

• Similarly, jobs that rely on heavy traffic exchanges with each other are placed in areas of the data center where high network bandwidth is available.

Page 16: Data Center Workload Measurement and Analysis

Contd..

This translates to the engineering decision of placing jobs within the same server, within servers on the same rack, or within servers in the same VLAN, and so on, in decreasing order of preference; hence the work-seeks-bandwidth pattern.
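The preference order can be sketched as a simple ranking; the tier names follow the text, while the code itself is illustrative:

```python
# Illustrative ranking of job-placement localities, from the
# work-seeks-bandwidth pattern: closer placement means more bandwidth.
PREFERENCE = ["same_server", "same_rack", "same_vlan", "elsewhere"]

def placement_rank(locality):
    """Lower rank = more preferred (higher expected bandwidth)."""
    return PREFERENCE.index(locality)

ranks = [placement_rank(p) for p in PREFERENCE]
```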

Page 17: Data Center Workload Measurement and Analysis

Scatter-gather pattern

• Data is partitioned into small chunks, each of which is worked on by different servers, and the resulting answers are later aggregated.
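A minimal sketch of scatter-gather, with worker threads standing in for the different servers (all names and numbers here are illustrative):

```python
# Minimal scatter-gather sketch: chunks are fanned out to worker
# threads (standing in for servers), partial answers are gathered
# and combined into the final result.
from concurrent.futures import ThreadPoolExecutor

def worker(chunk):
    # Each "server" works on its own small chunk of the data.
    return sum(chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]  # scatter

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(worker, chunks))  # one chunk per worker

total = sum(partials)  # gather: aggregate the partial answers
```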

Page 18: Data Center Workload Measurement and Analysis

Congestion

• Periods of low network utilization indicate either an application that demands more of other resources (CPU, disk) than of the network, or an application that can be rewritten to make better use of the available bandwidth.

Page 19: Data Center Workload Measurement and Analysis

Evacuation event (congestion)

• When a server repeatedly experiences problems, the automated management system in the cluster evacuates all the usable blocks on that server prior to alerting a human that the server is ready to be re-imaged.

Page 20: Data Center Workload Measurement and Analysis

Read failure

• When a job does not make any progress (e.g., it is unable to find input data or unable to connect to a machine), it is killed.

Page 21: Data Center Workload Measurement and Analysis

Contd.

• To attribute network traffic to the applications that generate it, the network event logs were merged with application-level logs that describe which job and phase (e.g., map, reduce) were active at that time. The results showed that jobs in the reduce phase are responsible for a fair amount of the network traffic.

• Note that in the reduce phase of a map-reduce job, the data in each partition that is present at multiple servers in the cluster (e.g., all personnel records that start with ‘A’) has to be pulled to the server that handles the reduce for that partition.

Page 22: Data Center Workload Measurement and Analysis

Monitoring Data Center Workload

• For coordinated monitoring and control of data centers, the most common approaches are based on Monitor, Analyze, Plan, and Execute (MAPE) control loops.
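A MAPE loop can be sketched schematically; the sensor readings, the utilization threshold, and the "migrate_load" action below are all invented for illustration:

```python
# Schematic MAPE (Monitor, Analyze, Plan, Execute) control loop.
# Sensor names, the 0.8 threshold, and the action taken are assumptions.

def monitor(sensors):
    # Monitor: collect current readings from the instrumentation.
    return {name: read() for name, read in sensors.items()}

def analyze(readings, threshold=0.8):
    # Analyze: find attributes that exceed the utilization threshold.
    return [name for name, value in readings.items() if value > threshold]

def plan(hot_nodes):
    # Plan: decide on actions for the overloaded resources.
    return [("migrate_load", node) for node in hot_nodes]

def execute(actions, actuator):
    # Execute: hand each action to an actuator / external interface.
    for action in actions:
        actuator(action)

log = []
sensors = {"rack1_cpu": lambda: 0.95, "rack2_cpu": lambda: 0.40}
execute(plan(analyze(monitor(sensors))), log.append)
```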

Overview

Page 23: Data Center Workload Measurement and Analysis

Modern Data Center Operation

• Workload, in the form of web requests, data analysis, etc., is placed in the data center.

• An instrumentation infrastructure logs sensor readings.

• The results are fed into a policy engine that creates a plan to utilize resources.

• External interfaces or Actuators implement the plan.

Page 24: Data Center Workload Measurement and Analysis

Workload Monitoring using Splice

• Splice aggregates sensor and performance data in a relational database.

• It also gathers data from many sources through different interfaces with different formats.

• Splice uses a change-of-value filter that retains only those values that differ significantly from the previously logged values.

• This reduces data volume with minimal loss of information.
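A change-of-value filter of this kind can be sketched as follows; the deadband value and the sample readings are assumptions:

```python
# Sketch of a change-of-value (COV) filter in the spirit of Splice:
# a reading is logged only when it differs from the last logged value
# by more than a deadband. The 0.5 deadband is an invented example.

def cov_filter(readings, deadband=0.5):
    logged = []
    last = None
    for value in readings:
        if last is None or abs(value - last) > deadband:
            logged.append(value)
            last = value
    return logged

samples = [20.0, 20.1, 20.2, 21.0, 21.1, 19.0]
filtered = cov_filter(samples)  # small jitter is dropped, jumps are kept
```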

Page 25: Data Center Workload Measurement and Analysis

Database Schema Of Splice

Page 26: Data Center Workload Measurement and Analysis

Implementation

• Splice uses a change-of-value filter that retains only those values that differ significantly from the previously logged values.

• This reduces data volume with minimal loss of information.

Page 27: Data Center Workload Measurement and Analysis

Analysis

• Data analysis is done by two main classes: attribute behavior and correlation.

• Attribute behavior describes the value of the observed readings and how those values change over time.

• Data correlation methods determine the strength of the correlations among the attributes affecting each other.
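Both analysis classes can be illustrated on a toy trace; the readings below are made up:

```python
# Toy illustration of the two analysis classes: attribute behavior
# (how a reading's value changes over time) and correlation strength
# between two attributes. All readings here are invented.
import statistics

cpu_util = [0.2, 0.4, 0.6, 0.8]       # CPU utilization over time
inlet_temp = [21.0, 22.0, 23.0, 24.0]  # inlet temperature over time

# Attribute behavior: a simple summary of how the value changed.
trend = cpu_util[-1] - cpu_util[0]

def pearson(xs, ys):
    """Correlation strength: Pearson coefficient, computed by hand."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(cpu_util, inlet_temp)  # perfectly linear data, so r is ~1.0
```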

Page 28: Data Center Workload Measurement and Analysis

Virtualization in Data Centers

• Virtualization is a combination of software and hardware features that creates virtual CPUs (vCPU) or virtual systems-on-chip (vSoC).

• Virtualization provides the required level of isolation and partitioning of resources.

• Each VM is protected from interference from another VM.

Reference: Multicore Processing: Virtualization and Data Center By: Syed Shah, Nikolay Guenov

Page 29: Data Center Workload Measurement and Analysis

Why Virtualization

• Reduces power consumption and building space, provides high availability for critical applications, and streamlines application deployment and migration.

• Supports multiple operating systems and consolidation of services on a single server by defining multiple VMs.

• Multiple VMs can run on a single server; the advantages are reduced server inventory and better server utilization.

Reference: Multicore Processing: Virtualization and Data Center By: Syed Shah, Nikolay Guenov

Page 30: Data Center Workload Measurement and Analysis

Benefits Of Virtualization

Reference: Multicore Processing: Virtualization and Data Center By: Syed Shah, Nikolay Guenov

Page 31: Data Center Workload Measurement and Analysis

Multi Core Processing

• A multi-core processor is a single computing component with two or more independent actual processing units (called "cores"), which are the units that read and execute program instructions.

Reference: Multicore Processing: Virtualization and Data Center By: Syed Shah, Nikolay Guenov

Page 32: Data Center Workload Measurement and Analysis

Virtualization and Multicore Processing

• With multicore SoCs, given enough processing capacity and virtualization, control plane applications and data plane applications can be run without one affecting the other.

• Data or control traffic that is relevant to the customized application and operating system (OS) can be directed to the appropriate virtualized core without impacting or compromising the rest of the system.

Reference: Multicore Processing: Virtualization and Data Center By: Syed Shah, Nikolay Guenov

Page 33: Data Center Workload Measurement and Analysis

Control and Data Plane Application Consolidation in virtualized Multicore SoC

Page 34: Data Center Workload Measurement and Analysis

• Functions that were previously implemented on different boards now can be consolidated onto a single card and a single multicore SoC.

Reference: Multicore Processing: Virtualization and Data Center By: Syed Shah, Nikolay Guenov

Page 35: Data Center Workload Measurement and Analysis

Data Center Reliability

Network reliability:

• Characterizing the most failure-prone network elements

• Estimating the impact of failures

• Analyzing the effectiveness of network redundancy

Reference: Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications By: Phillipa Gill, Navendu Jain, Microsoft Research

Page 36: Data Center Workload Measurement and Analysis

Key Observations

• Data center networks are reliable

• Low-cost, commodity switches are highly reliable

• Load balancers experience a high number of software faults

• Failures potentially cause loss of a large number of small packets.

• Network redundancy helps, but it is not entirely effective

Reference: Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications By: Phillipa Gill, Navendu Jain Microsoft Research

Page 37: Data Center Workload Measurement and Analysis

Reasons to change from traditional

Significant changes in computing power, network bandwidth, and network file system usage

• Network file system workloads

• No CIFS protocol studies

• Limited file system workloads

Reference: Measurement and Analysis of Large-Scale Network File System Workloads by Andrew W. Leung, Shankar Pasupathy, Garth Goodson, Ethan L. Miller

Page 38: Data Center Workload Measurement and Analysis

Analysis

• File access patterns: read only, write only, and read and write.

Reference: Measurement and Analysis of Large-Scale Network File System Workloads by Andrew W. Leung, Shankar Pasupathy, Garth Goodson, Ethan L. Miller

Page 39: Data Center Workload Measurement and Analysis

• Sequentiality analysis: sequential accesses may be entire (the whole file) or partial.

Reference: Measurement and Analysis of Large-Scale Network File System Workloads by Andrew W. Leung, Shankar Pasupathy, Garth Goodson, Ethan L. Miller

Page 40: Data Center Workload Measurement and Analysis

File Lifetime

• In CIFS, files can be deleted either through an explicit delete request, which frees the entire file and its name, or through truncation, which frees only the data.

• CIFS users begin a connection to the file server by creating an authenticated user session and end by eventually logging off.

Reference: Measurement and Analysis of Large-Scale Network File System Workloads by Andrew W. Leung, Shankar Pasupathy, Garth Goodson, Ethan L. M.

Page 41: Data Center Workload Measurement and Analysis

Architecture

Load Balancer

• The IP address to which requests are sent is called a virtual IP address (VIP); the IP addresses of the servers over which the requests are spread are known as direct IP addresses (DIPs).

• Inside the data center, requests are spread among a pool of front-end servers that process the requests. This spreading is typically performed by a specialized load balancer.

Reference: Towards a Next Generation Data Center Architecture: Scalability and Commoditization By Albert Greenberg, David A. Maltz Microsoft Research, WA, USA
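The VIP-to-DIP spreading can be sketched as a hash-based picker; the addresses and the hashing choice are assumptions for illustration, not the paper's actual mechanism:

```python
# Hypothetical sketch of VIP-to-DIP spreading: requests arrive at one
# virtual IP and are hashed across the direct IPs of the server pool.
# The addresses and the SHA-256 choice are invented for illustration.
import hashlib

DIPS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # server pool behind the VIP

def pick_dip(client_ip, dips=DIPS):
    # Hash the client address so the same client maps to the same server.
    digest = hashlib.sha256(client_ip.encode()).digest()
    return dips[int.from_bytes(digest, "big") % len(dips)]

dip = pick_dip("198.51.100.7")
```

Hashing on the client address keeps a client's requests on one server, which matters when the front-end keeps per-client state.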

Page 42: Data Center Workload Measurement and Analysis

Challenges and Requirements

Challenges:

• Fragmentation of resources

• Poor server to server connectivity

• Proprietary hardware that scales up, not out

Requirements:

• Placement anywhere

• Server to server bandwidth

• Commodity hardware that scales out

• Support 100,000 servers

Reference: Towards a Next Generation Data Center Architecture: Scalability and Commoditization By Albert Greenberg, David A. Maltz Microsoft Research, WA, USA

Page 43: Data Center Workload Measurement and Analysis

Load Balancing

• Load spreading: requests are spread evenly over a pool of servers.

• Load balancing: load balancers are placed in front of the actual servers.

Reference: Towards a Next Generation Data Center Architecture: Scalability and Commoditization By Albert Greenberg, David A. Maltz Microsoft Research, WA, USA

Page 44: Data Center Workload Measurement and Analysis

Case studies

Page 45: Data Center Workload Measurement and Analysis

– a few real-world scenarios

Why build a Data center at Virginia when there is one at California?

• Reduce the time to send a page to users on the East Coast

• California – running out of space; Virginia – lots of room to grow

• Restricting to one data center meant that in the event of a disaster (earthquake, power failure, Godzilla), Facebook could be unusable for an extended amount of time.

Page 46: Data Center Workload Measurement and Analysis

The hardware and network were soon set up… but how to handle cache consistency?

(Diagram: master DB replicating to slave databases)

Page 47: Data Center Workload Measurement and Analysis

Facebook’s Scheduling with Corona

• With Facebook’s user base expanding at an enormous rate, a new scheduling framework called Corona was developed.

• Initially, a MapReduce implementation of Apache Hadoop served as the foundation of the infrastructure, but over the years this system developed several issues:

• Scheduling overhead
• Pull-based scheduling model
• Static slot-based resource management model

Page 48: Data Center Workload Measurement and Analysis

Facebook’s Solution

• Corona introduces a cluster manager whose only purpose is to track the nodes in the cluster and the amount of free resources.

• Corona uses push based scheduling. This reduces scheduling latency.

• The separation of duties allows Corona to manage a lot more jobs and achieve better cluster utilization.

• The cluster manager also implements fair-share scheduling.
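A toy push-based scheduler in the spirit of Corona's cluster manager; the class, node names, and slot counts below are all invented:

```python
# Illustrative push-based scheduler: the manager only tracks nodes and
# free capacity, and pushes each task to a node immediately instead of
# waiting for workers to pull work on heartbeats. Names are invented.

class ClusterManager:
    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)  # node -> free resource count
        self.assignments = []

    def submit(self, task):
        # Push-based: assign at once to the node with the most free slots,
        # which reduces scheduling latency versus pull-based polling.
        node = max(self.free_slots, key=self.free_slots.get)
        if self.free_slots[node] == 0:
            raise RuntimeError("cluster full")
        self.free_slots[node] -= 1
        self.assignments.append((task, node))

cm = ClusterManager({"node-a": 2, "node-b": 1})
for task in ["map-1", "map-2", "reduce-1"]:
    cm.submit(task)
```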

Page 49: Data Center Workload Measurement and Analysis

Future of Corona

• Resource-based scheduling instead of the slot-based model
• Online upgrades to the cluster manager
• Expansion of the user base by scheduling applications such as Peregrine

Page 50: Data Center Workload Measurement and Analysis

Characterizing Backend Workloads (at Google)

Ref: Towards Characterizing Cloud Backend Workloads: Insights from Google Compute Clusters (Asit K. Mishra Joseph L. Hellerstein Walfredo Cirne Chita R. Das)

Page 51: Data Center Workload Measurement and Analysis

Pre-requisites

• Capacity planning, to determine which machine resources must grow and by how much

• Task scheduling, to achieve high machine utilization and to meet service-level objectives

• Both of these require a good understanding of task resource consumption, i.e., CPU and memory usage.

Page 52: Data Center Workload Measurement and Analysis

The approaches

1. Make each task its own workload: scales poorly, since tens of thousands of tasks execute daily on Google compute clusters.

2. View all tasks as belonging to one single workload: results in large variances in predicted resource consumption.

Page 53: Data Center Workload Measurement and Analysis

The proposed methodology

• Identifying the workload dimensions

• Constructing task classes using an off-the-shelf algorithm such as k-means

• Determining the break points for qualitative coordinates within the workload dimensions

• Merging adjacent task classes to reduce the number of workloads
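The clustering step can be illustrated with a tiny one-dimensional k-means; a real analysis would cluster on all workload dimensions with a library implementation, and the data points here are made up:

```python
# Tiny 1-D k-means sketch of the task-classification step. A real
# analysis would cluster on (duration, cores, memory) together; the
# durations below are invented.

def kmeans_1d(points, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        # Assign each point to its nearest center.
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

durations = [0.1, 0.2, 0.15, 8.0, 9.5, 10.0]  # task durations (hours)
centers, clusters = kmeans_1d(durations, [0.0, 5.0])
```

Even this toy run separates the short-duration tasks from the long-duration ones, which is the bimodal split the slides describe.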

Page 54: Data Center Workload Measurement and Analysis

Based on

• The duration of task executions is bimodal: tasks have either a short duration or a long duration.

• Most tasks have short durations.

• Most resources are consumed by a few long-duration tasks with large demands for CPU and memory.

Page 55: Data Center Workload Measurement and Analysis

Objective

• construct a small number of task classes such that tasks within each class have similar resource usage.

• We use qualitative coordinates to distinguish workloads: small (s), medium (m), large (l).

Page 56: Data Center Workload Measurement and Analysis
Page 57: Data Center Workload Measurement and Analysis

First step

• Identify the workload dimensions.

• For example, in the analysis of the Google cloud backend, the workload dimensions are task duration, average core usage, and average memory usage.

Page 58: Data Center Workload Measurement and Analysis

Second step

• Construct preliminary task classes that have fairly homogeneous resource usage. This is done by using the workload dimensions as a feature vector and applying an off-the-shelf clustering algorithm such as k-means.

Page 59: Data Center Workload Measurement and Analysis

Third step

• Determine the break points for the qualitative coordinates of the workload dimensions. There are two considerations. First, break points must be consistent across workloads; for example, the qualitative coordinate small for duration must have the same break point (e.g., 2 hours) for all workloads. Second, the result should produce low within-class variability.
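Consistent break points can be sketched as a shared bucketing function; the 2-hour break point follows the example in the text, while the 12-hour one is an invented placeholder:

```python
# Sketch of consistent break points: every workload maps a raw value
# onto the same qualitative coordinates. The 2-hour break point comes
# from the slide's example; the 12-hour break point is invented.

BREAKS = {"duration": (2.0, 12.0)}  # hours: s < 2 <= m < 12 <= l

def qualitative(dim, value):
    small, large = BREAKS[dim]
    if value < small:
        return "s"
    return "m" if value < large else "l"

labels = [qualitative("duration", h) for h in (0.5, 6.0, 24.0)]
```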

Page 60: Data Center Workload Measurement and Analysis

Fourth step

Merge classes to form the final set of task classes; these classes define our workloads. This involves combining “adjacent” preliminary task classes, where adjacency is based on the qualitative coordinates of the class. For example, in the Google data, duration has qualitative coordinates small and large; for cores and memory, the qualitative coordinates are small, medium, and large. Thus, the workload smm is adjacent to sms and sml in the third dimension. Two preliminary classes are merged if the CV (coefficient of variation) of the merged class does not differ much from the CVs of each of the preliminary classes. Merged classes are denoted by the wildcard “*”; for example, merging the classes sms, smm, and sml yields the class sm*.
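The merge test can be sketched with the coefficient of variation; the tolerance and the sample usage numbers below are invented:

```python
# Sketch of the class-merge test: two adjacent classes are combined
# when the CV (std / mean) of the union stays close to the CVs of the
# parts. The 0.10 tolerance and the usage samples are invented.
import statistics

def cv(xs):
    """Coefficient of variation of a sample."""
    return statistics.pstdev(xs) / statistics.mean(xs)

def can_merge(a, b, tol=0.10):
    merged = cv(a + b)
    return abs(merged - cv(a)) < tol and abs(merged - cv(b)) < tol

sms = [1.0, 1.1, 0.9]    # e.g. core usage of tasks in class "sms"
smm = [1.0, 0.95, 1.05]  # class "smm", with a very similar spread

merged_ok = can_merge(sms, smm)  # similar spreads, so merging is allowed
```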

Page 62: Data Center Workload Measurement and Analysis

Questions?