DCBench: a Data Center Benchmark Suite. Zhen Jia (贾禛), http://prof.ict.ac.cn/zhenjia/, Institute of Computing Technology, Chinese Academy of Sciences. 2nd BPOE workshop, in conjunction with CCF HPC China 2013. October 31, 2013, Guilin.


INSTITUTE OF COMPUTING TECHNOLOGY

DCBench: a Data Center Benchmark Suite

Zhen Jia (贾禛)
http://prof.ict.ac.cn/zhenjia/

Institute of Computing Technology, Chinese Academy of Sciences

2nd BPOE workshop, in conjunction with CCF HPC China 2013

October 31, 2013, Guilin


HPC China 2013, 2nd BPOE

Workload Spectrum

[Figure from Intel: workload spectrum covering CPU-intensive, memory-intensive, and I/O-intensive workloads]


Workload Spectrum

Data Centers


Why Benchmarking?

• Sometimes there is a solution.


Why Benchmarking?

• What about the solution when …


Benchmark’s Role in Computer Science 

“Benchmarking is the quantitative foundation of computer system and architecture research; benchmarks are used to experimentally determine the benefits of new designs.”

-- C. Bienia, S. Kumar, J. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. PACT 2008


State‐of‐Practice Benchmark Suites

• SPEC CPU
• SPEC Web
• HPCC
• PARSEC
• TPC-C
• YCSB
• GridMix


Data Centers [1]

• Distinguishing features:
  – Massive scale
  – Mixed workloads

• Workload classification:
  – Online services (service), e.g. Web search
  – Offline processing (data analysis), e.g. MapReduce programs

[1] Barroso et al., “The Datacenter as a Computer”, 2009


Previous Work

CloudSuite (Ferdman et al., “Clearing the Clouds”, ASPLOS 2012)
– Six scale-out workloads:
  • Web search
  • Web serving
  • Media streaming
  • Data serving
  • Data analytics (Bayes)
  • Software testing
– Only Data analytics (Bayes) is a data analysis workload; the others are service workloads

CloudSuite leans toward service workloads!


Scale‐out Performance of Data Analysis Workloads 

• Speed-up

[Figure: speed-up curves of data analysis workloads, including the CloudSuite data analytics workload (Bayes)]

Data analysis workloads are diverse!
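The slide does not define speed-up; a common reading is the run time on the smallest cluster divided by the run time on a larger one. A minimal sketch under that assumption follows. The function names and the timing values are hypothetical and are not taken from the talk.

```python
def speedup(baseline_seconds: float, scaled_seconds: float) -> float:
    """Speed-up of a scaled-out run relative to the baseline run."""
    if baseline_seconds <= 0 or scaled_seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / scaled_seconds


def speedup_curve(runtimes_by_nodes: dict[int, float]) -> dict[int, float]:
    """Speed-up at each cluster size, relative to the smallest cluster measured."""
    base_nodes = min(runtimes_by_nodes)
    base_time = runtimes_by_nodes[base_nodes]
    return {
        nodes: speedup(base_time, seconds)
        for nodes, seconds in sorted(runtimes_by_nodes.items())
    }


# Hypothetical run times in seconds, keyed by node count (not measured data).
example_runtimes = {1: 800.0, 2: 400.0, 4: 250.0, 8: 200.0}
curve = speedup_curve(example_runtimes)
```

Plotting one such curve per workload would show how far each one departs from linear scaling, which is the kind of comparison behind the claim that data analysis workloads behave differently from one another.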


Content

• Background and Motivation

• DCBench

• Workloads Characterization 


DCBench

Workloads:
• VM Operation
• Scale-out Service
• Data Analysis

Web site: http://prof.ict.ac.cn/DCBench/

Released in July 2013


Methodology for Choosing Workloads

• Step 1: Rank main websites and web services according to page views and daily visitors

• Step 2: Decompose the service programs into algorithms and basic operations 

• Step 3: Select algorithms and basic operations according to their popularity


Step 1: Ranking 


Top Sites on the Web

[Pie chart: Search Engine 40%, Social Network 25%, Electronic Commerce 15%, Media Streaming 5%, Others 15%]

More details at http://www.alexa.com/topsites/global;0


Step 2: Decomposing


Algorithms in Search Engine

• Graph mining
• Grep & segmentation
• PageRank
• Word count
• Sort
• Vector calculation

Figure from “The Anatomy of a Large-Scale Hypertextual Web Search Engine”


Algorithms in Recommendation Sub‐systems


Summary of Anatomy of Common Services

[Pie chart "Top Sites on the Web": Search Engine 40%, Social Network 25%, Electronic Commerce 15%, Media Streaming 5%, Others 15%]

• Algorithms used in search: PageRank, segmentation, feature reduction, grep, statistical counting, sort, recommendation, ...
• Algorithms used in social networks: recommendation, clustering, classification, grep, feature reduction, statistical counting, sort, ...
• Algorithms used in electronic commerce: recommendation, association rule mining, warehouse operation, clustering, classification, statistical counting, ...
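The three-step methodology (rank services, decompose them into algorithms, select by popularity) can be sketched roughly as below. The slides do not say how popularity is scored, so summing the page-view share of every service category that uses an algorithm is an assumption made here for illustration. Pairing the percentages with the categories in legend order is also an assumption, and the function name `rank_algorithms` is hypothetical.

```python
from collections import defaultdict

# Step 1: page-view share (percent) per service category, read from the pie chart.
# Assumption: the percentages pair with the legend entries in the order shown.
CATEGORY_SHARE = {
    "search engine": 40,
    "social network": 25,
    "electronic commerce": 15,
}

# Step 2: algorithms each service category decomposes into, as listed on the slide.
ALGORITHMS = {
    "search engine": [
        "pagerank", "segmentation", "feature reduction", "grep",
        "statistical counting", "sort", "recommendation",
    ],
    "social network": [
        "recommendation", "clustering", "classification", "grep",
        "feature reduction", "statistical counting", "sort",
    ],
    "electronic commerce": [
        "recommendation", "association rule mining", "warehouse operation",
        "clustering", "classification", "statistical counting",
    ],
}


def rank_algorithms(share, algorithms):
    """Step 3 (assumed scoring): sum the share of every category that uses an algorithm."""
    scores = defaultdict(int)
    for category, names in algorithms.items():
        for name in names:
            scores[name] += share[category]
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_algorithms(CATEGORY_SHARE, ALGORITHMS)
```

Under this scoring, algorithms shared by all three categories (recommendation, statistical counting) rank highest, while those used by a single low-share category rank lowest.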


Step 3: Selecting 


Top Operations and Algorithms

[Pie chart "Top Sites on the Web": Search Engine 40%, Social Network 25%, Electronic Commerce 15%, Media Streaming 5%, Others 15%]

• Grep
• PageRank
• Recommendation


Main Algorithms in Data Centers

Data center algorithms:
• Basic operation
• Association rule mining
• Classification
• Cluster
• Recommendation
• Warehouse operation
• Feature reduction
• Graph mining
• Vector calculation
• Segmentation


Overview of DCBench

Category                 Workload                            Programming model  Language  Source
Basic operation          Sort                                MapReduce          Java      Hadoop
                         WordCount                           MapReduce          Java      Hadoop
                         Grep                                MapReduce          Java      Hadoop
Classification           Naive Bayes                         MapReduce          Java      Mahout
                         Support Vector Machine              MapReduce          Java      Implemented by ourselves
Cluster                  K-means                             MapReduce          Java      Mahout
                                                             MPI                C++       IBM PML
                         Fuzzy K-means                       MapReduce          Java      Mahout
                                                             MPI                C++       IBM PML
Recommendation           Item-based Collaborative Filtering  MapReduce          Java      Mahout
Association rule mining  Frequent pattern growth             MapReduce          Java      Mahout
Segmentation             Hidden Markov model                 MapReduce          Java      Implemented by ourselves


Overview of DCBench (Cont'd)

Category                 Workload                            Programming model  Language  Source
Warehouse operation      Database operations                 MapReduce          Java      Hive-bench
Feature reduction        Principal Component Analysis        MPI                C++       IBM PML
                         Kernel Principal Component Analysis MPI                C++       IBM PML
Vector calculation       Paper similarity analysis           All-Pairs          C/C++     Implemented by ourselves
Graph mining             Breadth-first search                MPI                C++       Graph500
                         PageRank                            MapReduce          Java      Mahout
Service                  Search engine                       C/S                Java      Implemented by ourselves
                         Auction                             C/S                Java      RUBiS
                         Media streaming                     C/S                Java      CloudSuite
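Several workloads in the overview, including the basic operations WordCount and Grep, use the MapReduce programming model. The sketch below imitates that model in plain, single-process Python to show the map, group-by-key, and reduce phases. It is not the Hadoop Java code the suite ships, and every name in it (`map_reduce`, `wordcount_mapper`, `make_grep_mapper`, the sample lines) is hypothetical.

```python
import re
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def wordcount_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in the line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Add up the counts emitted for one word."""
    return sum(values)


def make_grep_mapper(pattern):
    """Build a mapper that emits a line only when it matches the pattern."""
    regex = re.compile(pattern)

    def mapper(line):
        if regex.search(line):
            yield pattern, line

    return mapper


# Hypothetical input lines.
lines = ["big data benchmark", "data center benchmark suite", "hello world"]
counts = map_reduce(lines, wordcount_mapper, sum_reducer)
matches = map_reduce(lines, make_grep_mapper(r"bench"), lambda key, values: values)
```

In a real MapReduce framework the grouping step is a distributed shuffle between map and reduce tasks; here a dictionary stands in for it.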


Content

• Background and Motivation

• DCBench

• Workloads Characterization [2]

[2] Zhen Jia et al., “Characterizing Data Analysis Workloads in Data Centers”, IISWC 2013 (Best Paper)


Compared Benchmarks

Field:      Scale-out workloads   HPC           CPU             Web
Workloads:  CloudSuite v1         HPCC          SPEC CPU 2006   SPEC Web 2005
            Web search            HPL           SPEC INT        TPC-W
            Data serving          STREAM        SPEC FP
            Web serving           PTRANS        PARSEC
            Media streaming       RandomAccess
            Software testing      DGEMM
                                  FFT
                                  Comm

• Scale-out service workloads share many characteristics with traditional service workloads.

• So we simply use "service workloads" to refer to both.


Breakdown of Executed Instructions 

• Data analysis workloads execute a higher percentage of application-level instructions
• Service workloads execute a higher percentage of kernel-level instructions

[Figure: stacked bars (0% to 100%) splitting executed instructions into kernel and application, with group labels "Data analysis" and "service". Bars, in order: Naive Bayes, SVM, Grep, WordCount, K-means, Fuzzy K-means, PageRank, Sort, Hive-bench, IBCF, HMM, avg, Software Testing, Media Streaming, Data Serving, Web Search, Web Serving, SPEC Web, TPC-W, SPEC FP, SPEC INT, PARSEC, HPCC-DGEMM, HPCC-FFT, HPCC-HPL, HPCC-PTRANS, HPCC-RandomAccess, HPCC-STREAM]
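The kernel/application split on this slide is a share of executed instructions. A minimal sketch of that calculation follows, assuming the two counts come from hardware counters sampled separately in kernel mode and user mode. The function name and the counter values are hypothetical.

```python
def instruction_breakdown(kernel_instructions: int, application_instructions: int) -> dict[str, float]:
    """Return the kernel and application shares of executed instructions, in percent."""
    total = kernel_instructions + application_instructions
    if total <= 0:
        raise ValueError("at least one instruction must have been counted")
    return {
        "kernel": 100.0 * kernel_instructions / total,
        "application": 100.0 * application_instructions / total,
    }


# Hypothetical counter values, not measurements from the talk.
shares = instruction_breakdown(kernel_instructions=250_000, application_instructions=750_000)
```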


Architecture Block Diagram

Figure from Intel


Pipeline Stalls

• The service workloads have more RAT (Register Allocation Table) stalls
• The data analysis workloads have more RS (Reservation Station) and ROB (ReOrder Buffer) full stalls
• Front end stalls!

[Figure: pipeline stall breakdown per workload, grouped into data analysis and service]


Main reason for pipeline stalls: the memory wall

Figure from: The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms


Reasons for Front End Stalls

• High Icache misses and ITLB misses cause front end stalls

[Figure: L1 Icache misses per K-instruction (y-axis 0 to 100), grouped into data analysis and service]
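Misses per K-instruction (often written MPKI) normalizes a miss count by the number of instructions executed, so workloads of different lengths can be compared. A minimal sketch of the metric is shown below; the same normalization applies to any cache level or TLB. The function name and counter values are hypothetical.

```python
def misses_per_kilo_instruction(misses: int, instructions: int) -> float:
    """Cache or TLB misses per 1000 executed instructions."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    if misses < 0:
        raise ValueError("miss count cannot be negative")
    return 1000.0 * misses / instructions


# Hypothetical counter values: 30,000 L1 Icache misses over 1,000,000 instructions.
l1i_mpki = misses_per_kilo_instruction(misses=30_000, instructions=1_000_000)
```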


L2 Cache Behaviors

• Data analysis workloads have good L2 cache behavior

[Figure: L2 cache misses per K-instruction (y-axis 0 to 100), grouped into data analysis and service]


LLC Behaviors

• Data center workloads:
  – Have good LLC behavior
  – Better than most of the HPC workloads

[Figure: percentage of L2 misses satisfied by L3 (y-axis 0% to 100%)]
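The plotted metric, percentage of L2 misses satisfied by L3, can be derived from two counters. The sketch below assumes that every L2 miss which does not also miss in L3 counts as satisfied by L3; the talk does not state the exact counter events used, so the function name and the inputs are hypothetical.

```python
def l3_satisfied_percentage(l2_misses: int, l3_misses: int) -> float:
    """Percentage of L2 misses that hit in L3 (assumed: L2 misses minus L3 misses)."""
    if l2_misses <= 0:
        raise ValueError("L2 miss count must be positive")
    if not 0 <= l3_misses <= l2_misses:
        raise ValueError("L3 misses must be between 0 and the L2 miss count")
    return 100.0 * (l2_misses - l3_misses) / l2_misses


# Hypothetical counter values, not measurements from the talk.
satisfied = l3_satisfied_percentage(l2_misses=20_000, l3_misses=5_000)
```

A value near 100% means the last-level cache absorbs almost all L2 misses, which is how "good LLC behavior" reads on this chart.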


Branch Prediction

• Data analysis workloads have pretty good branch behavior
• Branches of service workloads are hard to predict

[Figure: branch misprediction ratio (y-axis 0.00% to 12.00%), grouped into data analysis and service]
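The branch misprediction ratio is conventionally the number of mispredicted branches divided by the number of executed branches. A minimal sketch under that assumption is given below, with a hypothetical function name and made-up counter values.

```python
def branch_misprediction_ratio(mispredicted_branches: int, executed_branches: int) -> float:
    """Mispredicted branches as a percentage of all executed branches."""
    if executed_branches <= 0:
        raise ValueError("branch count must be positive")
    if not 0 <= mispredicted_branches <= executed_branches:
        raise ValueError("mispredictions must be between 0 and the branch count")
    return 100.0 * mispredicted_branches / executed_branches


# Hypothetical counter values: 2,000 mispredictions out of 100,000 branches.
ratio_percent = branch_misprediction_ratio(2_000, 100_000)
```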


Some Observations

• Data analysis workloads differ from scale-out service workloads and traditional workloads
• For data analysis workloads, more application-level instructions are executed
• High Icache and ITLB misses
  – Impact: high percentage of front end stalls
  – Cause: massive scale of the software infrastructure, high-level languages, third-party libraries
  – Rethink the design of the Icache or ITLB, or simplify the software stack
• Low-level caches (L2, LLC) work well for data analysis workloads
  – Pay more attention to the area and energy of caches
• The branch predictor is quite effective


More information: http://prof.ict.ac.cn/DCBench/


Backup


Data Center vs. Big Data

[Diagram relating two sets, "Data center" and "Big Data". Labeled regions: Big Data Analytics, Scale-out Service, VM Operation, Data-Intensive HPC]


Each Algorithm's Application Scenarios

Algorithm: application scenarios

• Sort
  – Ranking pages by importance (PageRank)
  – Sorting pages by ID (Web storage in a database)
• WordCount
  – Calculating TF-IDF base information, such as term frequency
  – Counting user operations to analyze social behavior (in Wolfram Alpha)
• Grep
  – Log analysis
  – Web information extraction
  – Fuzzy search
• Naive Bayes
  – Spam recognition (Spam Filtering with Naive Bayes)
  – Bioinformatics (Naïve Bayesian Classifier for Rapid Assignment of RNA Sequences into the New Bacterial Taxonomy)
• Support Vector Machine
  – Classification (question classification)
  – Image processing (image annotation)
  – Text categorization


Each Algorithm's Application Scenarios (Cont'd)

• K-means
  – Image processing (fast image segmentation)
  – High-resolution landform classification
• Item-based Collaborative Filtering
  – Amazon recommender system
• Hidden Markov model
  – Bioinformatics (protein homology detection)
  – Speech recognition, handwriting recognition
  – Word segmentation
• Frequent pattern growth
  – Market analysis
  – Data mining in business (identifying competitive suppliers in supply chain management)
  – Intrusion detection
  – Query recommendation
• Warehouse operation
  – Taobao Yunti system
  – Facebook
  – Yahoo!
• Principal Component Analysis
  – Computer vision
  – Pattern recognition
  – Face representation and recognition