The Google Cluster Architecture
Presented by Fatma Canan Pembe
2004800193
2
PURPOSE
To give an overview of the computer architecture of Google
  one of the most widely known and used search engines today
  how it achieves such processing power under such a heavy workload
3
OUTLINE
Introduction
Cluster architectures
Google architecture overview
Serving a Google query
Design principles of Google clusters
  leveraging commodity parts
  the power problem
  hardware-level characteristics
  memory system
Summary
4
INTRODUCTION
Search engines require a high amount of computation per request
A single query on Google (on average)
  reads hundreds of megabytes of data
  consumes tens of billions of CPU cycles
A peak request stream on Google
  thousands of queries per second
  requires an infrastructure comparable in size to the largest supercomputer installations
5
INTRODUCTION (Cont.)
Google combines more than 15,000 commodity-class PCs
  instead of a smaller number of high-end servers
Most important factors that influenced the design
  energy efficiency
  price-performance ratio
The Google application affords easy parallelization
  different queries can run on different processors
  a single query can use multiple processors, because the overall index is partitioned
6
CLUSTER ARCHITECTURES
Cluster: a collection of independent computers using a switched network to provide a common service
Many mainframe applications are run on such "loosely coupled" machines rather than on shared-memory machines
  databases, file servers, Web servers, simulations, etc.
  often need to be highly available, requiring error tolerance and repairability
  often need to scale
7
DISADVANTAGES OF CLUSTERS
Cost of administration
  administering a cluster of N machines is like administering N independent machines
  vs. administering a shared-address-space N-processor multiprocessor, which is like administering 1 big machine
Clusters are usually connected via the I/O bus, whereas multiprocessors are connected via the memory bus
A cluster of N machines has N independent memories and N copies of the OS
  a shared-address-space multiprocessor instead allows 1 program to use almost all of the memory
8
ADVANTAGES OF CLUSTERS
Error isolation
  a separate address space limits contamination by errors
Repair
  easier to replace a machine without bringing down the system than in a shared-memory multiprocessor
Scale
  easier to expand the system without bringing down the application that runs on top of the cluster
Cost
  a large-scale machine has low volume => fewer machines to spread development costs over
  vs. leveraging high-volume off-the-shelf switches and computers
Amazon, AOL, Google, Hotmail, and Yahoo rely on clusters of PCs to provide services used by millions of people every day
9
GOOGLE ARCHITECTURE OVERVIEW
Reliability is provided at the software level rather than in server-class hardware
  so that commodity PCs can be used to build a cluster at a low price
Design for best aggregate throughput rather than peak server response time
Result: a reliable computing infrastructure built from clusters of unreliable commodity PCs
10
SERVING A GOOGLE QUERY
When the user enters a query
  e.g. www.google.com/search?q=ieee+society
The user's browser performs a Domain Name System (DNS) lookup to map the name to a particular IP address
Multiple Google clusters are distributed worldwide
  each cluster has a few thousand machines to handle the query traffic
11
SERVING A GOOGLE QUERY (Cont.)
The geographically distributed setup protects against catastrophic failures
A DNS-based load-balancing system selects a cluster according to
  the user's geographic proximity
  the available capacity at the various clusters
The user's browser sends an HTTP request to one of the clusters
  thereafter, processing is local to that cluster
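The DNS-based selection described above can be sketched as follows. This is a hypothetical illustration only: the slide does not describe the actual selection policy, so the `Cluster` fields, the 90% capacity threshold, and the distance-first preference are all invented assumptions.

```python
# Hypothetical sketch of DNS-based cluster selection (policy details assumed,
# not taken from the source): prefer the nearest cluster with spare capacity.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    distance_km: float   # rough geographic distance to the user
    capacity: float      # queries/sec the cluster can serve
    load: float          # current queries/sec

def pick_cluster(clusters):
    """Prefer nearby clusters, but skip those with little spare capacity."""
    viable = [c for c in clusters if c.load < 0.9 * c.capacity]
    if not viable:                      # everything overloaded: fall back
        viable = clusters
    return min(viable, key=lambda c: c.distance_km)

clusters = [
    Cluster("us-east", 500, 1000, 950),   # close but nearly saturated
    Cluster("us-west", 4000, 1000, 200),  # farther away, mostly idle
]
print(pick_cluster(clusters).name)  # -> us-west
```

The point of the sketch is that proximity is only a preference: an overloaded nearby cluster is passed over in favor of a farther one with capacity.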
12
SERVING A GOOGLE QUERY (Cont.)
A hardware-based load balancer in each cluster
  monitors the available Google Web Servers (GWSs)
  performs local load balancing of requests
A GWS machine coordinates the query execution and returns the results as an HTML response
13
SERVING A GOOGLE QUERY (Cont.)
14
SERVING A GOOGLE QUERY (Cont.)
Query execution phases
1. The index servers determine the relevant documents by consulting an inverted index
  challenging due to the large amount of data
    raw documents -> several tens of terabytes of data
    inverted index -> many terabytes of data
  fortunately, the search is highly parallelizable by dividing the index into pieces (index shards)
  for each shard, a pool of machines serves requests; a load balancer is employed, improving reliability
2. The document servers determine the actual URLs and query-specific summaries of the found documents
  again, the documents are divided into shards
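Phase 1 above can be illustrated with a toy example: fan the query out over index shards and merge the per-shard hits into one ranked list. This is not Google's actual code; the tiny in-memory "inverted index" and the integer weights are invented for illustration.

```python
# Illustrative sketch of phase 1: fan-out over index shards, then merge.
# Each shard maps a term to (doc_id, weight) postings; data is invented.
import heapq

shards = [
    {"ieee": [(101, 9), (103, 4)], "society": [(101, 7)]},   # shard 0
    {"ieee": [(205, 8)], "society": [(205, 6), (207, 5)]},   # shard 1
]

def search_shard(shard, terms):
    """Docs in one shard containing every term, scored by summed weights."""
    postings = [dict(shard.get(t, [])) for t in terms]
    docs = set(postings[0]).intersection(*postings[1:]) if postings else set()
    return [(sum(p[d] for p in postings), d) for d in docs]

def search(terms, k=3):
    hits = []
    for shard in shards:         # in the real cluster this fan-out is parallel
        hits.extend(search_shard(shard, terms))
    return heapq.nlargest(k, hits)   # merge: best-scoring doc ids overall

print(search(["ieee", "society"]))   # -> [(16, 101), (14, 205)]
```

Because each shard is searched independently and only small (score, doc id) lists are merged, adding shards adds machines without adding coordination, which is what makes the index lookup "highly parallelizable".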
15
DESIGN PRINCIPLES OF GOOGLE CLUSTERS
Software-level reliability
  no fault-tolerant hardware features, e.g. redundant power supplies or a redundant array of inexpensive disks (RAID)
  instead, tolerate failures in software
Use replication
  for better request throughput and availability
Price/performance beats peak performance
  the CPUs giving the best performance per unit price, not the CPUs with the best absolute performance
Using commodity PCs reduces the cost of computation
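The "tolerate failures in software" principle can be sketched minimally: each index shard is served by a pool of replicas, and a failed replica is simply skipped. The replica names and the simulated crash below are invented for illustration, not taken from the source.

```python
# Minimal sketch of software-level fault tolerance via replication.
# Machine names and the simulated failure are invented for this example.
import random

replicas = {"shard-0": ["m01", "m02", "m03"]}   # machines serving shard 0
down = {"m02"}                                   # pretend this machine crashed

def query_replica(machine, q):
    if machine in down:
        raise ConnectionError(machine)
    return f"results for {q!r} from {machine}"

def query_shard(shard, q):
    """Try replicas in random order; a failure costs capacity, not service."""
    pool = replicas[shard][:]
    random.shuffle(pool)             # cheap load spreading across replicas
    for machine in pool:
        try:
            return query_replica(machine, q)
        except ConnectionError:
            continue                 # skip the dead machine, keep serving
    raise RuntimeError(f"all replicas of {shard} are down")

print(query_shard("shard-0", "ieee society"))
```

With replicas, a machine failure only shrinks the pool; no query fails, which is why unreliable commodity hardware is acceptable.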
16
FIRST GOOGLE SERVER RACK
In the Computer History Museum (from 1999)
Each tray contains eight 22-GB hard drives and one power supply
17
LEVERAGING COMMODITY PARTS
Google's racks consist of 40 to 80 x86-based servers
Server components are similar to those of a mid-range desktop PC, except for the larger disk drives
Servers range
  from single-processor 533-MHz Intel Celeron based servers
  to dual 1.4-GHz Intel Pentium III servers
Servers on each rack are interconnected via 100-Mbps Ethernet
All racks are interconnected via a gigabit switch
18
LEVERAGING COMMODITY PARTS (Cont.)
Selection criterion: cost per query
  [capital expense (with depreciation) + operating costs (hosting, system administration, repairs)] / performance
Inexpensive PC-based clusters vs. high-end multiprocessor servers
  rack -> 176 2-GHz Xeon CPUs + 176 GB RAM + 7 TB of disk space = $278,000
  server -> 8 2-GHz Xeon CPUs + 64 GB RAM + 8 TB of disk space = $758,000
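The comparison on this slide can be made concrete with back-of-the-envelope arithmetic. The dollar figures and CPU counts are from the slide; treating the aggregate CPU count as a rough proxy for throughput is a simplifying assumption for illustration.

```python
# Back-of-the-envelope comparison of the rack vs. the high-end server.
# Figures are from the slide; CPU count as a throughput proxy is assumed.
rack   = {"price": 278_000, "cpus": 176, "ram_gb": 176, "disk_tb": 7}
server = {"price": 758_000, "cpus": 8,   "ram_gb": 64,  "disk_tb": 8}

price_ratio = server["price"] / rack["price"]
cpu_ratio   = rack["cpus"] / server["cpus"]

print(f"server is {price_ratio:.1f}x the price")               # ~2.7x
print(f"rack has {cpu_ratio:.0f}x as many CPUs")               # 22x
print(f"rack:   ${rack['price'] / rack['cpus']:,.0f} per CPU")
print(f"server: ${server['price'] / server['cpus']:,.0f} per CPU")
```

Under this crude proxy, the per-CPU cost differs by roughly a factor of 60, which is why price/performance, not peak performance, drives the choice.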
19
LEVERAGING COMMODITY PARTS (Cont.)
The multiprocessor server is about 3 times more expensive
  with 22 times fewer CPUs
  and 3 times less RAM
The cost difference of the high-end server is due to its higher interconnect bandwidth and reliability
  which are not necessary in Google's highly redundant architecture
20
THE POWER PROBLEM
A mid-range server with dual 1.4-GHz Pentium III processors draws 90 W of DC power
  55 W for the two CPUs
  10 W for a disk drive
  25 W for DRAM and the motherboard
Typical efficiency of an ATX power supply -> 75%
  means 120 W of AC power per server
  roughly 10 kW per rack
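The slide's power figures follow from simple arithmetic, reproduced below. The 80-servers-per-rack count is an assumption taken from the upper end of the "40 to 80 servers" range given earlier.

```python
# Reproducing the slide's power arithmetic.
# servers_per_rack = 80 is assumed (upper end of the 40-80 range).
dc_watts_per_server = 55 + 10 + 25        # CPUs + disk + DRAM/motherboard
assert dc_watts_per_server == 90

psu_efficiency = 0.75                      # typical ATX power supply
ac_watts_per_server = dc_watts_per_server / psu_efficiency
print(ac_watts_per_server)                 # 120.0 W of AC power per server

servers_per_rack = 80
rack_watts = ac_watts_per_server * servers_per_rack
print(rack_watts / 1000)                   # 9.6 kW, i.e. roughly 10 kW per rack

rack_area_sqft = 25
print(rack_watts / rack_area_sqft)         # 384 W/ft^2, near the cited ~400 W/ft^2
```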
21
THE POWER PROBLEM (Cont.)
A rack occupies about 25 ft² of space
  corresponding power density: 400 W/ft²
  with higher-end processors: 700 W/ft²
Typical power density for commercial data centers: between 70 and 150 W/ft²
  much lower than that required by PC clusters
Special cooling or additional space is required to bring the power density down to a tolerable level
22
THE POWER PROBLEM (Cont.)
Reduced-power servers can be used, but
  they must come without a performance penalty
  they must not be considerably more expensive
23
HARDWARE-LEVEL CHARACTERISTICS
The architectural characteristics of the Google query-serving application are examined
  to determine which hardware platforms give the best price/performance
The index server most heavily impacts the overall price/performance
24
INSTRUCTION LEVEL MEASUREMENTS ON THE INDEX SERVER
(On a 1-GHz dual-processor Pentium III system)

Characteristic                    Value
Cycles per instruction            1.1
Ratios (percentage)
  Branch mispredict               5.0
  Level 1 instruction miss*       0.4
  Level 1 data miss*              0.7
  Level 2 miss*                   0.3
  Instruction TLB miss*           0.04
  Data TLB miss*                  0.7

* Cache and TLB ratios are per instructions retired
25
HARDWARE-LEVEL CHARACTERISTICS
Moderately high CPI, considering that the Pentium III is capable of issuing 3 instructions per cycle
Reason: a significant number of difficult-to-predict branches
  traversal of dynamic data structures
  data-dependent control flow
On the newer Pentium 4 processor, for the same workload
  the CPI is nearly twice as high
  branch prediction performance is approximately the same
  even though the Pentium 4 can issue more instructions concurrently and has superior branch prediction logic
The Google workload does not contain much exploitable instruction-level parallelism (ILP)
26
HARDWARE-LEVEL CHARACTERISTICS (Cont.)
To exploit parallelism:
Trivially parallelizable computation
  the processing of queries requires little communication
  already exploited using a large number of inexpensive nodes at the cluster level
Thread-level parallelism at the microarchitecture level
  simultaneous multithreading (SMT) systems
  chip multiprocessor (CMP) systems
27
HARDWARE-LEVEL CHARACTERISTICS (Cont.)
Simultaneous multithreading (SMT)
  experiments with a dual-context (SMT) Intel Xeon processor showed more than a 30% performance improvement over a single-context setup
  this is at the upper bound of the improvements Intel reported for their SMT implementation
28
HARDWARE-LEVEL CHARACTERISTICS (Cont.)
Chip multiprocessor (CMP) architectures, such as Hydra and Piranha
  multiple (four to eight) simpler, in-order, short-pipeline cores replace a complex high-performance core
  the penalties of in-order execution are minor, because of the little ILP in the Google application
  shorter pipelines reduce or eliminate branch-mispredict penalties
  the available thread-level parallelism can allow near-linear speedup with the number of cores
  a shared L2 cache of reasonable size can speed up inter-processor communication
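The thread-level parallelism that SMT and CMP designs exploit can be sketched with a toy example: independent per-shard lookups run as concurrent threads with almost no communication. The shard data and the scan predicate are invented; note also that in CPython the GIL limits true CPU parallelism, so this shows only the structure of the workload, not a speedup measurement.

```python
# Sketch of the workload's thread-level parallelism: one independent
# lookup per shard, no shared writes. Shard contents are invented.
from concurrent.futures import ThreadPoolExecutor

shards = [list(range(i, 1000, 4)) for i in range(4)]  # 4 toy index shards

def search_shard(shard, needle):
    # independent work per thread: scan one shard, touch no shared state
    return [doc for doc in shard if doc % needle == 0]

def search_all(needle, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda s: search_shard(s, needle), shards)
    return sorted(doc for hits in results for doc in hits)

print(search_all(100)[:5])   # -> [0, 100, 200, 300, 400]
```

Because the threads never coordinate until the final merge, the same structure maps naturally onto SMT contexts or CMP cores, which is why near-linear speedup with core count is plausible for this workload.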
29
MEMORY SYSTEM
Table: main memory system performance parameters
Good performance for the instruction cache and the instruction translation look-aside buffer
  due to the relatively small inner-loop code size
Index data blocks
  no temporal locality, due to the size of the data and the unpredictability of the access patterns
  benefit from spatial locality, so hardware prefetching or larger cache lines can be used
Good overall cache hit ratios (even for relatively modest cache sizes)
30
INSTRUCTION LEVEL MEASUREMENTS ON THE INDEX SERVER
(On a 1-GHz dual-processor Pentium III system)

Characteristic                    Value
Cycles per instruction            1.1
Ratios (percentage)
  Branch mispredict               5.0
  Level 1 instruction miss*       0.4
  Level 1 data miss*              0.7
  Level 2 miss*                   0.3
  Instruction TLB miss*           0.04
  Data TLB miss*                  0.7

* Cache and TLB ratios are per instructions retired
31
MEMORY SYSTEM (Cont.)
Memory bandwidth does not appear to be a bottleneck
A suitable memory system for this load:
  a relatively modest-sized L2 cache
  short L2 cache and memory latencies
  longer (perhaps 128-byte) cache lines
32
SUMMARY
Google infrastructure: a massively large cluster of inexpensive machines
  vs. a smaller number of large-scale shared-memory machines
Shared-memory machines are useful when
  the computation-to-communication ratio is low
  communication patterns or data partitioning are dynamic or hard to predict
  the total cost of ownership is much greater than the hardware costs (due to management overhead and software licensing prices)
  in these cases, they justify their high prices
None of these requirements apply at Google
33
SUMMARY (Cont.)
Google
  partitions index data and computation to minimize communication and to evenly balance the load across servers
  produces its software in-house
  minimizes system management overhead through extensive automation and monitoring
  so hardware costs become the important ones
Deploys many small multiprocessors, so faults affect smaller pieces of the system
  vs. large-scale shared-memory machines, which do not handle individual hardware-component or software failures gracefully
  most fault types there cause a full system crash
34
SUMMARY (Cont.)
It appears there are few applications like Google, requiring many thousands of servers and petabytes of storage
However, many applications share its key characteristics
  a focus on price/performance
  the ability to run on servers without private state (so servers can be replicated), allowing a PC-based cluster architecture
  e.g. high-volume Web servers, or application servers that are computationally intensive but essentially stateless
35
SUMMARY (Cont.)
At Google's scale, some limits of massive server parallelism become apparent, e.g.:
  the limited cooling capacity of commercial data centers
  the less-than-optimal fit of current CPUs for throughput-oriented applications
Nevertheless, using inexpensive PCs has increased the amount of computation that can be afforded per query
  thus helping to improve the users' search experience
36
THANK YOU