The Google Cluster Architecture
Presented by Fatma Canan Pembe
2004800193
2
PURPOSE
To give an overview of the computer architecture of Google
  one of the most widely known and used search engines today
  how it achieves such processing power under such a heavy workload
3
OUTLINE
Introduction
Cluster architectures
Google architecture overview
Serving a Google query
Design principles of Google clusters
  leveraging commodity parts
  the power problem
  hardware-level characteristics
  memory system
Summary
4
INTRODUCTION
Search engines require a high amount of computation per request
A single query on Google (on average)
  reads hundreds of megabytes of data
  consumes tens of billions of CPU cycles
A peak request stream on Google
  thousands of queries per second
  requires an infrastructure comparable in size to the largest supercomputer installations
5
INTRODUCTION (Cont.)
Google combines more than 15,000 commodity-class PCs
  instead of a smaller number of high-end servers
Most important factors that influenced the design
  energy efficiency
  price-performance ratio
The Google application affords easy parallelization
  different queries can run on different processors
  a single query can use multiple processors, because the overall index is partitioned
6
CLUSTER ARCHITECTURES
Cluster: a collection of independent computers using a switched network to provide a common service
Many mainframe applications are run on such "loosely coupled" machines rather than on shared-memory machines
  databases, file servers, Web servers, simulations, etc.
  often need to be highly available, requiring error tolerance and repairability
  often need to scale
7
DISADVANTAGES OF CLUSTERS
Cost of administration
  administering a cluster of N machines is like administering N independent machines
  vs. administering a shared-address-space N-processor multiprocessor, which is like administering 1 big machine
Clusters are usually connected via the I/O bus, whereas multiprocessors are connected via the memory bus
A cluster of N machines has N independent memories and N copies of the OS
  a shared-address-space multiprocessor instead allows 1 program to use almost all of the memory
8
ADVANTAGES OF CLUSTERS
Error isolation
  a separate address space limits contamination by errors
Repair
  easier to replace a machine without bringing down the system than in a shared-memory multiprocessor
Scale
  easier to expand the system without bringing down the application that runs on top of the cluster
Cost
  a large-scale machine has low volume => fewer machines to spread development costs over
  vs. leveraging high-volume off-the-shelf switches and computers
Amazon, AOL, Google, Hotmail, and Yahoo rely on clusters of PCs to provide services used by millions of people every day
9
GOOGLE ARCHITECTURE OVERVIEW
Reliability is provided at the software level rather than in server-class hardware
  so that commodity PCs can be used to build a cluster at a low price
Design for best aggregate throughput rather than peak server response time
Result: a reliable computing infrastructure built from clusters of unreliable commodity PCs
10
SERVING A GOOGLE QUERY
When the user enters a query
  e.g. www.google.com/search?q=ieee+society
The user's browser performs a Domain Name System (DNS) lookup to map the name to a particular IP address
Multiple Google clusters are distributed worldwide
  each cluster has a few thousand machines to handle the query traffic
11
SERVING A GOOGLE QUERY (Cont.)
The geographically distributed setup protects against catastrophic failures
A DNS-based load-balancing system selects a cluster according to
  the user's geographic proximity
  the available capacity at the various clusters
The user's browser sends an HTTP request to one of the clusters
  thereafter, processing is local to that cluster
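The DNS-based selection described above can be sketched as follows. This is a hypothetical illustration only: the slide does not describe the actual selection policy, so the `Cluster` fields, the 90% capacity threshold, and the distance-first preference are all invented assumptions.

```python
# Hypothetical sketch of DNS-based cluster selection (policy details assumed,
# not taken from the source): prefer the nearest cluster with spare capacity.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    distance_km: float   # rough geographic distance to the user
    capacity: float      # queries/sec the cluster can serve
    load: float          # current queries/sec

def pick_cluster(clusters):
    """Prefer nearby clusters, but skip those with little spare capacity."""
    viable = [c for c in clusters if c.load < 0.9 * c.capacity]
    if not viable:                      # everything overloaded: fall back
        viable = clusters
    return min(viable, key=lambda c: c.distance_km)

clusters = [
    Cluster("us-east", 500, 1000, 950),   # close but nearly saturated
    Cluster("us-west", 4000, 1000, 200),  # farther away, mostly idle
]
print(pick_cluster(clusters).name)  # -> us-west
```

The point of the sketch is that proximity is only a preference: an overloaded nearby cluster is passed over in favor of a farther one with capacity.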
12
SERVING A GOOGLE QUERY (Cont.)
A hardware-based load balancer in each cluster
  monitors the available Google Web Servers (GWSs)
  performs local load balancing of requests
A GWS machine coordinates the query execution and returns the results as an HTML response
13
SERVING A GOOGLE QUERY (Cont.)
14
SERVING A GOOGLE QUERY (Cont.)
Query execution phases
1. The index servers determine the relevant documents by consulting an inverted index
  challenging due to the large amount of data
    raw documents -> several tens of terabytes of data
    inverted index -> many terabytes of data
  fortunately, the search is highly parallelizable by dividing the index into pieces (index shards)
  for each shard, a pool of machines serves requests; a load balancer is employed, improving reliability
2. The document servers determine the actual URLs and query-specific summaries of the found documents
  again, the documents are divided into shards
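Phase 1 above can be illustrated with a toy example: fan the query out over index shards and merge the per-shard hits into one ranked list. This is not Google's actual code; the tiny in-memory "inverted index" and the integer weights are invented for illustration.

```python
# Illustrative sketch of phase 1: fan-out over index shards, then merge.
# Each shard maps a term to (doc_id, weight) postings; data is invented.
import heapq

shards = [
    {"ieee": [(101, 9), (103, 4)], "society": [(101, 7)]},   # shard 0
    {"ieee": [(205, 8)], "society": [(205, 6), (207, 5)]},   # shard 1
]

def search_shard(shard, terms):
    """Docs in one shard containing every term, scored by summed weights."""
    postings = [dict(shard.get(t, [])) for t in terms]
    docs = set(postings[0]).intersection(*postings[1:]) if postings else set()
    return [(sum(p[d] for p in postings), d) for d in docs]

def search(terms, k=3):
    hits = []
    for shard in shards:         # in the real cluster this fan-out is parallel
        hits.extend(search_shard(shard, terms))
    return heapq.nlargest(k, hits)   # merge: best-scoring doc ids overall

print(search(["ieee", "society"]))   # -> [(16, 101), (14, 205)]
```

Because each shard is searched independently and only small (score, doc id) lists are merged, adding shards adds machines without adding coordination, which is what makes the index lookup "highly parallelizable".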
15
DESIGN PRINCIPLES OF GOOGLE CLUSTERS
Software-level reliability
  no fault-tolerant hardware features, e.g. redundant power supplies or a redundant array of inexpensive disks (RAID)
  instead, tolerate failures in software
Use replication
  for better request throughput and availability
Price/performance beats peak performance
  the CPUs giving the best performance per unit price, not the CPUs with the best absolute performance
Using commodity PCs reduces the cost of computation
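The "tolerate failures in software" principle can be sketched minimally: each index shard is served by a pool of replicas, and a failed replica is simply skipped. The replica names and the simulated crash below are invented for illustration, not taken from the source.

```python
# Minimal sketch of software-level fault tolerance via replication.
# Machine names and the simulated failure are invented for this example.
import random

replicas = {"shard-0": ["m01", "m02", "m03"]}   # machines serving shard 0
down = {"m02"}                                   # pretend this machine crashed

def query_replica(machine, q):
    if machine in down:
        raise ConnectionError(machine)
    return f"results for {q!r} from {machine}"

def query_shard(shard, q):
    """Try replicas in random order; a failure costs capacity, not service."""
    pool = replicas[shard][:]
    random.shuffle(pool)             # cheap load spreading across replicas
    for machine in pool:
        try:
            return query_replica(machine, q)
        except ConnectionError:
            continue                 # skip the dead machine, keep serving
    raise RuntimeError(f"all replicas of {shard} are down")

print(query_shard("shard-0", "ieee society"))
```

With replicas, a machine failure only shrinks the pool; no query fails, which is why unreliable commodity hardware is acceptable.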
16
FIRST GOOGLE SERVER RACK
In the Computer History Museum (from 1999)
Each tray contains eight 22-GB hard drives and one power supply
17
LEVERAGING COMMODITY PARTS
Google's racks consist of 40 to 80 x86-based servers
Server components are similar to those of a mid-range desktop PC, except for the larger disk drives
Servers range
  from single-processor 533-MHz Intel Celeron based servers
  to dual 1.4-GHz Intel Pentium III servers
Servers on each rack are interconnected via 100-Mbps Ethernet
All racks are interconnected via a gigabit switch
18
LEVERAGING COMMODITY PARTS (Cont.)
Selection criterion: cost per query
  [capital expense (with depreciation) + operating costs (hosting, system administration, repairs)] / performance
Inexpensive PC-based clusters vs. high-end multiprocessor servers
  rack -> 176 2-GHz Xeon CPUs + 176 GB RAM + 7 TB of disk space = $278,000
  server -> 8 2-GHz Xeon CPUs + 64 GB RAM + 8 TB of disk space = $758,000
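The comparison on this slide can be made concrete with back-of-the-envelope arithmetic. The dollar figures and CPU counts are from the slide; treating the aggregate CPU count as a rough proxy for throughput is a simplifying assumption for illustration.

```python
# Back-of-the-envelope comparison of the rack vs. the high-end server.
# Figures are from the slide; CPU count as a throughput proxy is assumed.
rack   = {"price": 278_000, "cpus": 176, "ram_gb": 176, "disk_tb": 7}
server = {"price": 758_000, "cpus": 8,   "ram_gb": 64,  "disk_tb": 8}

price_ratio = server["price"] / rack["price"]
cpu_ratio   = rack["cpus"] / server["cpus"]

print(f"server is {price_ratio:.1f}x the price")               # ~2.7x
print(f"rack has {cpu_ratio:.0f}x as many CPUs")               # 22x
print(f"rack:   ${rack['price'] / rack['cpus']:,.0f} per CPU")
print(f"server: ${server['price'] / server['cpus']:,.0f} per CPU")
```

Under this crude proxy, the per-CPU cost differs by roughly a factor of 60, which is why price/performance, not peak performance, drives the choice.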
19
LEVERAGING COMMODITY PARTS (Cont.)
The multiprocessor server is about 3 times more expensive
  with 22 times fewer CPUs
  and 3 times less RAM
The cost difference of the high-end server is due to its higher interconnect bandwidth and reliability
  which are not necessary in Google's highly redundant architecture
20
THE POWER PROBLEM
A mid-range server with dual 1.4-GHz Pentium III processors draws 90 W of DC power
  55 W for the two CPUs
  10 W for a disk drive
  25 W for DRAM and the motherboard
Typical efficiency of an ATX power supply -> 75%
  means 120 W of AC power per server
  roughly 10 kW per rack
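The slide's power figures follow from simple arithmetic, reproduced below. The 80-servers-per-rack count is an assumption taken from the upper end of the "40 to 80 servers" range given earlier.

```python
# Reproducing the slide's power arithmetic.
# servers_per_rack = 80 is assumed (upper end of the 40-80 range).
dc_watts_per_server = 55 + 10 + 25        # CPUs + disk + DRAM/motherboard
assert dc_watts_per_server == 90

psu_efficiency = 0.75                      # typical ATX power supply
ac_watts_per_server = dc_watts_per_server / psu_efficiency
print(ac_watts_per_server)                 # 120.0 W of AC power per server

servers_per_rack = 80
rack_watts = ac_watts_per_server * servers_per_rack
print(rack_watts / 1000)                   # 9.6 kW, i.e. roughly 10 kW per rack

rack_area_sqft = 25
print(rack_watts / rack_area_sqft)         # 384 W/ft^2, near the cited ~400 W/ft^2
```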
21
THE POWER PROBLEM (Cont.)
A rack occupies about 25 ft² of space
  corresponding power density: 400 W/ft²
  with higher-end processors: 700 W/ft²
Typical power density for commercial data centers: between 70 and 150 W/ft²
  much lower than that required by PC clusters
Special cooling or additional space is required to bring the power density down to a tolerable level
22
THE POWER PROBLEM (Cont.)
Reduced-power servers can be used, but
  they must come without a performance penalty
  they must not be considerably more expensive
23
HARDWARE-LEVEL CHARACTERISTICS
The architectural characteristics of the Google query-serving application are examined
  to determine which hardware platforms give the best price/performance
The index server most heavily impacts the overall price/performance
24
INSTRUCTION LEVEL MEASUREMENTS ON THE INDEX SERVER
(On a 1-GHz dual-processor Pentium III system)

Characteristic                    Value
Cycles per instruction            1.1
Ratios (percentage)
  Branch mispredict               5.0
  Level 1 instruction miss*       0.4
  Level 1 data miss*              0.7
  Level 2 miss*                   0.3
  Instruction TLB miss*           0.04
  Data TLB miss*                  0.7

* Cache and TLB ratios are per instructions retired
25
HARDWARE-LEVEL CHARACTERISTICS
Moderately high CPI, considering that the Pentium III is capable of issuing 3 instructions per cycle
Reason: a significant number of difficult-to-predict branches
  traversal of dynamic data structures
  data-dependent control flow
On the newer Pentium 4 processor, for the same workload
  the CPI is nearly twice as high
  branch prediction performance is approximately the same
  even though the Pentium 4 can issue more instructions concurrently and has superior branch prediction logic
The Google workload does not contain much exploitable instruction-level parallelism (ILP)
26
HARDWARE-LEVEL CHARACTERISTICS (Cont.)
To exploit parallelism:
Trivially parallelizable computation
  the processing of queries requires little communication
  already exploited using a large number of inexpensive nodes at the cluster level
Thread-level parallelism at the microarchitecture level
  simultaneous multithreading (SMT) systems
  chip multiprocessor (CMP) systems
27
HARDWARE-LEVEL CHARACTERISTICS (Cont.)
Simultaneous multithreading (SMT)
  experiments with a dual-context (SMT) Intel Xeon processor showed more than a 30% performance improvement over a single-context setup
  this is at the upper bound of the improvements Intel reported for their SMT implementation
28
HARDWARE-LEVEL CHARACTERISTICS (Cont.)
Chip multiprocessor (CMP) architectures, such as Hydra and Piranha
  multiple (four to eight) simpler, in-order, short-pipeline cores replace a complex high-performance core
  the penalties of in-order execution are minor, because of the little ILP in the Google application
  shorter pipelines reduce or eliminate branch-mispredict penalties
  the available thread-level parallelism can allow near-linear speedup with the number of cores
  a shared L2 cache of reasonable size can speed up inter-processor communication
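The thread-level parallelism that SMT and CMP designs exploit can be sketched with a toy example: independent per-shard lookups run as concurrent threads with almost no communication. The shard data and the scan predicate are invented; note also that in CPython the GIL limits true CPU parallelism, so this shows only the structure of the workload, not a speedup measurement.

```python
# Sketch of the workload's thread-level parallelism: one independent
# lookup per shard, no shared writes. Shard contents are invented.
from concurrent.futures import ThreadPoolExecutor

shards = [list(range(i, 1000, 4)) for i in range(4)]  # 4 toy index shards

def search_shard(shard, needle):
    # independent work per thread: scan one shard, touch no shared state
    return [doc for doc in shard if doc % needle == 0]

def search_all(needle, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda s: search_shard(s, needle), shards)
    return sorted(doc for hits in results for doc in hits)

print(search_all(100)[:5])   # -> [0, 100, 200, 300, 400]
```

Because the threads never coordinate until the final merge, the same structure maps naturally onto SMT contexts or CMP cores, which is why near-linear speedup with core count is plausible for this workload.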
29
MEMORY SYSTEM
Table: main memory system performance parameters
Good performance for the instruction cache and the instruction translation look-aside buffer
  due to the relatively small inner-loop code size
Index data blocks
  no temporal locality, due to the size of the data and the unpredictability of the access patterns
  benefit from spatial locality, so hardware prefetching or larger cache lines can be used
Good overall cache hit ratios (even for relatively modest cache sizes)
30
INSTRUCTION LEVEL MEASUREMENTS ON THE INDEX SERVER
(On a 1-GHz dual-processor Pentium III system)

Characteristic                    Value
Cycles per instruction            1.1
Ratios (percentage)
  Branch mispredict               5.0
  Level 1 instruction miss*       0.4
  Level 1 data miss*              0.7
  Level 2 miss*                   0.3
  Instruction TLB miss*           0.04
  Data TLB miss*                  0.7

* Cache and TLB ratios are per instructions retired
31
MEMORY SYSTEM (Cont.)
Memory bandwidth does not appear to be a bottleneck
A suitable memory system for this load:
  a relatively modest-sized L2 cache
  short L2 cache and memory latencies
  longer (perhaps 128-byte) cache lines
32
SUMMARY
Google infrastructure: a massively large cluster of inexpensive machines
  vs. a smaller number of large-scale shared-memory machines
Shared-memory machines are useful when
  the computation-to-communication ratio is low
  communication patterns or data partitioning are dynamic or hard to predict
  the total cost of ownership is much greater than the hardware costs (due to management overhead and software licensing prices)
  in these cases, they justify their high prices
None of these requirements apply at Google
33
SUMMARY (Cont.)
Google
  partitions index data and computation to minimize communication and to evenly balance the load across servers
  produces its software in-house
  minimizes system management overhead through extensive automation and monitoring
  so hardware costs become the important ones
Deploys many small multiprocessors, so faults affect smaller pieces of the system
  vs. large-scale shared-memory machines, which do not handle individual hardware-component or software failures gracefully
  most fault types there cause a full system crash
34
SUMMARY (Cont.)
It appears there are few applications like Google, requiring many thousands of servers and petabytes of storage
However, many applications share its key characteristics
  a focus on price/performance
  the ability to run on servers without private state (so servers can be replicated), allowing a PC-based cluster architecture
  e.g. high-volume Web servers, or application servers that are computationally intensive but essentially stateless
35
SUMMARY (Cont.)
At Google's scale, some limits of massive server parallelism become apparent, e.g.:
  the limited cooling capacity of commercial data centers
  the less-than-optimal fit of current CPUs for throughput-oriented applications
Nevertheless, using inexpensive PCs has increased the amount of computation that can be afforded per query
  thus helping to improve the users' search experience
36
THANK YOU