
Warehouse Scale Computer

Introduction

• Had scale been the only distinguishing feature of these systems we might simply refer to them as datacenters.

• Datacenters are buildings where multiple servers and communication gear are co-located because of their common environmental requirements and physical security needs, and for ease of maintenance.

• In that sense, a WSC is a type of datacenter.

Introduction

• Traditional datacenters, however, typically host a large number of relatively small- or medium-sized applications, each running on a dedicated hardware infrastructure that is de-coupled and protected from other systems in the same facility.

• Those datacenters host hardware and software for multiple organizational units or even different companies.

• Different computing systems within such a datacenter often have little in common in terms of hardware, software, or maintenance infrastructure, and tend not to communicate with each other at all.

Introduction

• WSCs currently power the services offered by companies such as Google, Amazon, Facebook, and Microsoft’s online services division.

• They differ significantly from traditional datacenters:
  1. They belong to a single organization.
  2. They use a relatively homogeneous hardware and system software platform.
  3. They share a common systems management layer.

• Often, much of the application, middleware, and system software is built in-house, compared to the predominance of third-party software running in conventional datacenters.

Introduction

• Most importantly, WSCs run a smaller number of very large applications (or Internet services), and the common resource management infrastructure allows significant deployment flexibility.

• The requirements of homogeneity, single-organization control, and enhanced focus on cost efficiency motivate designers to take new approaches in constructing and operating these systems.

Introduction

• Internet services must achieve high availability, typically aiming for at least 99.99% uptime (“four nines”, about an hour of downtime per year).

• Achieving fault-free operation on a large collection of hardware and system software is hard and is made more difficult by the large number of servers involved.

Introduction

• Although it might be theoretically possible to prevent hardware failures in a collection of 10,000 servers, it would surely be extremely expensive.

• Consequently, WSC workloads must be designed to gracefully tolerate large numbers of component faults with little or no impact on service level performance and availability.

ARCHITECTURAL OVERVIEW OF WSCS

• The hardware implementation of a WSC will differ significantly from one installation to the next.

• Even within a single organization such as Google, systems deployed in different years use different basic elements, reflecting the hardware improvements provided by the industry.

• However, the architectural organization of these systems has been relatively stable over the last few years.

• Therefore, it is useful to describe this general architecture at a high level as it sets the background for subsequent discussions.

ARCHITECTURAL OVERVIEW OF WSCS

• Being satisfied with neither the metric nor the US system, rack designers use “rack units” to measure the height of servers.

• 1U is 1.75 inches or 44.45 mm; a typical rack is 42U high.

• The 19-inch (48.26-cm) rack is still the standard framework to hold servers, despite this standard going back to railroad hardware from the 1930s.

ARCHITECTURAL OVERVIEW OF WSCS

Sketch of the typical elements in warehouse-scale systems: 1U server (left), 7’ rack with Ethernet switch (middle), and diagram of a small cluster with a cluster-level Ethernet switch/router (right).

ARCHITECTURAL OVERVIEW OF WSCS

• Previous Figure depicts the high-level building blocks for WSCs.

• Low-end servers, typically in a 1U or blade enclosure format, are mounted within a rack and interconnected using a local Ethernet switch.

• These rack-level switches, which can use 1- or 10-Gbps links, have a number of uplink connections to one or more cluster-level (or datacenter-level) Ethernet switches.

• This second-level switching domain can potentially span more than ten thousand individual servers.

ARCHITECTURAL OVERVIEW OF WSCS

• In the case of a blade enclosure there is an additional first level of networking aggregation within the enclosure where multiple processing blades connect to a small number of networking blades through an I/O bus such as PCIe.

ARCHITECTURAL OVERVIEW OF WSCS

• A 7-foot (213.36-cm) rack offers 48 U, so it’s not a coincidence that the most popular switch for a rack is a 48-port Ethernet switch.

• This product has become a commodity that costs as little as $30 per port for a 1 Gbit/sec Ethernet link in 2011.

• Note that the bandwidth within the rack is the same for each server, so it does not matter where the software places the sender and the receiver as long as they are within the same rack.

ARCHITECTURAL OVERVIEW OF WSCS

• This flexibility is ideal from a software perspective.

• These switches typically offer two to eight uplinks, which leave the rack to go to the next higher switch in the network hierarchy.

• Thus, the bandwidth leaving the rack is 6 to 24 times smaller (48/8 to 48/2) than the bandwidth within the rack. This ratio is called oversubscription.

• More generally, the uplink bandwidth is 48/n times lower than the bandwidth within the rack, where n = number of uplink ports.
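The oversubscription arithmetic is easy to check programmatically. The following is a minimal Python sketch (illustrative only: the 48-port switch and 1 Gbit/sec links are the slide's example; the function and variable names are ours):

```python
# Illustrative sketch: oversubscription of a rack switch, using the slide's
# example of a 48-port, 1 Gbit/sec switch with n uplink ports.

def oversubscription(server_ports: int, uplink_ports: int, link_gbps: float = 1.0):
    """Return (oversubscription ratio, per-server share of uplink bandwidth in Gbit/s)."""
    ratio = server_ports / uplink_ports                      # e.g., 48/8 = 6, 48/2 = 24
    per_server_uplink = uplink_ports * link_gbps / server_ports
    return ratio, per_server_uplink

for n in (2, 4, 8):
    ratio, share = oversubscription(48, n)
    print(f"{n} uplinks: oversubscription {ratio:.0f}:1, "
          f"~{share * 1000:.0f} Mbit/s of uplink bandwidth per server")
```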

ARCHITECTURAL OVERVIEW OF WSCS

• Alas, large oversubscription means programmers must be aware of the performance consequences when placing senders and receivers in different racks.

• This increased software-scheduling burden is another argument for network switches designed specifically for the datacenter.

ARCHITECTURAL OVERVIEW OF WSCS

Picture of a row of servers in a Google WSC, 2012.

ARCHITECTURAL OVERVIEW OF WSCS

• Array Switch
• A switch that connects an array of racks.
• An array switch should have 10× the bisection bandwidth of a rack switch.
• The cost of an n-port switch grows as n².
• Array switches often use content-addressable memory chips and FPGAs to support high-speed packet inspection.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC Memory Hierarchy

[Figures: latency, bandwidth, and capacity of the memory hierarchy inside a WSC, shown as a table and as a chart.]

ARCHITECTURAL OVERVIEW OF WSCS

• WSC Memory Hierarchy

• The previous figures show the latency, bandwidth, and capacity of the memory hierarchy inside a WSC, both as a table and visually.

• Each server contains 16 GBytes of DRAM with a 100-nanosecond access time that transfers at 20 GBytes/sec, and 2 terabytes of disk with a 10-millisecond access time that transfers at 200 MBytes/sec.

• There are two sockets per board, and they share one 1 Gbit/sec Ethernet port.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC Memory Hierarchy

• Every pair of racks includes one rack switch and holds 80 2U servers.

• Networking software plus switch overhead increases the latency to DRAM to 100 microseconds and the disk access latency to 11 milliseconds.

• Thus, the total storage capacity of a rack is roughly 1 terabyte of DRAM and 160 terabytes of disk storage.

• The 1 Gbit/sec Ethernet limits the remote bandwidth to DRAM or disk within the rack to 100 MBytes/sec.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC Memory Hierarchy

• The array switch can handle 30 racks, so the storage capacity of an array goes up by a factor of 30: 30 terabytes of DRAM and 4.8 petabytes of disk.

• The array switch hardware and software increase the latency to DRAM within an array to 500 microseconds and the disk latency to 12 milliseconds.

• The bandwidth of the array switch limits the remote bandwidth to either array DRAM or array disk to 10 MBytes/sec.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC Memory Hierarchy

• The previous figures show that network overhead dramatically increases latency from local DRAM to rack DRAM and array DRAM, but both still have more than 10 times better latency than the local disk.

• The network collapses the difference in bandwidth between rack DRAM and rack disk and between array DRAM and array disk.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC Memory Hierarchy

• What is the average DRAM access latency, assuming that 90% of accesses are local to the server, 9% are outside the server but local to the rack, and 1% are outside the rack but within the array?

• (90% × 0.1 μs) + (9% × 100 μs) + (1% × 300 μs) = 0.09 + 9 + 3 = 12.09 microseconds
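The same weighted-average calculation as a short Python sketch (latency values copied from the example; the dictionary layout and names are ours):

```python
# Minimal sketch of the average-latency example above (latencies in microseconds).

latency_us = {"server": 0.1, "rack": 100.0, "array": 300.0}
fraction   = {"server": 0.90, "rack": 0.09, "array": 0.01}

avg = sum(fraction[level] * latency_us[level] for level in latency_us)
print(f"Average DRAM access latency: {avg:.2f} microseconds")   # 12.09
```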

ARCHITECTURAL OVERVIEW OF WSCS

• WSC Memory Hierarchy

• How long does it take to transfer 1000 MB between disks within the server, between servers in the rack, and between servers in different racks of an array?

• Within server: 1000/200 = 5 sec
• Within rack: 1000/100 = 10 sec
• Within array: 1000/10 = 100 sec
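The same transfer-time arithmetic as a small Python sketch (bandwidth figures come from the slides; the dictionary is ours):

```python
# Minimal sketch of the 1000 MB transfer-time example above.
# Bandwidths are the slide's effective disk/remote figures in MBytes/sec.

bandwidth_mb_per_s = {"within server": 200, "within rack": 100, "within array": 10}
transfer_mb = 1000

for scope, bw in bandwidth_mb_per_s.items():
    print(f"{scope}: {transfer_mb / bw:.0f} sec")   # 5, 10, 100 sec
```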

ARCHITECTURAL OVERVIEW OF WSCS

• The WSC needs 20 arrays to reach 50,000 servers, so there is one more level of the networking hierarchy.

• Next Figure shows the conventional Layer 3 routers to connect the arrays together and to the Internet.

ARCHITECTURAL OVERVIEW OF WSCS

The Layer 3 network used to link arrays together and to the Internet [Greenberg et al. 2009]. Some WSCs use a separate border router to connect the Internet to the datacenter Layer 3 switches.

ARCHITECTURAL OVERVIEW OF WSCS

Sample three-stage fat tree topology.

ARCHITECTURAL OVERVIEW OF WSCS

• Another way to tackle network scalability is to offload some traffic to a special-purpose network.

• For example, if storage traffic is a big component of overall traffic, we could build a separate network to connect servers to storage units.

• If that traffic is more localized (not all servers need to be attached to all storage units) we can build smaller-scale networks, thus reducing costs.

ARCHITECTURAL OVERVIEW OF WSCS

• Historically, that’s how all storage was networked: a SAN (storage area network) connected servers to disks, typically using FibreChannel networks rather than Ethernet.

• Today, Ethernet is becoming more common since it offers comparable speeds, and protocols such as FCoE (FibreChannel over Ethernet) and iSCSI (SCSI over IP) allow Ethernet networks to integrate well with traditional SANs.

ARCHITECTURAL OVERVIEW OF WSCS

• WSCs using VMs (or, more generally, task migration) pose further challenges to networks since connection endpoints (i.e., IP address/port combinations) can move from one physical machine to another.

• Typical networking hardware and network management software do not anticipate such moves and in fact often explicitly assume that they are not possible.

ARCHITECTURAL OVERVIEW OF WSCS

• For example, network designs often assume that all machines in a given rack have IP addresses in a common subnet, which simplifies administration and minimizes the number of required entries in forwarding (routing) tables.

• More importantly, frequent migration makes it impossible to manage the network manually: programming network elements needs to be automated, so the same cluster manager that decides the placement of computations also needs to update the network state.

ARCHITECTURAL OVERVIEW OF WSCS

• The Need for SDN

• The need for a programmable network has led to much interest in OpenFlow [http://www.openflow.org/] and software-defined networking (SDN), which moves the network control plane out of the individual switches into a logically centralized controller.

ARCHITECTURAL OVERVIEW OF WSCS

• The Need for SDN

• Controlling a network from a logically centralized server offers many advantages; in particular, common networking algorithms such as computing reachability, shortest paths, or max-flow traffic placement become much simpler to solve. In current networks, each individual router must solve the same problem while dealing with limited visibility (direct neighbors only), inconsistent network state (routers that are out of sync with the current network state), and many independent and concurrent actors (routers).

ARCHITECTURAL OVERVIEW OF WSCS

• STORAGE

• Disk drives or Flash devices are connected directly to each individual server and managed by a global distributed file system (such as Google’s GFS), or they can be part of Network Attached Storage (NAS) devices connected directly to the cluster-level switching fabric.

• A NAS tends to be a simpler solution to deploy initially because it allows some of the data management responsibilities to be outsourced to a NAS appliance vendor.

ARCHITECTURAL OVERVIEW OF WSCS

• STORAGE

• Keeping storage separate from computing nodes also makes it easier to enforce quality of service guarantees, since the NAS runs no compute jobs besides the storage server.

• In contrast, attaching disks directly to compute nodes can reduce hardware costs (the disks leverage the existing server enclosure) and improve networking fabric utilization (each server network port is effectively shared dynamically between the computing tasks and the file system).

ARCHITECTURAL OVERVIEW OF WSCS

• STORAGE

• The replication model of these two approaches is also fundamentally different.

• A NAS tends to provide high availability through replication or error-correction capabilities within each appliance, whereas systems like GFS implement replication across different machines and consequently use more networking bandwidth to complete write operations.

ARCHITECTURAL OVERVIEW OF WSCS

• STORAGE

• However, GFS-like systems are able to keep data available even after the loss of an entire server enclosure or rack, and may allow higher aggregate read bandwidth because the same data can be sourced from multiple replicas.

• Trading off higher write overheads for lower cost, higher availability, and increased read bandwidth was the right solution for many of Google’s early workloads.

ARCHITECTURAL OVERVIEW OF WSCS

• STORAGE

• An additional advantage of having disks co-located with compute servers is that it enables distributed system software to exploit data locality.

• Given how networking performance has outpaced disk performance over the last decades, such locality advantages are less useful for disks, but they may remain beneficial for faster modern storage devices such as those using Flash.

ARCHITECTURAL OVERVIEW OF WSCS

• STORAGE

• NAND Flash technology has made Solid State Drives (SSDs) affordable for a growing class of storage needs in WSCs.

• While the cost per byte stored in SSDs will remain much higher than in disks for the foreseeable future, many Web services have I/O rates that cannot be easily achieved with disk-based systems.

• Since SSDs can deliver I/O rates many orders of magnitude higher than disks, they are increasingly displacing disk drives as the repository of choice for databases in Web services.

ARCHITECTURAL OVERVIEW OF WSCS

HDD interiors almost resemble a high-tech record player.

OCZ's Vector SSD is one of the fastest around

The OCZ RevoDrive Hybrid.

ARCHITECTURAL OVERVIEW OF WSCS

• STORAGE

• Types of NAND Flash

• There are primarily two types of NAND Flash widely used today: Single-Level Cell (SLC) and Multi-Level Cell (MLC). NAND Flash stores data in a large array of cells.

• Each cell stores one bit per cell for SLC NAND and two bits per cell for MLC. So, SLC NAND stores a “0” or “1” in each cell, and MLC NAND stores “00”, “01”, “10”, or “11” in each cell.

• SLC and MLC NAND offer different levels of performance and endurance at different price points, with SLC being the higher-performing and more costly of the two.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• The data manipulated by WSC workloads tends to fall into two categories: data that is private to individual running tasks, and data that is part of the shared state of the distributed workload.

• Private data tends to reside in local DRAM or disk, is rarely replicated, and its management is simplified by virtue of its single-user semantics.

• In contrast, shared data must be much more durable and is accessed by a large number of clients, and thus requires a much more sophisticated distributed storage system.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• UNSTRUCTURED WSC STORAGE

• Google’s GFS is an example of a storage system with a simple file-like abstraction (Google’s Colossus system has since replaced GFS, but it follows a similar architectural philosophy, so we describe the better-known GFS here).

• GFS was designed to support the Web search indexing system (the system that turned crawled Web pages into index files for use in Web search), and therefore focuses on high throughput for thousands of concurrent readers/writers and robust performance under high hardware failure rates.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• UNSTRUCTURED WSC STORAGE

• GFS users typically manipulate large quantities of data, and thus GFS is further optimized for large operations.

• The system architecture consists of a master, which handles metadata operations, and thousands of chunk server (slave) processes running on every server with a disk drive, which manage the data chunks on those drives.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• UNSTRUCTURED WSC STORAGE

• In GFS, fault tolerance is provided by replication across machines instead of within them, as is the case in RAID systems.

• Cross-machine replication allows the system to tolerate machine and network failures and enables fast recovery, since replicas for a given disk or machine can be spread across thousands of other machines.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• UNSTRUCTURED WSC STORAGE

• Although the initial version of GFS only supported simple replication, today’s version (Colossus) has added support for more space-efficient Reed-Solomon codes, which tend to reduce the space overhead by roughly a factor of two compared to simple replication for the same level of availability.
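To make the “roughly a factor of two” claim concrete, here is a small Python sketch comparing the raw-storage overhead of 3-way replication with a Reed-Solomon code; the (6 data, 3 parity) code parameters are illustrative assumptions, not Colossus’s actual configuration:

```python
# Illustrative sketch: storage overhead of simple replication vs. a
# Reed-Solomon code. The specific parameters are assumptions for
# illustration, not the actual GFS/Colossus configuration.

def replication_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte with simple n-way replication."""
    return float(copies)

def reed_solomon_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte with an RS(data+parity, data) code."""
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead(3))       # 3.0x raw storage per logical byte
print(reed_solomon_overhead(6, 3))   # 1.5x, roughly half the overhead
```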

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• UNSTRUCTURED WSC STORAGE

• An important factor in maintaining high availability is distributing file chunks across the whole cluster in such a way that a small number of correlated failures is extremely unlikely to lead to data loss.

• GFS takes advantage of knowledge about known possible correlated fault scenarios and attempts to distribute replicas in a way that avoids their co-location in a single fault domain (see the sketch after this list).

• Wide distribution of chunks across disks over a whole cluster is also key for speeding up recovery.

• Since replicas of chunks in a given disk are spread across possibly all machines in a storage cluster, reconstruction of lost data chunks is performed in parallel at high speed.

• Quick recovery is important since long recovery time windows leave under-replicated chunks vulnerable to data loss should additional faults hit the cluster.
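The placement policy described above boils down to “never put two replicas of a chunk in the same fault domain.” The snippet below is a toy illustration of that idea, treating a rack as the fault domain; it is not GFS code, and the function and data layout are ours:

```python
# Toy sketch of fault-domain-aware replica placement: pick servers for a
# chunk's replicas so that no two replicas share a rack (one kind of
# correlated-failure domain). Purely illustrative.
import random

def place_replicas(servers, replicas=3):
    """servers: list of (server_id, rack_id). Returns chosen server_ids."""
    chosen, used_racks = [], set()
    for server_id, rack_id in random.sample(servers, len(servers)):
        if rack_id not in used_racks:
            chosen.append(server_id)
            used_racks.add(rack_id)
        if len(chosen) == replicas:
            return chosen
    raise RuntimeError("not enough distinct fault domains for the requested replicas")

cluster = [(f"server-{i}", f"rack-{i // 40}") for i in range(200)]  # 40 servers per rack
print(place_replicas(cluster))
```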

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• STRUCTURED WSC STORAGE

• The simple file abstraction of GFS and Colossus may suffice for systems that manipulate large blobs of data, but application developers also need the WSC equivalent of database-like functionality, where data sets can be structured and indexed for easy small updates or complex queries.

• A blob (binary large object, basic large object, BLOB, or BLOb) is a collection of binary data stored as a single entity in a database management system. Blobs are typically images, audio, or other multimedia objects, though sometimes binary executable code is stored as a blob.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• STRUCTURED WSC STORAGE

• Structured distributed storage systems such as Google’s BigTable and Amazon’s Dynamo were designed to fulfill those needs.

• Compared to traditional database systems, BigTable and Dynamo sacrifice some features, such as richness of schema representation and strong consistency, in favor of higher performance and availability at massive scales.

• BigTable, for example, presents a simple multi-dimensional sorted map consisting of row keys (strings) associated with multiple values organized in columns, forming a distributed sparse table space. Column values are associated with timestamps in order to support versioning and time series.
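The data model just described (row key, column, timestamped versions) can be mimicked with ordinary in-memory maps. The following is a toy Python sketch of that abstraction only; it is not the real BigTable API, and all names are ours:

```python
# Toy sketch of a BigTable-like sparse, sorted, multi-dimensional map:
# (row key, column, timestamp) -> value. Illustrative only.
import time
from collections import defaultdict

class TinyTable:
    def __init__(self):
        # row -> column -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        ts = time.time() if ts is None else ts
        cells = self._rows[row][column]
        cells.append((ts, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)   # newest version first

    def get(self, row, column, ts=None):
        """Return the newest value at or before ts (default: latest version)."""
        for cell_ts, value in self._rows[row][column]:
            if ts is None or cell_ts <= ts:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, columns) for rows in [start_row, end_row), in sorted row order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, self._rows[row]

t = TinyTable()
t.put("com.example/index.html", "contents", "<html>...</html>")
print(t.get("com.example/index.html", "contents"))
```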

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• STRUCTURED WSC STORAGE

• The choice of eventual consistency in BigTable and Dynamo shifts the burden of resolving temporary inconsistencies to the applications using these systems.

• A number of application developers within Google have found it inconvenient to deal with weak consistency models and the limitations of the simple data schemes in BigTable.

• Second-generation structured storage systems such as MegaStore and, subsequently, Spanner have been designed to address such concerns.

• Both MegaStore and Spanner provide richer schemas and SQL-like functionality while offering simpler, stronger consistency models.

ARCHITECTURAL OVERVIEW OF WSCS

Weak Consistency

• The protocol is said to support weak consistency if:

• All accesses to synchronization variables are seen by all processes (or nodes, processors) in the same order (sequentially); these are synchronization operations.

• Accesses to critical sections are seen sequentially.

• All other accesses may be seen in a different order on different processes (or nodes, processors).

• The set of both read and write operations between different synchronization operations is the same in each process.

Strong Consistency

• The protocol is said to support strong consistency if:

• All accesses are seen by all parallel processes (or nodes, processors, etc.) in the same order (sequentially).

• Therefore, only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes, etc.) can perceive variables in different states.

ARCHITECTURAL OVERVIEW OF WSCS

• WSC STORAGE

• INTERPLAY OF STORAGE AND NETWORKING TECHNOLOGY

• The success of WSC distributed storage systems can be partially attributed to the evolution of datacenter networking fabrics.

• It has been observed that the gap between networking and disk performance has widened to the point that disk locality is no longer relevant in intra-datacenter computations.

• This observation enables dramatic simplifications in the design of distributed disk-based storage systems, as well as utilization improvements, since any disk byte in a WSC facility can in principle be used by any task regardless of their relative locality.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER TIER CLASSIFICATIONS AND SPECIFICATIONS

• The design of a datacenter is often classified as belonging to “Tier I–IV.”

• The Uptime Institute, a professional services organization specializing in datacenters, and the Telecommunications Industry Association (TIA), an industry group accredited by ANSI and made up of approximately 400 member companies, both advocate a 4-tier classification loosely based on the power distribution, uninterruptible power supply (UPS), cooling delivery, and redundancy of the datacenter.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER TIER CLASSIFICATIONS AND SPECIFICATIONS

• Tier I datacenters have a single path for power distribution, UPS, and cooling distribution, without redundant components.

• Tier II adds redundant components to this design (N + 1), improving availability.

• Tier III datacenters have one active and one alternate distribution path for utilities. Each path has redundant components, and the paths are concurrently maintainable; that is, they provide redundancy even during maintenance.

• Tier IV datacenters have two simultaneously active power and cooling distribution paths, redundant components in each path, and are supposed to tolerate any single equipment failure without impacting the load.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER TIER CLASSIFICATIONS AND SPECIFICATIONS

• The Uptime Institute’s specification is generally performance-based (with notable exceptions for the amount of backup diesel fuel, water storage, and ASHRAE temperature design points).

• The specification describes topology rather than prescribing a specific list of components to meet the requirements, so there are many architectures that can achieve a given tier classification.

• In contrast, TIA-942 is very prescriptive and specifies a variety of implementation details such as building construction, ceiling height, voltage levels, types of racks, and patch cord labeling.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER TIER CLASSIFICATIONS AND SPECIFICATIONS

• Formally achieving tier classification certification is difficult and requires a full review from one of the granting bodies, and most datacenters are not formally rated.

• Most commercial datacenters fall somewhere between Tiers III and IV, choosing a balance between construction cost and reliability.

• Generally, the lowest of the individual subsystem ratings (cooling, power, etc.) determines the overall tier classification of the datacenter.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER TIER CLASSIFICATIONS AND SPECIFICATIONS

• Real-world datacenter reliability is strongly influenced by the quality of the organization running the datacenter, not just by the design.

• The Uptime Institute reports that over 70% of datacenter outages are the result of human error, including management decisions on staffing, maintenance, and training.

• Theoretical availability estimates used in the industry range from 99.7% for Tier II datacenters to 99.98% and 99.995% for Tiers III and IV, respectively.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• The broadest definition of WSC energy efficiency would measure the energy used to run a particular workload (say, to sort a petabyte of data).

• Unfortunately, no two companies run the same workload, and real-world application mixes change all the time, so it is hard to benchmark real-world WSCs this way.

• Thus, even though such benchmarks have been contemplated as far back as 2008, they have not yet emerged, and we doubt they ever will.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• However, it is useful to view energy efficiency as the product of three factors that we can independently measure and optimize (see the decomposition below).

• The first term (a) measures facility efficiency, the second (b) server power conversion efficiency, and the third (c) the server’s architectural efficiency.
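The decomposition referred to above can be written, using the definitions of PUE and SPUE given later in this section, roughly as (our reconstruction):

  Efficiency = Computation / Total energy
             = (1 / PUE) × (1 / SPUE) × (Computation / Total energy to electronic components)

where term (a) is 1/PUE, term (b) is 1/SPUE, and term (c) is the computation delivered per unit of energy that actually reaches the electronic components.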

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• THE PUE METRIC

• Power usage effectiveness (PUE) reflects the quality of the datacenter building infrastructure itself, and captures the ratio of total building power to IT power (the power consumed by the actual computing and network equipment, etc.). Sometimes IT power is also referred to as “critical power.”

• PUE = (Total facility power) / (IT equipment power)

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• THE PUE METRIC

• PUE has gained a lot of traction as a datacenter efficiency metric since widespread reporting started around 2009.

• We can easily measure PUE by adding electrical meters to the lines powering the various parts of a datacenter, thus determining how much power is used by chillers or a UPS.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• THE PUE METRIC

• Historically, the PUE for the average datacenter has been embarrassingly poor.

• According to a 2006 study, 85% of then-current datacenters were estimated to have a PUE greater than 3.0.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• THE PUE METRIC

• In other words, the building’s mechanical and electrical systems consumed twice as much power as the actual computing load! Only 5% had a PUE of 2.0 or better.

• A subsequent EPA survey of over 100 datacenters reported an average PUE of 1.91, and a 2012 Uptime Institute survey of over 1100 datacenters covering a range of geographies and sizes reported an average PUE between 1.8 and 1.89.

ARCHITECTURAL OVERVIEW OF WSCS

Uptime Institute survey of PUE for 1100+ datacenters.

ARCHITECTURAL OVERVIEW OF WSCS

• SOURCES OF EFFICIENCY LOSSES IN DATACENTERS

• For illustration, let us walk through the losses in a typical datacenter.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• The second term (b) accounts for overheads inside servers or other IT equipment using a metric analogous to PUE: server PUE (SPUE).

• SPUE is the ratio of total server input power to its useful power, where useful power includes only the power consumed by the electronic components directly involved in the computation: motherboard, disks, CPUs, DRAM, I/O cards, and so on.

• Substantial amounts of power may be lost in the server’s power supply, voltage regulator modules (VRMs), and cooling fans.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• The product of PUE and SPUE constitutes an accurate assessment of the end-to-end electromechanical efficiency of a WSC. This true (or total) PUE metric (TPUE) is defined as TPUE = PUE × SPUE.
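As a numerical illustration, here is a short Python sketch; the specific PUE and SPUE values are assumed example numbers, not figures from the slides:

```python
# Illustrative TPUE calculation. The PUE and SPUE values are assumptions.

pue  = 1.2   # facility overhead: total building power / IT power
spue = 1.2   # server overhead: total server input power / useful power

tpue = pue * spue
print(f"TPUE = {tpue:.2f}")                                              # 1.44
print(f"Fraction of energy reaching the electronics: {1 / tpue:.0%}")    # ~69%
```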

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• MEASURING ENERGY EFFICIENCY

• Similarly, server-level benchmarks such as JouleSort and SPECpower characterize other aspects of computing efficiency.

• JouleSort measures the total system energy to perform an out-of-core sort and derives a metric that enables the comparison of systems ranging from embedded devices to supercomputers.

• SPECpower focuses on server-class systems and computes the performance-to-power ratio of a system running a typical business application on an enterprise Java platform.

ARCHITECTURAL OVERVIEW OF WSCS

• DATACENTER ENERGY EFFICIENCY

• MEASURING ENERGY EFFICIENCY

• Two separate benchmarking efforts aim to characterize the efficiency of storage systems: the Emerald Program by the Storage Networking Industry Association (SNIA) and SPC-2/E by the Storage Performance Council.

• Both benchmarks measure storage servers under different kinds of request activity and report ratios of transaction throughput per watt.

ARCHITECTURAL OVERVIEW OF WSCS

• Cost of a WSC

• To better understand the potential impact of energy-related optimizations, let us examine the total cost of ownership (TCO) of a datacenter.

• At the top level, costs split into capital expenses (Capex) and operational expenses (Opex).

• Capex refers to investments that must be made upfront and that are then depreciated over a certain time frame; examples are the construction cost of a datacenter or the purchase price of a server.

ARCHITECTURAL OVERVIEW OF WSCS

• Cost of a WSC

• Opex refers to the recurring monthly costs of actually running the equipment, excluding depreciation: electricity costs, repairs and maintenance, salaries of on-site personnel, and so on.

• Thus, we have:

  TCO = datacenter depreciation + datacenter Opex + server depreciation + server Opex

ARCHITECTURAL OVERVIEW OF WSCS

• Cost of a WSC

ARCHITECTURAL OVERVIEW OF WSCS

• Cost of a WSC

• The monthly depreciation cost (or amortization cost) that results from the initial construction expense depends on the duration over which the investment is amortized (which is related to its expected lifetime) and the assumed interest rate.

• Typically, datacenters are depreciated over periods of 10–15 years.

• Under U.S. accounting rules, it is common to use straight-line depreciation, where the value of the asset declines by a fixed amount each month.

ARCHITECTURAL OVERVIEW OF WSCS

• Cost of a WSC

• For example, if we depreciate a $12/W datacenter over 12 years, the depreciation cost is $0.08/W per month.

• If we had to take out a loan to finance construction at an interest rate of 8%, the associated monthly interest payments add an additional cost of $0.05/W, for a total of $0.13/W per month (see the sketch after this list).

• Typical interest rates vary over time, but many companies will pay interest in the 7–12% range.
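The $0.08/W and $0.05/W figures can be reproduced with straight-line depreciation plus a conventional amortizing-loan payment. The sketch below is our reconstruction of that arithmetic; only the inputs ($12/W, 12 years, 8%) come from the slides:

```python
# Sketch reproducing the example above: a $12/W facility depreciated over
# 12 years, financed at 8% annual interest. The loan-payment formula is our
# reconstruction; treating (payment - depreciation) as the average interest
# share is a simplification.

capex_per_watt = 12.0    # $/W construction cost
years          = 12
annual_rate    = 0.08

months = years * 12
depreciation = capex_per_watt / months                     # straight-line, ~$0.083/W per month

r = annual_rate / 12                                        # monthly interest rate
payment = capex_per_watt * r / (1 - (1 + r) ** -months)     # standard amortizing-loan payment
interest = payment - depreciation                           # average interest share

print(f"Depreciation: ${depreciation:.2f}/W per month")     # ~$0.08
print(f"Interest:     ${interest:.2f}/W per month")         # ~$0.05
print(f"Total:        ${payment:.2f}/W per month")          # ~$0.13
```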

ARCHITECTURAL OVERVIEW OF WSCS

• Cost of a WSC

• To put the cost of energy into perspective, Hamilton did a case study to estimate the costs of a WSC. He determined that the CAPEX of this 8 MW facility was $88M, and that the roughly 46,000 servers and corresponding networking equipment added another $79M to the CAPEX for the WSC.

ARCHITECTURAL OVERVIEW OF WSCS

• Cost of a WSC

• We can now price the total cost of energy, since U.S. accounting rules allow us to convert CAPEX into OPEX.

• We can just amortize CAPEX as a fixed amount each month for the effective life of the equipment.

• Note that the amortization rates differ significantly, from 10 years for the facility to 4 years for the networking equipment and 3 years for the servers.

• Hence, the WSC facility lasts a decade, but you need to replace the servers every 3 years and the networking equipment every 4 years.

• By amortizing the CAPEX, Hamilton came up with a monthly OPEX, including accounting for the cost of borrowing money (5% annually) to pay for the WSC; a rough sketch of this amortization appears below.

• At $3.8M, the monthly OPEX is about 2% of the CAPEX.
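A rough sketch of that amortization, using the CAPEX figures and lifetimes stated above. The split of the $79M between servers and networking is an assumption for illustration only, and the loan-payment formula mirrors the earlier depreciation example:

```python
# Rough sketch of Hamilton-style monthly CAPEX amortization. The facility
# ($88M, 10 years) and the combined server+network $79M come from the slides;
# the split of the $79M and the exact financing terms are assumptions.

def monthly_payment(capex: float, years: int, annual_rate: float = 0.05) -> float:
    """Monthly payment to amortize capex over `years` at `annual_rate` interest."""
    r, n = annual_rate / 12, years * 12
    return capex * r / (1 - (1 + r) ** -n)

assets = {                       # $M, lifetime in years
    "facility":   (88.0, 10),
    "servers":    (67.0, 3),     # assumed share of the $79M
    "networking": (12.0, 4),     # assumed share of the $79M
}

total = sum(monthly_payment(capex, life) for capex, life in assets.values())
print(f"Amortized CAPEX: ~${total:.1f}M per month")   # a large share of the ~$3.8M monthly OPEX
```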

ARCHITECTURAL OVERVIEW OF WSCS

• A Google Warehouse-Scale Computer

• Since many companies with WSCs are competing vigorously in the marketplace, until recently they have been reluctant to share their latest innovations with the public (and each other).

• In 2009, Google described a state-of-the-art WSC as of 2005.

• Google graciously provided an update of the 2007 status of their WSC, making this section the most up-to-date description of a Google WSC.

• Even more recently, Facebook described their latest datacenter as part of http://opencompute.org.

ARCHITECTURAL OVERVIEW OF WSCS

• A Google Warehouse-Scale Computer

• Containers

• Both Google and Microsoft have built WSCs using shipping containers.

• The idea of building a WSC from containers is to make WSC design modular.

• Each container is independent, and the only external connections are networking, power, and water.

• The containers in turn supply networking, power, and cooling to the servers placed inside them, so the job of the WSC is to supply networking, power, and cold water to the containers and to pump the resulting warm water to external cooling towers and chillers.

ARCHITECTURAL OVERVIEW OF WSCS

• A Google Warehouse-Scale Computer

• Containers

• The diagram is a cutaway drawing of a Google container.

• A container holds up to 1160 servers, so 45 containers have space for 52,200 servers. (This WSC has about 40,000 servers.)

• The servers are stacked 20 high in racks that form two long rows of 29 racks (also called bays) each, with one row on each side of the container.

• The rack switches are 48-port, 1 Gbit/sec Ethernet switches, which are placed in every other rack.

ARCHITECTURAL OVERVIEW OF WSCS

• A Google Warehouse-Scale Computer

• Containers

• The Google WSC that we are looking at contains 45 40-foot-long containers in a 300-foot by 250-foot space, or 75,000 square feet (about 7000 square meters).

• To fit in the warehouse, 30 of the containers are stacked two high, or 15 pairs of stacked containers.

• Although the location was not revealed, it was built at the time that Google was developing WSCs in The Dalles, Oregon, which provides a moderate climate and is near cheap hydroelectric power and Internet backbone fiber.

ARCHITECTURAL OVERVIEW OF WSCS

• A Google Warehouse-Scale Computer

• Containers

• This WSC offers 10 megawatts with a PUE of 1.23 over the prior 12 months.

• Of that 0.230 of PUE overhead, 85% goes to cooling losses (0.195 PUE) and 15% (0.035) goes to power losses.

• The system went live in November 2005, and this section describes its state as of 2007.

• A Google container can handle up to 250 kilowatts. That means the container can handle 780 watts per square foot (0.09 square meters), or 133 watts per square foot across the entire 75,000-square-foot space with 40 containers; the arithmetic is checked in the sketch below.

• However, the containers in this WSC average just 222 kilowatts.
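The power-density figures above can be double-checked with a few lines of Python; the 320-square-foot footprint of a 40-foot container (40 ft × 8 ft) is our assumption, used to recover the ~780 W/sq ft number:

```python
# Checking the container power-density figures from the slides. The 40 ft x 8 ft
# container footprint is an assumption, not stated in the text.

container_kw       = 250
container_sq_ft    = 40 * 8            # assumed footprint of a 40-foot container
warehouse_sq_ft    = 75_000
powered_containers = 40

print(f"Per container: {container_kw * 1000 / container_sq_ft:.0f} W/sq ft")                      # ~781
print(f"Whole floor:   {powered_containers * container_kw * 1000 / warehouse_sq_ft:.0f} W/sq ft")  # ~133
```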

ARCHITECTURAL OVERVIEW OF WSCS

• A Google Warehouse-Scale Computer

• Containers

ARCHITECTURAL OVERVIEW OF WSCS

• A Google Warehouse-Scale Computer

• Containers

• Servers In A Google WSC

• The server in Figure 6.21 has two sockets, each containing a dual-core AMD Opteron processor running at 2.2 GHz. The photo shows eight DIMMs, and these servers are typically deployed with 8 GB of DDR2 DRAM.

• A novel feature is that the memory bus is downclocked to 533 MHz from the standard 666 MHz, since the slower bus has little impact on performance but a significant impact on power.

• The baseline design has a single network interface card (NIC) for a 1 Gbit/sec Ethernet link.

ARCHITECTURAL OVERVIEW OF WSCS

• A Google Warehouse-Scale Computer

• Containers

• Servers In A Google WSC

• Although the photo in Figure 6.21 shows two SATA disk drives, the baseline server has just one.

• The peak power of the baseline is about 160 watts, and idle power is 85 watts.

• This baseline node is supplemented to offer a storage (or “diskfull”) node.

• First, a second tray containing 10 SATA disks is connected to the server.

• To get one more disk, a second disk is placed into the empty spot on the motherboard, giving the storage node 12 SATA disks.

• Finally, since a storage node could saturate a single 1 Gbit/sec Ethernet link, a second Ethernet NIC was added.

• Peak power for a storage node is about 300 watts, and it idles at 198 watts.