Post on 26-Feb-2016


Page 1: Data Center Scale Computing

Data Center Scale Computing

Presentation by: Ken Bakke, Samantha Orogvany, John Greene

"If computers of the kind I have advocated become the computers of the future, then computing may someday be organized as a public utility just as the telephone system is a public utility. . . . The computer utility could become the basis of a new and important industry."

- John McCarthy, MIT centennial celebration (1961)

Page 2: Data Center Scale Computing

Outline
● Introduction
● Data center system components
● Design and storage considerations
● Data center power supply
● Data center cooling
● Data center failures and fault tolerance
● Data center repairs
● Current challenges
● Current research, trends, etc.
● Conclusion

Page 3: Data Center Scale Computing

Data Center vs Warehouse-Scale Computer

Data center
• Provide colocated equipment
• Consolidate heterogeneous computers
• Serve wide variety of customers
• Binaries typically run on a small number of computers
• Resources are partitioned and separately managed
• Facility and computing resources are designed separately
• Share security, environmental and maintenance resources

Warehouse-scale computer
• Designed to run massive internet applications
• Individual applications run on thousands of computers
• Homogeneous hardware and system software
• Central management for a common resource pool
• The design of the facility and the computer hardware is integrated

Page 4: Data Center Scale Computing

Need for Warehouse-scale Computers

• Renewed focus on client-side consumption of web resources
• Constantly increasing numbers of web users
• Constantly expanding amounts of information
• Desire for rapid response for the end user
• Focus on reducing the cost of delivering massive applications
• Increased interest in Infrastructure as a Service (IaaS)

Page 5: Data Center Scale Computing

Performance and Availability Techniques

• Replication
• Reed-Solomon codes
• Sharding
• Load-balancing
• Health checking
• Application-specific compression
• Eventual consistency
• Centralized control
• Canaries
• Redundant execution and tail tolerance
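Two of the techniques listed above, sharding and replication, can be illustrated with a short sketch. This is one common approach (hash the key to pick a shard, then place copies on the following shards), not necessarily what the presenters had in mind; the function names, the key format, and the replication factor of 3 are all assumptions made for illustration.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(key: str, num_shards: int, replication: int = 3) -> list:
    """Return the shards holding copies of the key: the home shard plus
    the next (replication - 1) shards, wrapping around."""
    first = shard_for(key, num_shards)
    return [(first + i) % num_shards for i in range(replication)]


if __name__ == "__main__":
    # "user:42" is a hypothetical key used only for demonstration.
    print(replicas_for("user:42", num_shards=16))
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the key-to-shard mapping identical across machines and restarts.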

Page 6: Data Center Scale Computing

Major system components

• Typical server: 4 CPUs, each with 8 dual-threaded cores, yielding 32 cores (64 hardware threads)
• Typical rack: 40 servers and a 1 or 10 Gbps Ethernet switch
• Cluster: a cluster switch and 16 to 64 racks

A cluster may contain tens of thousands of processing threads.
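The thread count follows directly from the per-server, per-rack, and per-cluster figures on this slide. A minimal sketch of the arithmetic, with the slide's numbers as default parameters (the function name is hypothetical):

```python
def cluster_threads(racks, servers_per_rack=40, cpus_per_server=4,
                    cores_per_cpu=8, threads_per_core=2):
    """Hardware threads in a cluster, using the slide's per-server figures."""
    threads_per_server = cpus_per_server * cores_per_cpu * threads_per_core
    return racks * servers_per_rack * threads_per_server


if __name__ == "__main__":
    # The slide gives a range of 16 to 64 racks per cluster.
    for racks in (16, 64):
        print(racks, "racks:", cluster_threads(racks), "threads")
```

With these figures a 16-rack cluster has 40,960 hardware threads and a 64-rack cluster has 163,840.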

Page 7: Data Center Scale Computing

Low-end Server vs SMP
• Latency is 1000 times faster in an SMP
• Less impact on applications too large for a single server

Performance advantage of a cluster built with large SMP server nodes (128-core SMP) over a cluster with the same number of processor cores built with low-end server nodes (four-core SMP), for clusters of varying size.

Page 8: Data Center Scale Computing

Brawny vs Wimpy

Advantages of wimpy computers
• Multicore CPUs carry a premium cost of 2-5 times vs multiple smaller CPUs
• Memory- and I/O-bound applications do not take advantage of faster CPUs
• Slower CPUs are more power efficient

Disadvantages of wimpy computers
• Increasing parallelism is programmatically difficult
• Programming costs increase
• Networking requirements increase
• Fewer tasks / smaller size creates loading difficulties
• Amdahl’s law limits the benefit of added parallelism
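The Amdahl's law point can be made concrete with the standard formula: if a fraction p of the work can be parallelized across n workers, the speedup is 1 / ((1 - p) + p / n). The sketch below is illustrative only; the 95% parallel fraction is an assumed example value, not a figure from the presentation.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup over one worker when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed example: 95% of the work is parallelizable.
    for n in (1, 4, 32, 128, 1024):
        print(n, "workers:", round(amdahl_speedup(0.95, n), 2))
```

However many wimpy cores are added, the speedup in this example can never exceed 1 / 0.05 = 20, which is why the serial portion of a workload favors faster individual cores.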

Page 9: Data Center Scale Computing

Design Considerations
• Software design and improvements can be made to align with architectural choices
• Resource requirements and utilization can be balanced among all applications
  o Spare CPU cycles can be used for processing-intensive applications
  o Spare storage can be used for archival purposes
• Fungible resources are more efficient
• Workloads can be distributed to fully utilize servers
• Focus on cost-effectiveness

Smart programmers may be able to restructure algorithms to match a less expensive design.

Page 10: Data Center Scale Computing

Storage Considerations

Private Data
• Local DRAM, SSD or disk

Shared State Data
• High throughput for thousands of users
• Robust performance tolerant to errors
• Unstructured storage (Google: GFS)
  o Master plus thousands of “chunk” servers
  o Utilizes every system with a disk drive
  o Cross-machine replication
  o Robust performance tolerant to errors
• Structured storage
  o Bigtable maps a (row key, column key, timestamp) tuple to a byte array
  o Trade-offs favor high performance and massive availability
  o Eventual consistency model leaves applications managing consistency issues
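The structured-storage mapping can be pictured as a dictionary from a key tuple to bytes. The toy class below is only a sketch of that idea, assuming the slide's "Row, Key, Timestamp" means a row key, a column key, and a timestamp; the class and method names are hypothetical and this is not Bigtable's API. It ignores distribution, persistence, and consistency entirely.

```python
class ToyTable:
    """In-memory map of (row, column, timestamp) -> bytes (illustration only)."""

    def __init__(self):
        self._cells = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Each write adds a new version rather than overwriting older ones.
        self._cells[(row, column, timestamp)] = value

    def get_latest(self, row: str, column: str):
        """Return the value with the highest timestamp, or None if absent."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    table = ToyTable()
    table.put("com.example", "contents", 1, b"<html>v1</html>")
    table.put("com.example", "contents", 2, b"<html>v2</html>")
    print(table.get_latest("com.example", "contents"))
```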

Page 11: Data Center Scale Computing

Google File System

Page 12: Data Center Scale Computing

WSC Network Architecture

Leaf Bandwidth
• Bandwidth between servers in a common rack
• Typically managed with a commodity switch
• Easily increased by increasing the number of ports or the speed of ports

Bisection Bandwidth
• Bandwidth between the two halves of a cluster
• Matching leaf bandwidth requires as many uplinks to the fabric as links within a rack
• Since distances are longer, optical interfaces are required

Page 13: Data Center Scale Computing

Three Stage Topology

Required to maintain same throughput as single switch.

Page 14: Data Center Scale Computing

Network Design

• Oversubscription ratios of 4-10 are common
• Limit network cost per server
• Offloading to special networks
• Centralized management
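An oversubscription ratio compares the bandwidth servers can offer to a rack switch with the bandwidth that switch has toward the rest of the cluster. A minimal sketch of the calculation follows; the function name and the specific port counts are assumptions chosen to land inside the 4-10 range quoted above.

```python
def oversubscription(servers: int, server_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth for one rack."""
    return (servers * server_gbps) / (uplinks * uplink_gbps)


if __name__ == "__main__":
    # Assumed rack: 40 servers at 1 Gbps each.
    print(oversubscription(40, 1, uplinks=1, uplink_gbps=10))   # one 10 Gbps uplink
    print(oversubscription(40, 1, uplinks=4, uplink_gbps=1))    # four 1 Gbps uplinks
    print(oversubscription(40, 1, uplinks=40, uplink_gbps=1))   # no oversubscription
```

A ratio of 1 needs as many uplinks as there are server links in the rack, which is what drives the network cost per server up.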

Page 15: Data Center Scale Computing

Service level response times

Consider servers whose 99th-, 99.9th-, or 99.99th-percentile latency exceeds 1 s, plotted against the number of service requests required.

Selective replication is one mitigating strategy
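The effect of fan-out on response time can be estimated with a simple probability model. The sketch below assumes each server independently exceeds 1 s with a fixed probability (1% for a 99th-percentile server) and that a request is slow if any server it touches is slow; both the independence assumption and the function name are illustrative, not taken from the slides.

```python
def prob_slow_request(p_slow_server: float, fan_out: int) -> float:
    """Probability that at least one of fan_out servers responds slowly,
    assuming independent servers."""
    return 1.0 - (1.0 - p_slow_server) ** fan_out


if __name__ == "__main__":
    # 99th, 99.9th and 99.99th percentile servers.
    for p in (0.01, 0.001, 0.0001):
        for n in (1, 100, 2000):
            print(f"p={p}, fan-out={n}: {prob_slow_request(p, n):.3f}")
```

Under these assumptions, a request that waits on 100 servers, each slow only 1% of the time, is itself slow roughly 63% of the time.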

Page 16: Data Center Scale Computing
Page 17: Data Center Scale Computing

Power Supply Distribution

Uninterruptible Power Systems
• A transfer switch is used to choose the active power input from either utility sources or a generator
• After a power failure, the transfer switch detects the generator and, after 10-15 seconds, provides power
• The power system has energy storage to provide additional protection between the failure of main utility power and the point when the generators begin carrying the full load
• Levels the incoming power feed to remove spikes and sags from the AC feed

Page 18: Data Center Scale Computing

Example of Power Distribution Units

Traditional PDU
• Takes in power output from the UPS
• Regulates power with transformers to distribute power to servers
• Handles 75-225 kW typically
• Provides redundancy by switching between 2 power sources

Page 19: Data Center Scale Computing

Examples of Power Distribution

Facebook’s power distribution system
• Designed to increase power efficiency by reducing energy loss to about 15%
• Eliminates the UPS and PDU and adds an on-board 12 V battery for each cabinet

Page 20: Data Center Scale Computing

Power Supply Cooling Needs

Air Flow Consideration
• Fresh air cooling
  o “Opening the windows”
• Closed loop system
  o Underfloor systems
  o Servers are on raised concrete tile floors

Page 21: Data Center Scale Computing

Power Cooling Systems

2-loop Systems
Loop 1 - Hot air/cool air circuit (red/blue arrows)
Loop 2 - Liquid supply to Computer Room Air Conditioning (CRAC) units and heat discharging

Page 22: Data Center Scale Computing

Example of Cooling System Design

3-Loop System
• Chiller sends cooled water to CRACs
• Heated water sent from building to chiller for heat dispersal
• Condenser water loop flows into cooling tower

Page 23: Data Center Scale Computing

Cooling System for Google

Page 24: Data Center Scale Computing

Estimated Annual Costs

Page 25: Data Center Scale Computing

Estimated Carbon Costs for Power

Based on local utility power generated via the use of oil, natural gas, coal or renewable sources, including hydroelectricity, solar energy, wind and biofuels

Page 26: Data Center Scale Computing

Power Efficiency

Sources of Efficiency Loss
• Overhead of cooling systems, such as chillers
• Air movement
• IT equipment
• Power distribution unit

Improvements to Efficiency
• Handle air flow more carefully: keep the cooling path short and separate the hot air leaving the servers from the rest of the system
• Consider raising cooling temperatures
• Employ “free cooling” by locating the datacenter in a cooler climate
• Select a more efficient power system

Page 27: Data Center Scale Computing

Data Center Failures

Reliability of Data Center
Trade-off between the cost of failures and the cost of repairing and preventing them.

Fault Tolerances
• Traditional servers require a high degree of reliability and redundancy to prevent failures as much as possible
• For warehouse-scale computers, this is not practical
  o Example: a cluster of 10,000 servers will have an average of 1 server failure/day
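The one-failure-per-day example is a consequence of scale rather than of poor hardware. As a rough sketch, assume every server fails independently with the same mean time between failures (MTBF); that assumption, the function names, and the derived MTBF are illustrative and do not come from the slides. One failure per day across 10,000 servers then corresponds to a per-server MTBF of 10,000 days, roughly 27 years.

```python
def expected_failures_per_day(servers: int, mtbf_days: float) -> float:
    """Average number of server failures per day across the cluster."""
    return servers / mtbf_days


def prob_any_failure_per_day(servers: int, mtbf_days: float) -> float:
    """Chance that at least one server fails on a given day,
    assuming independent failures with daily probability 1 / mtbf_days."""
    p_fail = 1.0 / mtbf_days
    return 1.0 - (1.0 - p_fail) ** servers


if __name__ == "__main__":
    print(expected_failures_per_day(10_000, 10_000))
    print(round(prob_any_failure_per_day(10_000, 10_000), 3))
    print(round(10_000 / 365.25, 1), "years MTBF per server")
```

Even with servers that individually last decades on average, the cluster as a whole sees failures almost every day, so tolerance has to be built into the software.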

Page 28: Data Center Scale Computing

Data Center Failures

Fault Severity Categories
• Corrupted
  o Data is lost, corrupted, or cannot be regenerated
• Unreachable
  o Service is down
• Degraded
  o Service is available, but limited
• Masked
  o Faults occur, but fault tolerance masks them from the user

Page 29: Data Center Scale Computing

Data Center Fault Causes

Causes
• Software errors
• Faulty configs
• Human error
• Networking faults
• Faulty hardware

It’s easier to tolerate known hardware issues than software bugs or human error.

Repairs
• It’s not critical to quickly repair individual servers
• In reality, repairs are scheduled as a ‘daily sweep’
• Individual failures mostly do not affect overall data center health
• System is designed to tolerate faults

Page 30: Data Center Scale Computing

Google Restarts and Downtime

Page 31: Data Center Scale Computing

Relatively New Class of Computers

• Facebook founded in 2004
• Google’s Modular Data Center in 2005
• Microsoft’s Online Services Division in 2005
• Amazon Web Services in 2006
• Netflix added streaming in 2007

Page 32: Data Center Scale Computing

Balanced System

• Nature of workload at this scale is:
  o Large volume
  o Large variety
  o Distributed
• This means no servers (or parts of servers) get to slack while others do the work.
• Keep servers busy to amortize cost
• Need high performance from all components!

Page 33: Data Center Scale Computing

Imbalanced Parts

• Latency lags bandwidth

Page 34: Data Center Scale Computing

Imbalanced Parts

• CPUs have been historical focus

Page 35: Data Center Scale Computing

Focus Needs to Shift

• Push toward SaaS will highlight these disparities
• Requires concentrating research on:
  o Improving non-CPU components
  o Improving responsiveness
  o Improving end-to-end experience

Page 36: Data Center Scale Computing

Why does latency matter?

• Responsiveness dictated by latency
• Productivity affected by responsiveness

Page 37: Data Center Scale Computing

Real Estate Considerations

• Land
• Power
• Cooling
• Taxes
• Population
• Disasters

Page 38: Data Center Scale Computing

Google’s Data Centers

Page 39: Data Center Scale Computing

Economical Efficiency
• The data center itself is a non-trivial cost
  o Does not include land
• Servers are the bigger cost
  o More servers desirable
  o Busy servers desirable

Page 40: Data Center Scale Computing

Improving Efficiency

• Better components
  o Energy proportional (less use == less energy)
• Power-saving modes
  o Transparent (e.g., clock-gating)
  o Active (e.g., CPU throttling)
  o Inactive (e.g., idle drives stop spinning)
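Energy proportionality ("less use == less energy") can be illustrated with a simple linear power model. The model, the function names, and the wattages below are all assumptions made for the example: a server is taken to draw a fixed idle power plus an amount that grows linearly with utilization, and a perfectly energy-proportional server has zero idle power.

```python
def power_watts(utilization: float, peak_watts: float, idle_watts: float) -> float:
    """Power draw under an assumed linear model between idle and peak."""
    return idle_watts + (peak_watts - idle_watts) * utilization


def work_per_watt(utilization: float, peak_watts: float, idle_watts: float) -> float:
    """Useful work (utilization) delivered per watt consumed."""
    return utilization / power_watts(utilization, peak_watts, idle_watts)


if __name__ == "__main__":
    # Hypothetical servers: both peak at 200 W; one idles at 100 W, one at 0 W.
    for u in (0.1, 0.3, 0.5, 1.0):
        typical = power_watts(u, 200, 100)
        proportional = power_watts(u, 200, 0)
        print(f"utilization {u:.0%}: typical {typical:.0f} W, "
              f"proportional {proportional:.0f} W")
```

Under this model the server with a high idle draw is least efficient exactly where servers spend much of their time, at low to moderate utilization, which is the motivation for the power-saving modes listed above.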

Page 41: Data Center Scale Computing

Changing Workloads

• Workloads more agile in nature
• SaaS: shorter release cycles
  o Office 365 updates several times per year
  o Some Google services update weekly
• Even major software gets rewritten
  o Google search engine rewritten from scratch 4 times
• Internet services are still young
  o Usage can be unpredictable

Page 42: Data Center Scale Computing

YouTube

• Started in 2005
• Fifth most popular site within first year

Page 43: Data Center Scale Computing
Page 44: Data Center Scale Computing

Adapting

• Strike a balance between the need to deploy and longevity
  o Need it fast and good
• Design to make software easy to create
  o Easier to find programmers
• Redesign when warranted
  o Google Search’s rewrites removed inefficiencies
  o Contrast to Intel’s backwards compatibility spanning decades

Page 45: Data Center Scale Computing

Future Trends

● Continued emphasis on:
  ○ Parallelism
  ○ Networking, both within and to/from datacenters
  ○ Reliability via redundancy
  ○ Optimizing efficiency (energy proportionality)
● Environmental impact
● Energy costs
● Amdahl’s law will remain a major factor
● Need increased focus on end-to-end systems
● Computing as a utility?

Page 46: Data Center Scale Computing

“Anyone can build a fast CPU. The trick is to build a fast system.”

-Seymour Cray
