
OPTIMIZING DATA-CENTER TCO WITH SCALE-OUT PROCESSORS

PERFORMANCE AND TOTAL COST OF OWNERSHIP (TCO) ARE KEY OPTIMIZATION METRICS IN LARGE-SCALE DATA CENTERS. ACCORDING TO THESE METRICS, DATA CENTERS DESIGNED WITH CONVENTIONAL SERVER PROCESSORS ARE INEFFICIENT. RECENTLY INTRODUCED PROCESSORS BASED ON LOW-POWER CORES CAN IMPROVE BOTH THROUGHPUT AND ENERGY EFFICIENCY COMPARED TO CONVENTIONAL SERVER CHIPS. HOWEVER, A SPECIALIZED SCALE-OUT PROCESSOR (SOP) ARCHITECTURE MAXIMIZES ON-CHIP COMPUTING DENSITY TO DELIVER THE HIGHEST PERFORMANCE PER TCO AND PERFORMANCE PER WATT AT THE DATA-CENTER LEVEL.

We are in the midst of an information revolution, driven by ubiquitous access to vast data stores via a variety of richly networked platforms. Data centers are the workhorses powering this revolution. Companies leading the transformation to the digital universe, such as Google, Microsoft, and Facebook, rely on networks of megascale data centers to provide search, social connectivity, media streaming, and a growing number of other offerings to large, distributed audiences. A scale-out data center powering cloud services can house tens of thousands of servers that are necessary for high scalability, availability, and resilience.[1]

The massive scale of such data centers requires an enormous capital outlay for infrastructure and hardware, often exceeding $100 million per data center.[2] Similarly expansive are the power requirements, typically in the range of 5 to 15 MW per data center, totaling millions of dollars in annual operating costs. With demand for information services skyrocketing around the globe, efficiency has become a paramount concern in the design and operation of large-scale data centers.

To reduce infrastructure, hardware, and energy costs, data-center operators target high computing density and power efficiency. Total cost of ownership (TCO) is an optimization metric that considers the costs of real estate, power delivery and cooling infrastructure, and hardware acquisition, as well as operating expenses. Because server acquisition and power costs constitute the two largest TCO components,[3] servers present a prime optimization target in the quest for more efficient data centers. In addition to cost, performance is also critical in scale-out data centers designed to service thousands of concurrent requests with real-time constraints. The ratio of performance to TCO (performance per dollar of ownership expense) is thus an appropriate metric for evaluating different data-center designs.
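The performance-per-TCO metric just described amounts to a simple ratio. The sketch below mirrors the cost categories named above; the function shapes and any figures are illustrative assumptions, not the authors' actual cost model.

```python
# Illustrative sketch of the performance-per-TCO metric described above.
# The cost components mirror the categories named in the text; the
# function shapes are our assumptions, not the paper's model.

def tco(real_estate, infrastructure, server_acquisition, power, operating):
    """Total cost of ownership ($) as the sum of its major components."""
    return real_estate + infrastructure + server_acquisition + power + operating

def perf_per_tco(throughput, **costs):
    """Data-center throughput per dollar of ownership expense."""
    return throughput / tco(**costs)
```

A design with higher throughput at the same TCO, or the same throughput at lower TCO, scores higher on this metric.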

Scale-out workloads prevalent in large-scale data centers rely on in-memory


Pejman Lotfi-Kamran, Babak Falsafi, and Boris Grot, EPFL

Chrysostomos Nicopoulos, Yiannakis Sazeides, and Damien Hardy, University of Cyprus


Published by the IEEE Computer Society, 0272-1732/12/$31.00 © 2012 IEEE


processing and massive parallelism to guarantee low response latency and high throughput. Although processors ultimately determine a server's performance characteristics, they contribute just a fraction of the overall purchase price and power burden in a server node. Memory, disk, networking equipment, power provisioning, and cooling all contribute substantially to acquisition and operating costs. Moreover, these components are less energy proportional than modern processors, meaning their power requirements don't scale down well as the server load drops. Thus, maximizing the benefit from the TCO investment requires getting high utilization from the entire server, not just the processor.

To achieve high server utilization, data centers must employ processors that can fully leverage the available bandwidth to memory and I/O. Conventional server processors use powerful cores designed for a broad range of workloads, including scientific, gaming, and media processing. As a result, they deliver good performance across the workload range, but they fail to maximize either performance or efficiency on memory-intensive scale-out applications. Emerging server processors, on the other hand, employ simpler core microarchitectures that improve efficiency but fall short of maximizing performance. What the industry needs are server processors that jointly optimize for performance, energy, and TCO.

With this in mind, we developed a methodology for designing performance-density-optimal server chips called Scale-Out Processors (SOPs). Our SOP methodology improves data-center efficiency through a many-core organization tuned to the demands of scale-out workloads.

Today's server processors

Multicore processors common today are well suited for massively parallel scale-out workloads running in data centers. First, they improve throughput per chip over single-core designs. Second, they amortize on-chip and board-level resources among multiple hardware threads, thereby lowering both cost and power consumption per unit of work (that is, per thread).

Table 1 summarizes the principal characteristics of today's server processors. Existing data centers are built with server-class designs from Intel and AMD. A representative processor is Intel's Xeon 5670,[4] a mid-range design that integrates six powerful dual-threaded cores and a spacious 12-Mbyte last-level cache (LLC). The Xeon 5670 consumes 95 W at a maximum frequency of 3 GHz. The combination of powerful cores and relatively large chip size leads us to classify conventional server processors as big-core, big-chip designs.

Recently, several companies have introduced processors featuring simpler core microarchitectures that specifically target


Table 1. Server chip characteristics. The first three processors are existing designs, and the last two processors are proposed designs.

Type                    | Processor           | Cores, threads | Last-level cache size (Mbytes) | DDR3 interfaces | Frequency (GHz) | Power (W) | Area (mm2) | Cost per processor ($)
Big core, big chip      | Conventional        | 6, 12          | 12                             | 3               | 3               | 95        | 233        | 800
Small core, small chip  | Small chip          | 4, 4           | 4                              | 1               | 1.5             | 6         | 62         | 95
Small core, big chip    | Tiled               | 36, 36         | 9                              | 2               | 1.5             | 28        | 132        | 300
Scale-out, in order     | Scale-Out Processor | 48, 48         | 4                              | 3               | 1.5             | 34        | 132        | 320
Scale-out, out of order | Scale-Out Processor | 16, 16         | 4                              | 2               | 2               | 33        | 132        | 320

IEEE MICRO, SEPTEMBER/OCTOBER 2012: ENERGY-AWARE COMPUTING


scale-out data centers. Research has shown simple-core designs to be well matched to the demands of many scale-out workloads, which spend a high fraction of their time accessing memory and have moderate computational intensity.[5] Two design paradigms have emerged in this space: one type features a few small cores on a small chip (small core, small chip); the other integrates many small cores on a bigger chip (small core, big chip).

Companies including Calxeda, Marvell, and SeaMicro market small-core, small-chip processors targeted at data centers. Despite the differences in the core organization and even the instruction set architecture (ISA): Calxeda's and Marvell's designs are powered by ARM, whereas SeaMicro uses an x86-based Atom processor; the chips are surprisingly similar in their feature set: all have four hardware contexts, dual-issue cores, a clock speed in the range of 1.1 to 1.6 GHz, and power consumption of 5 to 10 W. We use the Calxeda design as a representative configuration, featuring four Cortex-A9 cores, a 4-Mbyte LLC, and an on-die memory controller.[6] At 1.5 GHz, our model estimates a peak power consumption of 6 W.

A processor representative of the small-core, big-chip design philosophy is Tilera's Tile-Gx3036. This server-class processor features 36 simple cores and a 9-Mbyte LLC in a tiled organization.[7] Each tile integrates a core, a slice of the shared LLC, and a router. Accesses to the distributed LLC's remote banks require a traversal of the on-chip interconnect, implemented as a 2D mesh network with a single-cycle per-hop delay. Operating at 1.5 GHz, the Tilera-like tiled design draws approximately 28 W of power at peak load.

To understand the efficiency implications of these diverse processor architectures, we use a combination of analytic models and simulation-based studies, employing a full-system server simulation infrastructure, to estimate their performance, area, and power characteristics. Our workloads are taken from CloudSuite (http://parsa.epfl.ch/cloudsuite), a collection of representative scale-out applications that includes web search, data serving, and MapReduce.

Figure 1a compares the designs along two dimensions: performance density and energy efficiency. Performance density, expressed as performance per mm2, measures the processor's ability to effectively utilize the chip real estate. Energy efficiency, in units of performance per watt, indicates the processor's ability to convert energy into useful work.
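As a concrete reading of these two metrics, the sketch below computes them from the area and power columns of Table 1. The throughput values are illustrative placeholders, not measured results from the paper.

```python
# Performance density (perf/mm^2) and energy efficiency (perf/W) as defined
# above. Area and power figures come from Table 1; the throughput values
# are illustrative placeholders, NOT measured numbers from the paper.

chips = {
    # name:          (throughput, area_mm2, peak_power_w)
    "conventional": (60.0, 233, 95),
    "small_chip":   (20.0,  62,  6),
    "tiled":        (70.0, 132, 28),
}

def performance_density(throughput, area_mm2):
    return throughput / area_mm2

def energy_efficiency(throughput, power_w):
    return throughput / power_w

for name, (perf, area, power) in chips.items():
    print(name,
          round(performance_density(perf, area), 3),
          round(energy_efficiency(perf, power), 3))
```

Note that the two metrics can diverge: a chip can convert energy efficiently while still wasting silicon area, which is exactly the pattern discussed next.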

The small-core, small-chip processor offers a 2.2× improvement in energy efficiency over a conventional big-core design, thanks to the former's simpler core microarchitecture. However, the small-chip design has 45 percent lower performance density than the conventional one. To better


Figure 1. Efficiency, area, and power of today's server processors: performance density and energy efficiency (a); processor area and power breakdown (b). We use a combination of analytic models and simulation-based studies to estimate the performance, area, and power characteristics.


understand the trends, Figure 1b shows a breakdown of the respective processors' area and power budgets. The data reveals that while the cores in the conventional server processor take up 44 percent of the chip area, the small-chip design commits just 20 percent of the chip to compute, with the remainder of the area going to the LLC, I/O, and auxiliary circuitry. In terms of power, the six conventional cores consume 71 W of the 95-W power budget (75 percent), whereas the four simpler cores in the small-chip organization dissipate just 2.4 W (38 percent of total chip power) under full load. As with the area, the relative energy cost of the cache and peripheral circuitry in the small-chip design is greater than in the conventional design (62 percent and 25 percent of the respective chips' power budgets).

The most efficient design point is the small-core, big-chip tiled processor, which surpasses both conventional and small-chip alternatives by more than 88 percent in performance density and 65 percent in energy efficiency. The cores in the tiled processor take up 36 percent of the chip real estate, nearly doubling the fraction of the area dedicated to execution resources as compared to the small-chip design. The fraction of the power devoted to execution resources increases to 48 percent, compared to 38 percent in the small-chip design.

Our results corroborate earlier studies that identify efficiency benefits stemming from the use of lower-complexity cores as compared to those used in conventional server processors.[8,9] However, our findings also identify an important, yet unsurprising, trend: the use of simpler cores by itself is insufficient for maximizing processor efficiency; the chip-level organization must also be considered. More specifically, a larger chip that integrates many cores is necessary to amortize the area and power expense of uncore resources, such as caches and off-chip interfaces, by multiplexing them among the cores.

Scale-Out Processors

To maximize silicon efficiency on scale-out workloads, we examined the characteristics of a suite of representative scale-out applications and the demands they place on

processor resources. Our findings, consistent with prior work,[10,11] indicate that

- large LLCs are not beneficial for capturing data-center applications' enormous data footprint;
- the active instruction footprint greatly exceeds the Level-1 (L1) caches' capacity, but can be accommodated with a 2- to 4-Mbyte secondary cache; and
- scale-out workloads have virtually no thread-to-thread communication, requiring minimal on-chip coherence and communication infrastructure.

Driven by these observations, we developed the SOP design methodology, which extends the small-core, big-chip design space by optimizing the on-chip cache capacity, core count, interconnect delay, and number of interfaces to the off-chip memory in a way that maximizes computing density and throughput.[12]

At the heart of an SOP is a coarse-grained building block called a pod: a stand-alone multicore server. Each pod features a modestly sized 2- to 4-Mbyte LLC for capturing the active instruction footprint and commonly accessed data structures. The small LLC size reduces the cache access time and leaves more chip area for the cores. To further reduce the latency of performance-critical LLC accesses, SOPs use a high-bandwidth crossbar interconnect instead of a multihop point-to-point network. The number of cores in a pod is chosen empirically to maximize cache utilization without causing thrashing or penalizing interconnect area and delay.

The SOP architecture achieves scalability through tiling at the pod granularity, up to the available area, power, or memory-bandwidth limit. The multiple pods share the off-chip interfaces to reduce cost and maximize bandwidth utilization. The pod-based tiling strategy reduces chip-level complexity and provides a technology-scalable architecture that preserves each pod's optimality across technology generations. Figure 2 compares the SOP chip architecture to conventional, small-chip, and tiled designs.

Compared to a tiled design, an SOP increases the number of cores integrated on


chip by 33 percent, from 36 to 48, within the same area budget. SOPs achieve this computing-density improvement by reducing the LLC capacity from 9 to 4 Mbytes, freeing up chip area for the cores. The resulting SOP devotes 47 percent of the chip area to the cores, up from 36 percent in the tiled processor. Our evaluation shows that the per-core performance in the SOP design is comparable to that in a tiled one; although the SOP's smaller LLC capacity has a dampening effect on single-threaded performance, the lower access delay in the crossbar-based SOP compensates for the higher miss ratio by accelerating fetches of performance-critical instructions. The bottom line is that the SOP design improves chip-level performance (that is, throughput) by 33 percent over the tiled processor. Finally, the SOP design's peak power consumption is higher than that of the tiled processor, owing to the former's greater on-chip computing capacity. However, as our results demonstrate, the SOP's greater chip-level processing capability is beneficial from a TCO perspective despite the increased power draw at the chip level.

Methodology

We now describe the cost models, server hardware features, workloads, and simulation infrastructure used in evaluating the various chip organizations at the data-center level.

TCO model

Large-scale data centers employ high-density server racks to reduce the space footprint and improve cost efficiency. A standard rack can accommodate up to 42 1U (one-rack-unit) servers, with each server integrating one or more processors, multiple DRAM DIMMs, disk- or flash-based storage, and a network interface. Servers in a rack share the power distribution infrastructure and network interfaces with the rest of the data center. The number of racks in a large-scale data center is commonly constrained by the available power budget.

Our TCO analysis, derived using EETCO,[13] considers four major expense categories. Table 2 further details the key parameters.

Data-center infrastructure. This includes the land, building, and power provisioning and cooling equipment, with a 15-year depreciation schedule. The data-center area is primarily determined by the IT (rack) area, with cooling and power provisioning equipment factored in. We estimate the cost of this equipment per watt of critical power.

Server and networking hardware. Server hardware includes processors, memory, disks, and motherboards. We also account for the networking gear at the data center's edge, aggregation, and core layers, and assume that the cost scales with the number of racks. The amortization schedule is three years for server hardware and four years for networking equipment.

Power. This is predominantly determined by the servers, including fans and power


Figure 2. Comparison of server processor organizations: conventional (a), small chip (b), tiled (c), and Scale-Out (d). The Scale-Out design achieves the highest performance density through modestly sized caches and many cores in a multi-pod organization.


supplies, networking gear, and cooling equipment. The electricity cost is $0.07/kWh.

Maintenance. This includes the costs of repairing faulty equipment, determined by its mean time to failure (MTTF), and the salaries of the personnel.
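A minimal roll-up of the four expense categories might look like the following. The depreciation schedules (15, 3, and 4 years) and the $0.07/kWh electricity rate come from the text; the function shape and all inputs are our illustrative assumptions, not the EETCO model itself.

```python
# Hedged sketch of a monthly TCO roll-up over the four expense categories
# above. Depreciation schedules and the electricity rate come from the
# text; everything else is an illustrative assumption.

KWH_PRICE = 0.07  # $/kWh, from the text

def monthly_tco(infrastructure_cost, server_cost, network_cost,
                avg_power_kw, maintenance_per_month):
    infrastructure = infrastructure_cost / (15 * 12)  # 15-year depreciation
    servers = server_cost / (3 * 12)                  # 3-year amortization
    network = network_cost / (4 * 12)                 # 4-year amortization
    energy = avg_power_kw * 24 * 30 * KWH_PRICE       # one 30-day month
    return infrastructure + servers + network + energy + maintenance_per_month
```

The key property this captures is that capital costs are spread over each component's lifetime, so a component's monthly weight depends on both its price and its amortization schedule.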

Server processors

We evaluate a number of data-center server processor designs, as summarized in Table 1. We use publicly available data from the open web and Microprocessor Report (http://www.mpronline.com/index.php) to estimate core and chip area, power, and cost. We supplement this information with Cacti 6 for cache area and power profiles, and we use measurements of commercial server processors' die micrographs to estimate the area of on-chip memory and I/O interfaces. We detail many aspects of the methodology in our International Symposium on Computer Architecture (ISCA) 2012 paper.[12]

Baseline parameters. To make the analysis tractable and reduce variability due to differences in the ISA, we model ARM-based cores for all but the conventional processor design. We based both the tiled and SOP in-order configurations on the Cortex-A8, a dual-issue in-order core clocked at 1.5 GHz. The small-chip design uses a more aggressive dual-issue out-of-order core, similar to an ARM Cortex-A9, which is representative of existing products in the small-chip processor design space. For the conventional design, we model a custom four-wide large-window core running at 3 GHz.

At the chip level, we model the target processors' associated cache configurations, interconnect topologies, and memory interfaces. Our simulations reflect important runtime artifacts of these structures, such as interconnect delays, bank contention, and off-die bandwidth limitations.

Effect of higher-complexity cores. Although the low-complexity out-of-order and in-order cores are attractive from an area- and energy-efficiency perspective, their lower performance could be unacceptable for applications that demand fast response times and have nontrivial computational components (for example, web search[8]). To study the effect of higher-performance cores on data-center efficiency, we also evaluate an SOP organization based on an ARM Cortex-A15, a triple-issue core clocked at 2 GHz. Compared to the baseline dual-issue in-order core, a three-way out-of-order core delivers 81 percent higher single-threaded performance, on average, across our workload suite, while requiring 3.5× the area and more than 3× the power per core.
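The trade-off quoted above can be checked with simple arithmetic; the three ratios are from the text, and the conclusion follows directly from the efficiency definitions used earlier.

```python
# Back-of-the-envelope check of the core trade-off quoted above: 81 percent
# higher single-threaded performance at 3.5x the area and (at least) 3x the
# power. The ratios are from the text.

perf_ratio, area_ratio, power_ratio = 1.81, 3.5, 3.0

density_ratio = perf_ratio / area_ratio      # relative performance density
efficiency_ratio = perf_ratio / power_ratio  # relative energy efficiency

# Roughly 0.52x the performance density and at most 0.60x the energy
# efficiency: the faster core trades chip-level efficiency for latency.
```

This is why the out-of-order SOP configuration in Table 1 integrates only 16 cores in the same 132-mm2 area that holds 48 in-order cores.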


Table 2. Total cost of ownership (TCO) parameters.

Parameter                                                | Value
Rack dimensions (42U): width × depth × inter-rack space  | 0.6 m × 1.2 m × 1.2 m
Infrastructure cost                                      | $3,000/m2
Cooling and power provisioning equipment cost            | $12.5/W
Cooling and power provisioning equipment space overhead  | 20%
SPUE (server power usage effectiveness) factor           | 1.3
PUE (data-center power usage effectiveness) factor       | 1.3
Personnel cost                                           | $200 per rack/month
Networking gear                                          | 360 W, $10,000 per rack
Motherboard                                              | 25 W, $330 per rack unit
Disk                                                     | 10 W, $180, 100-year mean time to failure (MTTF)
DRAM                                                     | 1 W, $25, 800-year MTTF per Gbyte
Processor                                                | 30-year MTTF


Effect of chip size. The results in Figure 1 suggest that small-core designs on a larger chip improve performance density and energy efficiency as compared to small-chip organizations. To better understand the effect of chip size on data-center efficiency, we extend the evaluation space of tiled and SOP processors with additional designs featuring twice the cores, cache, and memory interfaces. These "2×" designs approximately match the area of the Xeon-based conventional processor considered in this study.

Processor price estimation. We estimated the price for the conventional processor by picking the lowest price ($800) among online vendors for the target Xeon 5670 processor. Prices for the tiled ($300) and small-chip ($95) designs are sourced from the November 2011 Microprocessor Report and correspond to the Tilera Tile-Gx3036 and Calxeda ECX-1000, respectively.

To compute the cost of the various SOP and 2× tiled designs, we used the Cadence InCyte Chip Estimation tool. We estimated the production volume of the Tilera Gx-3036 processor to be 200,000 units, given a selling price of $300 and a 50 percent margin. We used this production volume to estimate each processor type's selling price, considering nonrecurring engineering (NRE) costs, mask and production costs, yield, other expense categories, and a 50 percent profit margin. Although we used these estimates for the majority of the studies, we also considered the sensitivity of different designs to processor price.

Workloads and simulation infrastructure

We took our workloads, which include Data Serving, MapReduce, SAT Solver, Web FrontEnd, and Web Search, from CloudSuite. For the Web FrontEnd workload, we used the e-banking option from SPECweb2009 in place of its open-source counterpart from CloudSuite, because SPECweb2009 exhibits better performance scalability at high core counts. Functionally, all these applications have similar characteristics: they operate on huge data sets that are split across a large number of nodes into memory-resident shards, the nodes

service a large number of completely independent requests that don't share state, and the internode connectivity is used only for high-level task management and coordination.[11] Two of the workloads, SAT Solver and MapReduce, are batch, whereas the rest are latency sensitive and are tuned to meet the response-time objectives.

We estimated the various processor designs' performance using Flexus full-system simulation.[14] Our performance metric is the product of UIPC (the ratio of committed user instructions to the sum of both user and system cycles) and processor frequency. UIPC is a more accurate performance metric in full-system evaluation than total IPC because of the contribution of I/O and spin locks in the operating system to the execution time.[14]

Because of space constraints, we present only aggregate results across all workloads. We averaged performance using a harmonic mean.
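The metric and aggregation just described amount to the following sketch; the variable names are ours, and the sample values in the test are illustrative.

```python
# Sketch of the performance metric described above: UIPC (committed user
# instructions over the sum of user and system cycles) times frequency,
# aggregated across workloads with a harmonic mean.

from statistics import harmonic_mean

def uipc(user_instructions, user_cycles, system_cycles):
    """User instructions per cycle, counting both user and system cycles."""
    return user_instructions / (user_cycles + system_cycles)

def performance(user_instructions, user_cycles, system_cycles, freq_ghz):
    """Per-processor performance: UIPC times clock frequency."""
    return uipc(user_instructions, user_cycles, system_cycles) * freq_ghz

def aggregate_performance(per_workload):
    """Harmonic mean across workloads, as used for the aggregate results."""
    return harmonic_mean(per_workload)
```

Counting system cycles in the denominator is what makes UIPC robust to time spent in the OS (I/O, spin locks) that commits few user instructions.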

Experimental setup

For all experiments, we assume a fixed data-center power budget of 20 MW and a power limit of 17 kW per rack. We also evaluated lower-density racks rated at 6.6 kW, but we found the trends to be similar across the two rack configurations; for space considerations, we therefore present only one set of results.

To compare the different server architectures' performance and TCO, we start with a rack power budget and subtract all power costs at both the rack and board level, excluding the processors. The per-rack costs include network gear, cooling (fans), and power conversion. At the 1U server level, we account for the power of the motherboard, two disks, and memory (a model parameter). The remaining power budget is divided by each evaluated processor chip's peak power to determine the number of processors per server. We then estimate data-center performance as the product of per-processor performance (using the data collected in simulation), the number of processors in each 1U server, the number of servers in a rack, and the number of racks in the data center.
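The sizing procedure above can be sketched as follows. Only the procedure (subtract fixed power overheads, divide the remainder by chip peak power, then multiply performance up through the hierarchy) follows the text; all numeric inputs are illustrative assumptions.

```python
# Sketch of the rack-level sizing described above: subtract fixed power
# overheads at the rack and server level, then divide the remainder by a
# chip's peak power to get processors per server. Inputs are illustrative.

def processors_per_server(rack_power_w, rack_overhead_w, servers_per_rack,
                          server_overhead_w, chip_peak_power_w):
    server_budget_w = (rack_power_w - rack_overhead_w) / servers_per_rack
    remaining_w = server_budget_w - server_overhead_w
    return int(remaining_w // chip_peak_power_w)

def datacenter_performance(per_chip_perf, chips_per_server,
                           servers_per_rack, racks):
    # Product of per-chip performance and the machine counts at each level.
    return per_chip_perf * chips_per_server * servers_per_rack * racks
```

Because processor count is derived from the residual power budget, a lower-power chip can yield more sockets per server, which is how chip-level efficiency compounds at the data-center level.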

Finally, we make no assumptions aboutwhat the optimal amount of memory per


server is, which in practice varies for different workloads. Instead, we model three server configurations, with 32, 64, and 128 Gbytes of memory per 1U. One simplifying assumption we make is that the amount of memory per 1U is independent of the chip design. Underlying this assumption are the observations that the data is predominantly read-only and is partitioned for high parallelism, allowing performance to scale with more cores and sockets until bottlenecked by the bandwidth of the memory interfaces. Our studies account for bandwidth limitations.

Evaluation

We examined the various processor designs' performance and TCO, relative efficiency, and sensitivity to processor price.

Performance and TCOWe first compare data-center perfor-

mance and TCO for various processordesigns, assuming 64 Gbytes of memoryper 1U server. Figure 3 presents the results.

In general, we observe significant disparity in data-center performance across the processor range, stemming from the different capabilities and energy profiles of the various processor architectures. Highly integrated processors based on small cores, namely tiled and SOP, deliver the highest performance at the data-center level. The tiled small-core, big-chip architecture improves aggregate performance by a factor of 3.6 over conventional and 1.6 over small-chip designs. The tiled design is superior thanks to a combination of efficient core microarchitectures and high chip-level integration, attributes that help amortize the power of both chip- and node-level resources among many cores, affording more power for execution resources.

The highest performance is delivered by the SOP with in-order cores, a small-core, big-chip design with the highest performance density, which improves data-center performance by an additional 10 percent over the tiled processor. The SOP design effectively translates its performance-density advantage into a performance advantage by better amortizing fixed power overheads among its many cores, ultimately affording more power for the execution resources at the rack level.
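The amortization argument can be made concrete with a toy model. All of the numbers below are invented for illustration; the model simply shows that, for a fixed chip power budget, shrinking the fixed (uncore) overhead leaves more power for cores and therefore more throughput.

```python
# Toy model of fixed-power amortization (all numbers hypothetical).

def chip_throughput(chip_power_w: float, fixed_overhead_w: float,
                    core_power_w: float, perf_per_core: float) -> float:
    """Number of cores that fit in the post-overhead power budget,
    multiplied by per-core performance."""
    cores = int((chip_power_w - fixed_overhead_w) // core_power_w)
    return cores * perf_per_core

# Same 100 W budget: a design that halves the fixed overhead fits
# more cores and delivers proportionally more chip-level throughput.
print(chip_throughput(100.0, 40.0, 2.0, 1.0))  # 30 cores' worth
print(chip_throughput(100.0, 20.0, 2.0, 1.0))  # 40 cores' worth
```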

The SOP design based on out-of-order cores sacrifices 39 percent of the throughput at the data-center level compared to the in-order design. However, higher core complexity might be justified for workloads that demand tight latency guarantees and have a nontrivial computational component. Even with higher-complexity cores, the SOP architecture attains better data-center performance than either the conventional or small-chip alternatives.

Figure 3. Data-center performance and TCO for various server processors, normalized to a design based on the conventional processor: (a) data-center performance (higher is better); (b) data-center TCO (lower is better). Differences in processors' organizations result in large performance differences at the data-center level. Meanwhile, processor choice has a modest effect on data-center TCO, as other costs dominate.

The differences in TCO among the different designs aren't as pronounced as the differences in performance, because processors contribute only a fraction (19 to 39 percent) of the overall data-center acquisition and power budget. Nevertheless, one important trend worth highlighting is that although small-chip designs are significantly less expensive and more energy efficient (by factors of 8 and 2.2, respectively) than conventional processors on a per-unit basis, a small-chip design has a 25 percent higher TCO at the data-center level. The reason for this apparent paradox is that the small-chip design's limited computing capabilities necessitate as many as 32 sockets per 1U server (versus 2 for conventional) to saturate the available power budget. The acquisition costs of such a large number of chips negate the differences in unit price and energy efficiency, emphasizing the need to consider TCO in assessing data-center efficiency.
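The acquisition-cost paradox is easy to reproduce with placeholder prices. The 8x unit-price ratio and the 32-versus-2 socket counts are from the study; the dollar amounts below are invented for illustration.

```python
# Illustration of the small-chip acquisition-cost paradox.
# The 8x price ratio and socket counts come from the article;
# the absolute dollar figures are hypothetical.

CONVENTIONAL_PRICE = 800.0                 # hypothetical unit price
SMALL_CHIP_PRICE = CONVENTIONAL_PRICE / 8  # 8x cheaper per unit

conventional_sockets = 2    # sockets per 1U server (article)
small_chip_sockets = 32     # sockets per 1U server (article)

conv_cost = conventional_sockets * CONVENTIONAL_PRICE   # processor cost/server
small_cost = small_chip_sockets * SMALL_CHIP_PRICE

# Despite the far cheaper unit, the small-chip server spends
# twice as much on processors in this example.
print(small_cost / conv_cost)
```

Even though each small chip costs an eighth as much, sixteen times as many sockets are needed, so per-server processor spend goes up rather than down.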

Relative efficiency
We next examine the combined effects of performance, energy efficiency, and TCO by assessing the various designs on data-center performance/TCO and performance/watt. Figure 4 presents the results as memory capacity varies from 32 to 128 Gbytes per 1U server.

With 64 Gbytes of memory per 1U server, we observe the following trends:

• The small-chip design improves performance/watt by 2.2× over the conventional processor, but its performance/TCO advantage is just 1.8× due to the high processor acquisition costs.

• A data center based on the tiled design improves performance per TCO by a factor of 3.4 over conventional and 1.9 over small-chip designs. Energy efficiency is improved by 3.6× and 1.6×, respectively, underscoring the combined benefits of aggressive integration and an efficient core microarchitecture.

• SOP designs with an in-order core further improve performance/TCO by 14 percent and performance/watt by 10 percent over tiled processors through a more efficient use of chip real estate.

• Tiled and SOP designs with twice the resources (Tiled-2× and SOP-2×) improve TCO by 11 to 15 percent over their baseline counterparts by reducing the number of processor chips, thereby lowering acquisition costs.

• The SOP design with out-of-order cores achieves 40 percent lower performance/TCO than the design based on in-order cores. The more aggressive out-of-order microarchitecture is responsible for each core's lower energy and area efficiency, resulting in lower throughput at the chip level. When the TCO premium is justified, which might be the case for profit-generating, latency-sensitive applications (such as web search), the out-of-order SOP design offers a 2.3× performance/TCO advantage (2.4× in performance/watt) over the conventional server processor.

Figure 4. Data-center efficiency for different server processors: (a) performance/TCO; (b) performance/watt. Data is not normalized. Small-core, big-chip designs (that is, scale-out and tiled) are the most efficient.

Although the earlier discussion focuses on servers with 64 Gbytes of memory, the trends are similar with other memory configurations. In general, servers with more memory lower the performance-to-TCO ratio, as memory adds to the server cost while diminishing the processor power budget. The opposite is true for servers with less memory, in which the choice of processor has a greater effect on both cost and performance.

The key result of our study, that highly integrated server processors are beneficial from a TCO perspective, is philosophically similar to the observations made by Karidis et al., who noted that high-capacity servers are effective in lowering the cost of computation.15

Sensitivity to processor price
Figure 5 shows the effect of varying the processor price on the relative efficiency (performance/TCO) of the different designs, assuming 64 Gbytes of memory per 1U server. For each processor type, except conventional, we assume an application-specific integrated circuit (ASIC) design in 40-nm technology, and we compute the price as a function of market size, ranging from 40K to 2M units. For the conventional design, we linearly sweep the price from $800 (known market price) down to $200, with the latter being the lowest considered price among area-equal Tiled-2× and SOP-2× designs.

In general, we observe that the price of larger chips has less impact on the data-center TCO compared to that of smaller chips, because it takes few large chips to populate a server due to power constraints. In contrast, the small-chip design is highly sensitive to unit price, owing to the sheer volume of chips required per 1U server. For instance, the respective numbers of chips per server in conventional versus small-chip designs differ by a factor of 16.

A consistent trend in our study is that, from a TCO perspective, bigger chips are preferable to smaller ones, as seen in the curves for the various tiled and SOP designs. Although the larger chip area of the "2×" designs adds expense, the price difference is modest (around 16 percent, or $50 per chip), because NRE and design costs dominate production costs. Furthermore, the increased cost is offset by the reduction in the number of required chips.

Figure 5. Relationship between the price per processor and the TCO. Solid circles indicate known prices; unfilled circles show estimated prices based on a production volume of 200K units.

Scale-Out Processors extend the advantages of emerging small-core, big-chip architectures, providing good energy efficiency via simple core microarchitectures and maximizing the TCO investment by fully utilizing server hardware via abundant execution resources. In the near term, SOPs will be able to deliver performance and TCO gains in a nondisruptive manner by fully leveraging existing software stacks. Further out, demands for greater performance and energy efficiency could necessitate even higher degrees of specialization, requiring a disruptive change to the programming model. Our ongoing research targets both near- and long-term data-center efficiency challenges through integration, specialization, and approximation: the new "ISA" of the 21st century.

Acknowledgments
We thank Sachin Idgunji and Emre Ozer, both affiliated with ARM during the work described here, for their contributions to the article. We thank the EuroCloud project partners for inspiring the Scale-Out Processors. This work was partially supported by EuroCloud, project 247779 of the European Commission 7th RTD Framework Programme, Information and Communication Technologies: Computing Systems.

References
1. U. Hölzle and L.A. Barroso, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan and Claypool Publishers, 2009.
2. SAVVIS, "SAVVIS Sells Assets Related to Two Data Centers for $200 Million," June 2007.
3. J. Hamilton, "Overall Data Center Costs," blog, 18 Sept. 2010; http://perspectives.mvdirona.com/2010/09/18/OverallDataCenterCosts.aspx.
4. T.P. Morgan, "Facebook's Open Hardware: Does It Compute?" The Register, 8 Apr. 2011; http://www.theregister.co.uk/2011/04/08/open_compute_server_comment.
5. M. Berezecki et al., "Many-Core Key-Value Store," Proc. Int'l Green Computing Conf. and Workshops, IEEE CS, 2011, doi:10.1109/IGCC.2011.6008565.
6. Calxeda, "Calxeda EnergyCore ECX-1000 Series," May 2012; http://www.calxeda.com/wp-content/uploads/2012/06/ECX1000-Product-Brief-612.pdf.
7. Tilera, "TILE-Gx 3036 Specifications," 2011; http://www.tilera.com/sites/default/files/productbriefs/Tile-Gx%203036%20SB012-01.pdf.
8. V.J. Reddi et al., "Web Search Using Mobile Cores: Quantifying and Mitigating the Price of Efficiency," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM, 2010, pp. 314-325.
9. K. Lim et al., "Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments," Proc. 35th Ann. Int'l Symp. Computer Architecture (ISCA 08), IEEE CS, 2008, pp. 315-326.
10. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, July/Aug. 2011, pp. 6-15.
11. M. Ferdman et al., "Clearing the Clouds: A Study of Emerging Scale-Out Workloads on Modern Hardware," Proc. 17th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 12), ACM, 2012, pp. 37-48.
12. P. Lotfi-Kamran et al., "Scale-Out Processors," Proc. 39th Ann. Int'l Symp. Computer Architecture (ISCA 12), IEEE CS, 2012, pp. 500-511.
13. D. Hardy et al., "EETCO: A Tool to Estimate and Explore the Implications of Datacenter Design Choices on the TCO and the Environmental Impact," Proc. 1st Workshop Energy-Efficient Computing for a Sustainable World, 2011; http://eurocloudserver.com/wp-content/uploads/EESC2011.pdf.
14. T. Wenisch et al., "SimFlex: Statistical Sampling of Computer System Simulation," IEEE Micro, July/Aug. 2006, pp. 18-31.
15. J. Karidis, J.E. Moreira, and J. Moreno, "True Value: Assessing and Optimizing the Cost of Computing at the Data Center Level," Proc. 6th ACM Conf. Computing Frontiers, ACM, 2009, pp. 185-192.

Boris Grot is a postdoctoral researcher at EPFL (École Polytechnique Fédérale de Lausanne). His research interests include processor architectures, memory systems, and interconnection networks for high-throughput, energy-aware computing. Grot has a PhD in computer science from the University of Texas at Austin. He is a member of IEEE and the ACM.

Damien Hardy is a postdoctoral researcher at the University of Cyprus. His research interests include computer architecture, reliability, embedded and real-time systems, and data-center modeling. Hardy has a PhD in computer science from the University of Rennes, France.


Pejman Lotfi-Kamran is a fifth-year PhD candidate at EPFL. His research interests include processor architecture and interconnection networks for high-throughput and energy-efficient data centers. Lotfi-Kamran has an MSc in computer architecture from the University of Tehran. He is a member of IEEE and the ACM.

Babak Falsafi is a professor of computer and communication sciences at EPFL, and the founding director of EcoCloud, an interdisciplinary research center targeting robust, economic, and environmentally friendly cloud technologies. Falsafi has a PhD in computer science from the University of Wisconsin–Madison. He is a fellow of IEEE and a senior member of the ACM.

Chrysostomos Nicopoulos is a lecturer in the Department of Electrical and Computer Engineering at the University of Cyprus. His research interests include networks on chip (NoCs), computer architecture, multi- and many-core microprocessor and system design, and architectural support for massively parallel computing. Nicopoulos has a PhD in electrical engineering from Pennsylvania State University. He is a member of IEEE and the ACM.

Yiannakis Sazeides is an associate professor in the Department of Computer Science at the University of Cyprus. His research focuses on computer architectures, particularly reliability, data-center modeling, memory hierarchy, temperature, and analysis of dynamic program behavior. Sazeides has a PhD in electrical engineering from the University of Wisconsin–Madison. He is a member of IEEE.

Direct questions and comments about this article to Boris Grot, EPFL IC ISIM PARSA, INJ 238 (Bâtiment INJ), Station 14, CH-1015, Lausanne, Switzerland; [email protected].
