deview 2013 rise of the wimpy machines - john mao

Rise of the (Wimpy) Machines Datacenter Efficiency with ARM-based Servers John Mao Director of Strategy, Calxeda

Upload: naver-d2

Post on 21-Jan-2015


Page 1: Deview 2013   rise of the wimpy machines - john mao

Rise of the (Wimpy) Machines Datacenter Efficiency with ARM-based Servers

John Mao, Director of Strategy, Calxeda


What is the name of the computer system in this movie that tried to end the human race?

Skynet


Origins of Wimpy Core Computing

•  FAWN: A Fast Array of Wimpy Nodes
  – Project from CMU led by Prof. David Andersen, started in 2008 (active through 2012)
  – Measure and compare performance per Joule of energy against traditional servers
  – Original focus on large distributed key-value store applications and use cases (e.g. Amazon Dynamo, LinkedIn’s Voldemort, Facebook’s memcached)

[Publication] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf

[Website] http://www.cs.cmu.edu/~fawnproj/


FAWN: A Fast Array of Wimpy Nodes

•  Why FAWN? Motivated by key trends:
  – Increasing CPU-I/O gap
  – CPU power consumption grows super-linearly with speed
  – Dynamic power scaling on traditional systems is surprisingly inefficient


FAWN: A Fast Array of Wimpy Nodes

[Photo: FAWN hardware generations 1G through 5G. Credit: http://www.cs.cmu.edu/~fawnproj/]


FAWN: A Fast Array of Wimpy Nodes

•  Multiple generations of hardware used:
  – 1G (2008)
    •  Single-core 500MHz AMD Geode LX processor
    •  256MB DDR SDRAM (400MHz)
    •  100Mbps Ethernet
  – 5G (2012)
    •  Intel Atom D510 – 1.66GHz dual-core w/HT
    •  2-4GB DDR2 (667MHz)
    •  100Mbps Ethernet


Key Findings from FAWN Project

“The FAWN cluster achieves 364 queries per Joule, two orders of magnitude better than traditional disk-based clusters.”

[Source] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
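
The queries-per-Joule metric quoted above can be read as throughput divided by power draw, since one watt is one joule per second. The sketch below is only an illustration of that arithmetic: the function name `queries_per_joule` and the sample throughput and wattage are hypothetical, chosen so the result lands on the quoted figure, and are not FAWN measurements.

```python
def queries_per_joule(queries_per_second: float, watts: float) -> float:
    """Return queries served per joule of energy consumed."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    # 1 W = 1 J/s, so (queries/s) / (J/s) = queries/J
    return queries_per_second / watts


# Hypothetical inputs, not taken from the FAWN paper.
example = queries_per_joule(queries_per_second=1820.0, watts=5.0)
print(f"{example:.0f} queries per joule")
```

Framing efficiency this way makes clusters of very different sizes comparable, because both throughput and power scale with node count.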

 


So what about ARM®?
ARM is a good “wimpy” processor & CPU architecture for the datacenter because:
1.  Focus on low power: origins in embedded systems and mobile devices
2.  Datacenter focused roadmap: 32-bit CPUs today, 64-bit CPUs in 1-2 years; increasing performance (with same energy efficiency)
3.  Business model: ability to integrate for specific markets and applications
4.  Emerging software ecosystem: while not x86, ARM has a growing ecosystem


Focus on Low Power
•  History in targeting energy-sensitive markets:
  – Netbooks, Smartbooks, Tablets, Thin Clients
  – Smartphones, Feature phones
  – Set-top Box, Digital TV, Blu-Ray players, Gaming consoles
  – Automotive Infotainment, Navigation
  – Wireless base-stations, VoIP phones and equipment
•  Design Goals
  – Performance, Power, Easy Synthesis


Focus on Low Power

In 2005, about 98% of all mobile phones sold used at least one ARM processor.

As of 2009, due to its low power consumption, the ARM architecture is the most widely used 32-bit RISC architecture in mobile devices and embedded systems.

[Source] http://en.wikipedia.org/wiki/ARM_architecture

 


Focus on Low Power

Translating ARM energy-efficiency into the modern datacenter with Cortex-A9:

Workload (on 24 nodes & SSDs)    Total System* Power (Today!)    ~Power per ECX-1000 Node (with disk @Wall)
Linux at Rest                    130 W                           5.4 W
phpbench                         155 W                           6.5 W
Coremark (4 threads per SOC)     169 W                           7.0 W
Website @ 70% Utilization        172 W                           7.2 W
LINPACK                          191 W                           7.9 W
STREAM                           205 W                           8.5 W

*All measurements done on a 24-node system @1.1GHz, with 24 SSDs and 96 GB DRAM in the Calxeda Lab.

For specific workloads, ECX-1000 can enable a complete 24-node cluster at a similar power level to a 2-socket x86 server.
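
The per-node column in the measurements above appears to be the total system power divided across the 24 nodes. The sketch below redoes that division with the slide's own numbers. The helper name `per_node_watts` is an illustrative assumption, and the slide's values seem to be rounded, so the check only requires agreement within 0.1 W.

```python
NODES = 24  # node count stated on the slide

# (workload, total system power in W, per-node power in W as printed on the slide)
MEASUREMENTS = [
    ("Linux at Rest", 130, 5.4),
    ("phpbench", 155, 6.5),
    ("Coremark (4 threads per SOC)", 169, 7.0),
    ("Website @ 70% Utilization", 172, 7.2),
    ("LINPACK", 191, 7.9),
    ("STREAM", 205, 8.5),
]


def per_node_watts(total_watts: float, nodes: int = NODES) -> float:
    """Spread total wall power evenly across all nodes."""
    return total_watts / nodes


for name, total, slide_value in MEASUREMENTS:
    computed = per_node_watts(total)
    print(f"{name:<30} {total:>4} W total  {computed:5.2f} W/node  (slide: {slide_value} W)")
```

An even split is a simplification: it attributes shared overhead such as fans and power-supply losses equally to every node.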


But, what about performance?


Online Review: Calxeda’s ARM Server Tested

Anandtech review comparing Boston Viridis’ 24-node Calxeda ECX-1000 (Cortex-A9) cluster against an Intel Xeon E5-2650L system.

(March 2013)

http://www.anandtech.com/show/6757/calxedas-arm-server-tested


Calxeda Provides Better Web Throughput
Boston Viridis outperforms Xeon E5-2650L by 30% with more than 15 users.
Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.


Calxeda Provides Lower Response Times
Boston Viridis outperforms Xeon E5-2650L by 60% with more than 15 users.
Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.


Calxeda Provides Highest Performance/Watt
Boston Viridis provides 80% more throughput per Watt than Xeon E5.
•  10-36% less raw power
Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
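
Throughput, power, and throughput per watt are tied together: the performance-per-watt ratio equals the throughput ratio divided by the power ratio. As a rough consistency check, the sketch below combines the 30% throughput advantage from the earlier slide with the 80% performance-per-watt advantage here. It assumes both figures were taken at the same load point, which the slides do not state, and the function name `implied_power_ratio` is illustrative.

```python
def implied_power_ratio(throughput_ratio: float, perf_per_watt_ratio: float) -> float:
    """Power ratio implied by a throughput ratio and a perf-per-watt ratio.

    perf_per_watt_ratio = throughput_ratio / power_ratio, rearranged.
    """
    if perf_per_watt_ratio <= 0:
        raise ValueError("perf-per-watt ratio must be positive")
    return throughput_ratio / perf_per_watt_ratio


# 1.30 = 30% more throughput, 1.80 = 80% more throughput per watt (from the slides).
ratio = implied_power_ratio(1.30, 1.80)
saving = 1.0 - ratio
print(f"implied power ratio: {ratio:.2f}x ({saving:.0%} less power)")
```

Under that assumption the implied saving falls inside the 10-36% range the slide quotes for raw power.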


Online Review: Calxeda’s ARM Server Tested

Reviewer’s Key Takeaways:
–  For scale-out workloads, Calxeda’s ARM-based scale-out hardware architecture is very promising.
–  Microbenchmarks show Calxeda ECX-1000 ~10% behind Intel Atom N2800 @1.86 GHz
–  “Real World” Application Benchmarking shows 70%+ higher performance-per-watt than Intel Xeon E5 at mid to high user load
–  “Calxeda really did it: each server needs about 8.3W (200W/24), measured at the wall…about 6W (at 1.4GHz) per server node…”
–  “So on the one hand, no, the current Calxeda servers are no Intel Xeon killers (yet). However, we feel that Calxeda's ECX-1000 server node is revolutionary technology.”


ARM® Cortex-A15

•  Based on ARMv7-A architecture
  – Ensures software application compatibility with other Cortex-A processors
•  LPAE support up to 1TB physical memory
•  Full hardware virtualization support
•  From ARM: delivers 2X performance over Cortex-A9 processor with similar power
•  big.LITTLE configuration support for mobile devices
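
The 1TB figure for LPAE follows from address width: the ARMv7-A Large Physical Address Extension widens physical addresses from 32 to 40 bits. The sketch below works through that arithmetic; the constant and function names are illustrative and do not come from any ARM tooling.

```python
CLASSIC_PHYS_ADDR_BITS = 32  # ARMv7-A without LPAE
LPAE_PHYS_ADDR_BITS = 40     # ARMv7-A with LPAE

GIB = 1 << 30
TIB = 1 << 40


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with a byte-addressed physical address of this width."""
    return 1 << address_bits


print(addressable_bytes(CLASSIC_PHYS_ADDR_BITS) // GIB, "GiB without LPAE")
print(addressable_bytes(LPAE_PHYS_ADDR_BITS) // TIB, "TiB with LPAE")
```

How much of that range a given SOC can actually populate depends on its memory controller, which is a separate limit from the architectural address width.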


Datacenter Focused Roadmap

Timeline: 2013, 2014, 2015

•  Highbank: ECX-1000 (4 Core, ARM® Cortex A9): Power Efficient Solution for Storage and Web Hosting
•  Midway: ECX-2000 (4 Core, ARM® Cortex A15): Performance/$ for Cloud and Analytics
•  Sarita (ARM® Cortex A57): Compatible 64-bit On-Ramp for Early Access and Ecosystem Enablement
•  Lago (ARM® Cortex A57): Flagship 64-bit Product for a Broader Application Set

“Triple Play”: 3 Generations of Pin-Compatible SOCs

3rd Generation Calxeda Fabric and I/O

[Source] Calxeda public SOC roadmap (June 2013)


“Midway”: Calxeda ECX-2000

Compared to Calxeda’s Cortex-A9 SOC (ECX-1000), the “Midway” SOC delivers:
– 1.5X more single-thread performance
– 2X more floating point performance
– 3X STREAM (memory b/w) performance
– 4X+ more physical memory support (16GB+)
– Same performance-per-Watt

Plan to update Anandtech benchmark report


But, ARM doesn’t make/sell SOCs?


ARM® Business Model

•  ARM does not make or sell SOCs.
•  Instead, ARM licenses IP and technology to partners (like Calxeda) who design and build System-on-Chips (SOCs) for various industries and markets.
•  Calxeda is focused exclusively on bringing ARM-based technology to the datacenter.
  – Calxeda provides its own IP (e.g. Fabric) as additional value for servers.


EnergyCore® architecture at a glance
A complete building block for hyper-efficient computing

•  EnergyCore Management Engine: Advanced system, power and fabric management for energy-proportional computing
•  I/O Controllers: Standard drivers, standard interfaces. No surprises.
•  Processor Complex: Multi-core ARM® processors integrated with high bandwidth memory controllers
•  EnergyCore Fabric Switch: Integrated high-performance fabric provides inter-node connectivity with industry standard networking


EnergyCore® Fabric (F1/F2)
Integrated 80Gb (8x10Gb cross-bar) Fabric Switch:
•  Up to 5 external links:
  –  Dynamic bandwidth: 1Gb to 10Gb per link
  –  < 200 nanoseconds latency, node to node
•  3 internal links (to the SOC):
  –  2x 10Gb Ethernet ports to the OS
  –  1x 10Gb Ethernet port to Mgmt
  –  Transparent to OS and software
•  Topology agnostic

→ Eliminates Top-of-Rack-Switch ports & cabling
→ Enables extreme density; lowers cost and power
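
The 80Gb aggregate figure is consistent with the link counts on this slide: 5 external plus 3 internal links gives 8 ports at up to 10Gb each. The slide does not say the figure is derived this way, so the sketch below is only a sanity check. `FabricSwitch` is a hypothetical model for illustration, not a Calxeda API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSwitch:
    """Toy model of the per-SOC fabric switch described on the slide."""

    external_links: int = 5   # links to other nodes
    internal_links: int = 3   # 2 to the OS, 1 to management
    max_link_gbps: int = 10   # top of the 1Gb to 10Gb dynamic range

    @property
    def ports(self) -> int:
        return self.external_links + self.internal_links

    @property
    def aggregate_gbps(self) -> int:
        return self.ports * self.max_link_gbps


switch = FabricSwitch()
print(f"{switch.ports} ports x {switch.max_link_gbps}Gb = {switch.aggregate_gbps}Gb")
```

Because external links can drop to 1Gb dynamically, the aggregate is an upper bound rather than a constant.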


So, what can we use this for?


Target Workloads
•  Data-Intensive Applications:
  – Storage (scale-out, distributed storage)
    •  e.g. Ceph, Gluster, etc.
  – Analytics (NoSQL, MapReduce, distributed databases)
    •  e.g. Hadoop, Cassandra, etc.
•  Distributed, State-less Applications
  – Web Front End
  – Caching Servers
  – Content Distribution Networks (CDN)


Use-Case: Storage via Ceph
•  Official Ceph “Dumpling”+ release now supports Calxeda-based platforms
•  Initial benchmarks complete (with x86 comparison)
  –  Even without optimizations, performance is promising
•  Identified optimization areas (under investigation):
  –  Potentially use NEON instructions for CRC32
  –  Implement zero-copy on OSDs
  –  Transition reads/writes to bufferlists
  –  Optimize client side too – librados/librbd
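
To show why CRC32 is a candidate for SIMD work, here is a scalar, bit-at-a-time CRC-32 that makes the per-byte inner loop explicit. This is not Ceph's code. The slide does not name the polynomial, so the standard zlib/IEEE 802.3 polynomial is used here as an assumption, and `crc32_bitwise` is an illustrative name.

```python
import zlib

# Bit-reflected IEEE 802.3 polynomial (assumed; the slide does not specify one).
CRC32_POLY_REFLECTED = 0xEDB88320


def crc32_bitwise(data: bytes, crc: int = 0) -> int:
    """Compute CRC-32 one bit at a time; slow, but shows the core loop."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            if crc & 1:
                crc = (crc >> 1) ^ CRC32_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


payload = b"123456789"
print(hex(crc32_bitwise(payload)), hex(zlib.crc32(payload)))
```

Eight dependent shift-and-XOR steps per byte is the cost that table-driven or vectorized implementations aim to cut.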


Use-Case: Storage via Ceph

With the same number of HDDs, the Calxeda-based system delivers 50% more performance than traditional x86 servers.


The AAEON CRS-200S-2R Advantage
An ARM-based, lower cost, higher performance server platform for scale-out storage

Compared to traditional x86-based, 2U rack mount servers, the AAEON CRS-200S-2R server platform is:

✓  35% Lower TCO*
✓  66% Less Rack Space
✓  50% Higher performance

Calxeda’s ARM-based SOCs:
•  Energy Efficient
  – More cores per HDD
  – Lower system power
•  High Bandwidth Fabric
  – Multi-10Gb links for data-intensive apps


Summary
•  Even 64-bit ARM processors are not ideal for every single workload.
•  However, scale-out, data-intensive workloads can leverage ARM’s energy-efficiency to provide a significantly better TCO.
•  For the server market (especially with scale-out apps), replacing the CPU core is not enough.
  –  Look for SOCs that optimize “between the nodes” in a cluster (e.g. fabric interconnects will help dramatically)
•  Interested in joining the “ARM revolution”?
  –  Contact us! – John Mao, [email protected]


Thank You!