deview 2013 rise of the wimpy machines - john mao
Rise of the (Wimpy) Machines: Datacenter Efficiency with ARM-based Servers
John Mao, Director of Strategy, Calxeda
What is the name of the computer system in this movie that tried to end the human-race?
Skynet
Origins of Wimpy Core Computing
• FAWN: A Fast Array of Wimpy Nodes
– Project from CMU led by Prof. David Andersen, started in 2008 (active through 2012)
– Measure and compare the performance-per-Joule advantages over traditional servers
– Original focus on large distributed key-value store applications and use-cases (e.g. Amazon Dynamo, LinkedIn’s Voldemort, Facebook’s memcached)
[Publication] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
[Website] http://www.cs.cmu.edu/~fawnproj/
FAWN: A Fast Array of Wimpy Nodes
• Why FAWN? Motivated by key trends:
– Increasing CPU-I/O gap
– CPU power consumption grows super-linearly with speed
– Dynamic power scaling on traditional systems is surprisingly inefficient
FAWN: A Fast Array of Wimpy Nodes
[Photo: FAWN hardware generations, 1G through 5G. Credit: http://www.cs.cmu.edu/~fawnproj/]
FAWN: A Fast Array of Wimpy Nodes
• Multiple generations of hardware used:
– 1G (2008)
  • Single-core 500MHz AMD Geode LX processor
  • 256MB DDR SDRAM (400MHz)
  • 100Mbps Ethernet
– 5G (2012)
  • Intel Atom D510 – 1.66GHz dual-core w/HT
  • 2-4GB DDR2 (667MHz)
  • 100Mbps Ethernet
Key Findings from FAWN Project
“The FAWN cluster achieves 364 queries per Joule — two orders of magnitude better than traditional disk-based clusters.”
[Source] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
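The queries-per-Joule metric is throughput divided by power draw, because one Watt is one Joule per second. A minimal sketch of that calculation follows; the function name is hypothetical and the sample inputs are illustrative assumptions, not measurements from the FAWN paper.

```python
def queries_per_joule(queries_per_second: float, power_watts: float) -> float:
    """Return queries served per Joule of energy consumed.

    1 Watt = 1 Joule/second, so (queries/s) / (Joules/s) = queries/Joule.
    """
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return queries_per_second / power_watts


if __name__ == "__main__":
    # Illustrative inputs only (assumed values, not taken from the paper).
    print(queries_per_joule(3640.0, 10.0))
```

Comparing two systems on this metric only needs each system's sustained query rate and its power measured at the wall over the same interval.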
So what about ARM®?
ARM is a good “wimpy” processor & CPU architecture for the datacenter because:
1. Focus on low power: origins in embedded systems and mobile devices
2. Datacenter focused roadmap: 32-bit CPUs today, 64-bit CPUs in 1-2 years; increasing performance (with same energy efficiency)
3. Business model: ability to integrate for specific markets and applications
4. Emerging software ecosystem: while not x86, ARM has a growing ecosystem
Focus on Low Power
• History in targeting energy-sensitive markets:
– Netbooks, Smartbooks, Tablets, Thin Clients
– Smartphones, Feature phones
– Set-top Box, Digital TV, Blu-Ray players, Gaming consoles
– Automotive Infotainment, Navigation
– Wireless base-stations, VoIP phones and equipment
• Design Goals
– Performance, Power, Easy Synthesis
Focus on Low Power
In 2005, about 98% of all mobile phones sold used at least one ARM processor.
As of 2009, due to low power consumption the ARM architecture is the most widely used 32-bit RISC architecture in mobile devices and embedded systems.
[Source] http://en.wikipedia.org/wiki/ARM_architecture
Focus on Low Power
Translating ARM energy-efficiency into the modern datacenter with Cortex-A9:

Workload (on 24 nodes & SSDs)    Total System* Power (Today!)    ~Power per ECX-1000 Node (with disk @Wall)
Linux at Rest                    130 W                           5.4 W
phpbench                         155 W                           6.5 W
Coremark (4 threads per SOC)     169 W                           7.0 W
Website @ 70% Utilization        172 W                           7.2 W
LINPACK                          191 W                           7.9 W
STREAM                           205 W                           8.5 W

*All measurements done on a 24-node system @1.1GHz, with 24 SSDs and 96 GB DRAM in the Calxeda Lab.
For specific workloads, ECX-1000 can enable a complete 24-node cluster at a similar power level as a 2-socket x86 server.
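The per-node figures above are the total system power divided across the 24 nodes. A small sketch that reproduces the arithmetic, using only the totals from the slide (the dictionary and function names are illustrative):

```python
NODES = 24  # node count stated in the slide's footnote

# Total system power in Watts, copied from the figures above.
TOTAL_SYSTEM_WATTS = {
    "Linux at Rest": 130,
    "phpbench": 155,
    "Coremark (4 threads per SOC)": 169,
    "Website @ 70% Utilization": 172,
    "LINPACK": 191,
    "STREAM": 205,
}


def per_node_watts(total_watts: float, nodes: int = NODES) -> float:
    """Average wall power per node, including each node's share of disks."""
    return total_watts / nodes


if __name__ == "__main__":
    for workload, total in TOTAL_SYSTEM_WATTS.items():
        print(f"{workload:<30} {total:>4} W  {per_node_watts(total):.2f} W/node")
```

The computed values agree with the slide's per-node column to within rounding, which is consistent with the "~" in that column's header.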
But, what about performance?
Online Review: Calxeda’s ARM Server Tested
Anandtech chartered review comparing Boston Viridis’ 24-node Calxeda ECX-1000 (Cortex-A9) cluster against an Intel Xeon E5-2650L system.
(March 2013)
http://www.anandtech.com/show/6757/calxedas-arm-server-tested
Calxeda Provides Better Web Throughput
Boston Viridis outperforms Xeon E5-2650L by 30% with more than 15 users.
Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
Calxeda Provides Lower Response Times
Boston Viridis outperforms Xeon E5-2650L by 60% with more than 15 users.
Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
Calxeda Provides Highest Performance/Watt
Boston Viridis provides 80% more throughput per Watt than Xeon E5.
• 10-36% less raw power
Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
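Throughput per Watt combines the two earlier results: a throughput gain divided by the remaining fraction of power. The sketch below shows how the slide's 30% throughput gain and 10-36% power saving bracket the quoted 80% figure. It assumes the gains can be combined at a single load point, which the slides do not state, and the function name is hypothetical.

```python
def relative_perf_per_watt(throughput_gain: float, power_saving: float) -> float:
    """Ratio of throughput-per-Watt versus a baseline system.

    throughput_gain: fractional gain in throughput (0.30 means 30% more).
    power_saving:    fractional reduction in power (0.10 means 10% less).
    """
    if not 0 <= power_saving < 1:
        raise ValueError("power_saving must be in [0, 1)")
    return (1 + throughput_gain) / (1 - power_saving)


if __name__ == "__main__":
    # Figures from the slides: 30% more throughput, 10% to 36% less power.
    low = relative_perf_per_watt(0.30, 0.10)
    high = relative_perf_per_watt(0.30, 0.36)
    print(f"perf/Watt advantage between {low:.2f}x and {high:.2f}x")
```

Under these assumptions the advantage ranges from roughly 1.4x to 2.0x, so an 80% improvement sits inside the range implied by the other two charts.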
Online Review: Calxeda’s ARM Server Tested
Reviewer’s Key Takeaways:
– For scale-out workloads, Calxeda’s ARM-based scale-out hardware architecture is very promising.
– Microbenchmarks show Calxeda ECX-1000 ~10% behind Intel Atom N2800 @1.86 GHz
– “Real World” Application Benchmarking shows 70%+ higher performance-per-watt than Intel Xeon E5 at mid to high user load
– “Calxeda really did it: each server needs about 8.3W (200W/24), measured at the wall…about 6W (at 1.4GHz) per server node…”
– “So on the one hand, no, the current Calxeda servers are no Intel Xeon killers (yet). However, we feel that Calxeda's ECX-1000 server node is revolutionary technology.”
ARM® Cortex-A15
• Based on ARMv7-A architecture
– Ensures software application compatibility with other Cortex-A processors
• LPAE support up to 1TB physical memory
• Full hardware virtualization support
• From ARM: delivers 2X performance over Cortex-A9 processor with similar power
• big.LITTLE configuration support for mobile devices
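The 1TB figure follows from LPAE widening physical addresses from 32 bits to 40 bits. A quick check of that arithmetic (the helper name is illustrative):

```python
GIB = 1 << 30  # bytes in one gibibyte
TIB = 1 << 40  # bytes in one tebibyte


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given physical address width."""
    return 1 << address_bits


if __name__ == "__main__":
    print(addressable_bytes(32) // GIB, "GiB without LPAE (32-bit)")
    print(addressable_bytes(40) // TIB, "TiB with LPAE (40-bit)")
```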
Datacenter Focused Roadmap
[Roadmap chart, timeline axis: 2013, 2014, 2015]
• Highbank: ECX-1000 (4 Core, ARM® Cortex A9) – Power Efficient Solution for Storage and Web Hosting
• Midway: ECX-2000 (4 Core, ARM® Cortex A15) – Performance/$ for Cloud and Analytics
• Sarita (ARM® Cortex A57) – Compatible 64-bit On-Ramp for Early Access and Ecosystem Enablement
• Lago (ARM® Cortex A57) – Flagship 64-bit Product for a Broader Application Set
“Triple Play”: 3 Generations of Pin-Compatible SOCs
3rd Generation Calxeda Fabric and I/O
[Source] Calxeda public SOC roadmap (June 2013)
“Midway”: Calxeda ECX-2000
Compared to Calxeda’s Cortex-A9 SOC (ECX-1000), the “Midway” SOC delivers:
– 1.5X more single-thread performance
– 2X more floating point performance
– 3X STREAM (memory b/w) performance
– 4X+ more physical memory support (16GB+)
– Same performance-per-Watt
Plan to update Anandtech benchmark report
But, ARM doesn’t make/sell SOCs?
ARM® Business Model
• ARM does not make or sell SOCs.
• Instead, ARM licenses IP and technology to partners (like Calxeda) who design and build System-on-Chips (SOCs) for various industries and markets.
• Calxeda is focused exclusively on bringing ARM-based technology to the datacenter.
– Calxeda provides its own IP (e.g. Fabric) as additional value for servers.
EnergyCore® architecture at a glance
A complete building block for hyper-efficient computing
• EnergyCore Management Engine: Advanced system, power and fabric management for energy-proportional computing
• I/O Controllers: Standard drivers, standard interfaces. No surprises.
• Processor Complex: Multi-core ARM® processors integrated with high bandwidth memory controllers
• EnergyCore Fabric Switch: Integrated high-performance fabric provides inter-node connectivity with industry standard networking
EnergyCore® Fabric (F1/F2)
Integrated 80Gb (8x10Gb cross-bar) Fabric Switch:
• Up to 5 external links:
– Dynamic bandwidth: 1Gb to 10Gb per link
– < 200 nanoseconds latency, node to node
• 3 internal links (to the SOC):
– 2x 10Gb Ethernet ports to the OS
– 1x 10Gb Ethernet port to Mgmt
– Transparent to OS and software
• Topology agnostic
→ Eliminates Top-of-Rack-Switch ports & cabling
→ Enables extreme density; lowers cost and power
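One way to read the "8x10Gb" figure is as the 5 external links plus the 3 internal links, each running at up to 10Gb. The sketch below models that port budget. The class and field names are hypothetical, and the mapping of the 8 crossbar ports onto external and internal links is an interpretation of the slide, not a documented Calxeda specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSwitch:
    """Port budget of one per-node fabric switch (assumed model)."""

    external_links: int = 5   # links to neighbouring nodes
    internal_links: int = 3   # 2x OS Ethernet ports + 1x management port
    max_link_gbps: int = 10   # each link scales from 1Gb up to 10Gb

    @property
    def total_ports(self) -> int:
        return self.external_links + self.internal_links

    @property
    def aggregate_gbps(self) -> int:
        """Peak crossbar bandwidth with every link at full rate."""
        return self.total_ports * self.max_link_gbps


if __name__ == "__main__":
    switch = FabricSwitch()
    print(switch.total_ports, "ports,", switch.aggregate_gbps, "Gb aggregate")
```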
So, what can we use this for?
Target Workloads
• Data-Intensive Applications:
– Storage (scale-out, distributed storage)
  • e.g. Ceph, Gluster, etc.
– Analytics (NoSQL, MapReduce, distributed databases)
  • e.g. Hadoop, Cassandra, etc.
• Distributed, State-less Applications
– Web Front End
– Caching Servers
– Content Distribution Networks (CDN)
Use-Case: Storage via Ceph
• Official Ceph “Dumpling”+ release now supports Calxeda-based platforms
• Initial benchmarks complete (with x86 comparison)
– Even without optimizations, performance is promising
• Identified optimization areas (under investigation):
– Potentially use NEON instructions for CRC32
– Implement zero-copy on OSDs
– Transition reads/writes to bufferlists
– Optimize client side too – librados/librbd
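To make the CRC32 optimization target concrete, here is a bit-at-a-time reference implementation of the checksum. It assumes the variant in question is CRC-32C (Castagnoli); the slide only says "CRC32". This is plain Python, not NEON code, and serves only as the scalar baseline whose output a vectorized version would have to match.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C over `data`.

    Pass a previous result as `crc` to continue a running checksum
    across several buffers.
    """
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(crc32c(b"123456789")))
```

The inner loop processes one bit per iteration, which is why table-driven or SIMD implementations that handle many bytes per step are an attractive optimization on storage nodes that checksum every object.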
Use-Case: Storage via Ceph
With the same number of HDDs, the Calxeda-based system delivers 50% more performance than traditional x86 servers.
The AAEON CRS-200S-2R Advantage
An ARM-based, lower cost, higher performance server platform for scale-out storage
Compared to traditional x86-based, 2U rack mount servers, the AAEON CRS-200S-2R server platform is:
✓ 35% Lower TCO*
✓ 66% Less Rack Space
✓ 50% Higher performance
Calxeda’s ARM-based SOCs:
• Energy Efficient
– More cores per HDD
– Lower system power
• High Bandwidth Fabric
– Multi-10Gb links for data-intensive apps
Summary
• Even 64-bit ARM processors are not ideal for every single workload.
• However, scale-out, data-intensive workloads can leverage ARM’s energy-efficiency to provide a significantly better TCO.
• For the server market (especially with scale-out apps), replacing the CPU core is not enough.
– Look for SOCs that optimize “between the nodes” in a cluster (e.g. fabric interconnects will help dramatically)
• Interested in joining the “ARM revolution”?
– Contact us!
– John Mao, [email protected]
Thank You!