session 9 patrickgreene
TRANSCRIPT
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
HP Gen8 technologies for low latency, high performance trading and exchanges
Patrick Greene Solution Architect – HP HPC on Wall Street 9/19/12
Experience matters
#1 in x86 server market share 16+ years straight – 65 consecutive quarters in both factory revenue and units
#1 in blade server market share 5 ¾ years straight – 23 consecutive quarters in both factory revenue and units
HP’s leadership in the datacenter has been built over years of innovation, experience and market leadership.
HP ProLiant
Source: IDC Worldwide Quarterly Server Tracker, August 2012. Includes Compaq ProLiant from Q196 through Q202
FSI-HPC™ Solutions for Capital Markets
• Ultra Low Latency Systems for High Frequency Trading
  • fastest Xeon™ performance
  • tuning White Paper and HP-TimeTest utility
  • HP/Mellanox™ TCP/UDP kernel bypass
• Low power choices for grid computing
• Open reference architecture for unstructured data
• Quality infrastructure for IT cost reduction
Low Latency Systems Require Optimization at every layer in the Solution Stack
Use Cases
Exchange Matching Engines
Market Data Distribution
High Frequency Algorithmic Trading
Pre/Post Trade Analytics
Real Time Enterprise Risk Management
Definitions:
• Solution - includes messaging middleware; in-house apps; design services
• System - integrated server/networking/storage infrastructure
• Components - specific servers/OS/switches/file system in the “system”
X86-64 Server Architecture
Firmware and Operating System
Integrated Acceleration
High Speed Storage
Low Latency FSI Solution Stack
Server I/O Fabric
Messaging Middleware
Application Environment
Fab. Mgmt
Precision Timing
Use Cases / Lines of Business
Optimized Form Factors to meet a variety of needs
• DL rack-mount servers for expandability
• All top bin E5-2600 processors offered with 3DPC
• DL380p option for 25 disks in 2U 2P Gen8
• BL systems with integrated networking
• Integrated chassis system for redundancy & TCO
• Gen8 NIC/Switch options leveraging PCIe Gen3
• SL multi-node systems for scale-out grids
• Optimized for performance, power and price at scale
• ML mini-tower for ultimate expandability
• ML350 model (rack mount or mini tower) for even more disk, 9 PCI slots!
HP Gen8 Servers (Sandy Bridge E5-2600)
Three top bin Processors circled
• 8c 3.1GHz in HP Z820 workstation (4U with racking kit; no iLO4)
• 8c 2.9GHz and 4c 3.3GHz in DL380p (2U) and DL360p (1U)
• 130 watt 8c & 6c in BL460c (16 in 10U), SL230 (8 in 4U), and SL250 (4 in 4U)
• Turbo Boost deserves a fresh look (e.g. +400 MHz)
DL380p 8SFF Model w/optional 8SFF hot swap drives
Fastest Memory: ProLiant Gen8 DIMMs
Intel E5 (SB) = 4 memory channels, so 2p servers have 8 channels with 2 or 3 DPC

DIMM Description                             1DPC (DDR3-)   2DPC (DDR3-)   3DPC (DDR3-)
8GB 2Rx4 PC3-12800R 1.5V DDR3-1600 RDIMM     1600           1600           1333 (1)
16GB 2Rx4 PC3-12800R 1.5V DDR3-1600 RDIMM    1600           1600           1333 (1)
4GB 2Rx8 PC3-12800E 1.5V DDR3-1600 UDIMM     1600
8GB 2Rx8 PC3-12800E 1.5V DDR3-1600 UDIMM     1600
8 Dual Rank DIMMs are optimum if that meets your memory capacity requirements.
Explanation: The memory bus is forced to idle for one clock when switching between ranks on the same DIMM, and for 2 clocks when switching between ranks on different DIMMs. So 1 DPC outperforms 2 DPC at the same capacity and same number of ranks on the channel.
Unregistered DIMMs (UDIMMs) offer a 1 clock latency advantage when there is only 1 DIMM per Channel (DPC). UDIMM failure rates are higher, so use these judiciously.
New 4 June, 2012
Why dual rank?
For the same memory speed and DIMM type, more ranks will result in lower loaded latency. We enable rank interleaving when dual-rank DIMMs are installed on a channel. So more ranks give the memory controller a greater capability to parallelize the processing of memory requests. This results in shorter request queues and therefore lower latency.
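As a rough worked example of what the channel count and DIMM speeds above mean, the sketch below computes theoretical peak memory bandwidth for a 2-socket server. It assumes the standard 64-bit (8-byte) DDR3 channel width and uses the nominal 1600 and 1333 MT/s figures; sustained bandwidth and loaded latency in practice depend on rank count and access pattern, which this arithmetic does not model.

```python
# Theoretical peak DDR3 bandwidth for a 2-socket E5-2600 server.
BYTES_PER_TRANSFER = 8    # assumption: a DDR3 channel is 64 bits wide
CHANNELS_PER_SOCKET = 4   # from the slide: Intel E5 (SB) = 4 memory channels
SOCKETS = 2               # "2p" server


def peak_bandwidth_gbs(mega_transfers_per_s):
    """Peak bandwidth in GB/s (10^9 bytes/s) summed over all channels."""
    channels = CHANNELS_PER_SOCKET * SOCKETS
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


if __name__ == "__main__":
    for speed in (1600, 1333):
        print(f"DDR3-{speed}: {peak_bandwidth_gbs(speed):.1f} GB/s peak")
```

Dropping from 1600 to 1333 at 3DPC costs about one sixth of the theoretical peak, which is the trade-off against the extra capacity.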
Platform Tuning Advice for Low Latency
Updated White Paper: Configuring and Tuning HP ProLiant Servers for Low-Latency Applications
Posted at http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf
Disable Power and CPU Monitoring SMI
Eliminates the latency spike of magnitude >200 μsec that this System Management Interrupt (SMI) causes 8x/sec on managed servers
Turns off P-state monitoring so server always runs at full speed
Consider Disabling Memory Pre-Failure Notification SMI
Eliminates an SMI that occurs once per 5 min for Gen8 and once/hour for G7.
Correctable and uncorrectable memory error handling is unaffected by turning off notification of the number of correctable errors
Do this with the new HPRCU, Conrep scripting tool or RBSU Advanced Menu
Conrep now available for Solaris too
See User Guide for ROM-Based Setup Utility (RBSU) for explanation of BIOS settings
Pub #347563-405 June, 2012 at: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00191707/c00191707.pdf
Run HP-TimeTest utility v7.2 for a quick jitter check
Request the free utility via e-mail to [email protected]. Include your company name, city/country, and HP sales rep/reseller if known so that the right regional person can respond.
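To illustrate the idea behind a jitter check, here is a minimal sketch. It is not HP-TimeTest and does not reproduce its method. It spins on a monotonic clock and records any gap between consecutive reads that exceeds a threshold, which is how interruptions such as SMIs show up as latency spikes. The function name and parameters are hypothetical, and Python's interpreter overhead means the numbers are only indicative; a real tool would be written in C and pinned to an isolated core.

```python
import time


def measure_spikes(duration_s, threshold_us):
    """Spin reading the monotonic clock; record gaps longer than threshold_us.

    Returns a list of (elapsed_seconds, gap_microseconds) tuples.
    """
    threshold_ns = int(threshold_us * 1_000)
    spikes = []
    start = prev = time.perf_counter_ns()
    end = start + int(duration_s * 1e9)
    while prev < end:
        now = time.perf_counter_ns()
        gap = now - prev
        if gap > threshold_ns:
            # Something (interrupt, scheduler, SMI, ...) stole time from the loop.
            spikes.append(((now - start) / 1e9, gap / 1_000))
        prev = now
    return spikes


if __name__ == "__main__":
    found = measure_spikes(duration_s=1.0, threshold_us=50.0)
    worst = max((gap for _, gap in found), default=0.0)
    print(f"{len(found)} spikes above threshold, worst {worst:.1f} usec")
```

Running it before and after a BIOS change gives a quick, if coarse, view of whether the spike count moved.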
The Benefit of Low Latency Tuning – minimized jitter
Plots of HP-TimeTest output (two time-history plots; x-axis: Elapsed Time, 0 to 600 seconds; y-axes: latency in μsecs, 0 to 9, and latency in cycles, 0 to 25000)
[Plot] Latency Spikes: Time History, DL380p Gen8, E5-2643 @ 3.300 GHz, RHEL 6.2/2.6.32-220.el6.x86_64, HP-TimeTest 7.2
• with current LL tuning on SNB, we observe spikes <9 μsec
• Jitter observed in 7-8 microsecond range
• threshold set to 3 μsec
[Plot] Latency Spikes: Time History, DL380p Gen8, E5-2643 @ 3.300 GHz, RHEL 6.2/2.6.32-220.el6.x86_64, HP-TimeTest 7.2, bootleg BIOS (06/22/2012)
• with prototype HP BIOS option for SNB memory power refresh, we observe spikes <3 μsec ! (to be released mid-Oct’12)
• Jitter observed in 1.5 – 2.5 microsecond range !
• threshold set to 1.5 μsec
Why PCIe Gen3 matters...
ProLiant Gen8 servers with ConnectX-3 based Adapters and VMA acceleration enable
2 μsec trading advantage!
VMA v6 - TCP – Improved Capability In ConnectX-3
(Feature, then CX-2, CX-3 and Description for each)
Connection Steering
  CX-2: MAC+IP per process in addition to Server MAC+IP
  CX-3: No additional MAC+IP. Use Server’s MAC+IP
  Description: ConnectX-3 implements Flow Steering
Multithread support
  CX-2: QP per process. Multi-threaded applications will share same CQ
  CX-3: QP per thread/socket
  Description: ConnectX-3 Flow Steering enables finer performance tuning and optimizations
DHCP
  CX-2: Not supported
  CX-3: Supported
Bonding & HA
  CX-2: Not supported
  CX-3: Supported (Q1’12)
VLAN
  CX-2: Not supported
  CX-3: Supported
IP routing gateway
  CX-2: Single default GW is supported per process and requires per process configuration
  CX-3: Host stack routing table is supported
  Description: ConnectX-3 Flow Steering enables utilizing the host IP stack
CX-3 Introduces 40GbE!
TCP Latency Improvement (Netperf 10GbE)
[Plot: ½ round-trip latency (μsec, 0 to 10) vs. Message Size (Bytes, 1 to 1024). Series: G7 X5687 3.6GHz ConnectX-2; G7 X5687 3.6GHz ConnectX-3/VMA; Gen8 E5-2690 2.9GHz ConnectX-3/VMA]
Back-to-back configuration (no Switch), ½ Round Trip; Netperf v2.5.0; MTU size = 1470 Bytes
RHEL 6.1; ConnectX-3 FW 2.10.2220; Driver: OFED-VMA 1.5.3-0008; VMA 6.1.6
Command Line: netperf -n 16 -H <peer ip> -c -C -P 0 -t TCP_RR -l 10 -T 2,2 -- -r <message size>
HP/Mellanox Solution now accelerates TCP as well as UDP protocols
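The message-size sweep on the plot's x-axis can be scripted around the netperf command line quoted on this slide. The sketch below only builds the command strings, using exactly the flags shown; the helper name and the peer address (a documentation-range IP) are hypothetical, and it does not launch netperf or set up VMA.

```python
import shlex

# Message sizes taken from the x-axis of the latency plot.
MESSAGE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]


def netperf_command(peer_ip, message_size):
    """Return the slide's TCP_RR command line as an argument list."""
    return [
        "netperf", "-n", "16", "-H", peer_ip, "-c", "-C", "-P", "0",
        "-t", "TCP_RR", "-l", "10", "-T", "2,2", "--", "-r", str(message_size),
    ]


if __name__ == "__main__":
    peer = "192.0.2.10"  # hypothetical peer address, replace with the real one
    for size in MESSAGE_SIZES:
        print(shlex.join(netperf_command(peer, size)))
```

Each printed line can be run by hand, or the argument list can be passed to `subprocess.run` once a netserver is listening on the peer.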
Application Accelerator Options
FSI customers use accelerators for faster feed handlers, order execution engines, and compute-intensive risk & pricing calculations
ISS/HPC team helps certify accelerators in ProLiant
Computational accelerator partner FPGAs:
• NVIDIA (SL2X0 Gen8 with Tesla cards)
HFT accelerator solution partners:
• ActivFinancial (OEMs DL380)
• Tervela (OEMs DL380)
• Ulink (OEMs DL160)
Gen8 servers enhance our support for accelerators
• DL380p risers now support double wide HL PCIe cards with aux power cable options at PCIe Gen3 speeds!
Rapid changes underway: FPGA vendors adding 10GbE; 10GbE vendors adding FPGAs; switches adding FPGAs…
Application Programming for Low Latency
Determine how many cores your trading strategy requires. Can it run on 8 cores? If so, match up CPU+NIC per strategy.
Maximize your Application resources by doing the following:
1. Bind threads, interrupts and processes to cores using CPU_ID:
   /usr/bin/taskset -c 0,1 /usr/bin/numactl --localalloc …. (other command line options)
   or use Red Hat “tuna” to do this with GUI (in RHEL 5.5 MRG and RHEL 6.0 standard)
   Beginning with SandyBridge on-chip PCIe controllers, bind NICs to cores for minimum QPI latencies
2. Program memory accesses for NUMA awareness. See: http://bizsupport2.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf
3. Place “communication” function threads on adjacent cores
4. Use PCM to determine L3 Cache misses & keep data in L3 Cache
   http://software.intel.com/file/41604
5. Compile with Performance Settings, Use PGO, Evaluate IPP / SSE 4.2 Strings
   http://software.intel.com/en-us/articles/using-avx-without-writing-avx-code/
Implement application-transparent multicast acceleration between nodes: link Mellanox’s VMA v6 library to the application for kernel bypass over Ethernet and IB (HP now resells VMA)
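Binding to cores, as in step 1, can also be done from inside the application rather than with taskset. The sketch below is one possible way to do it on Linux with Python's os.sched_setaffinity; the helper name is hypothetical, and it covers only process-to-core binding, not interrupt steering or NUMA memory policy.

```python
import os


def pin_current_process(cores):
    """Restrict the calling process to the given CPU ids (Linux only).

    Returns the affinity set actually in effect afterwards.
    """
    allowed = os.sched_getaffinity(0)   # cores this process may currently use
    wanted = set(cores) & allowed       # ignore cores we are not allowed on
    if not wanted:
        raise ValueError(
            f"none of {sorted(cores)} are in the allowed set {sorted(allowed)}"
        )
    os.sched_setaffinity(0, wanted)     # pid 0 means "this process"
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Demo: pin to the lowest-numbered core we are allowed to run on.
    first = min(os.sched_getaffinity(0))
    print("now pinned to cores:", sorted(pin_current_process({first})))
```

Calling `pin_current_process({0, 1})` would correspond to the `taskset -c 0,1` example, provided cores 0 and 1 are in the allowed set.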
FSI-HPC™ Solutions for Capital Markets
• Ultra low latency systems for High Frequency Trading
• Low power choices for grid computing
  • SL200s servers with GPU options
  • Moonshot program for ARM, Atom, Phi
• Open reference architecture for unstructured data
• Quality infrastructure for IT cost reduction
Demonstrating the value of SL6500 servers
Built on ProActive Insight Architecture

SL230s: HPC optimized for maximum performance, efficiency and density
• Purpose-built for HPC performance at scale
• Up to 1 integrated I/O Accelerator
• Maximum speed FDR IB FlexibleLOM
• Multi-node 1/2U density and efficiency
• Enhanced, simple front serviceability
• Rack level power management
• Industry Leading Mgmt with Insight Control*

SL250s: HPC optimized for efficiency and density, with balanced GPU performance
• Purpose-built for HPC performance at scale
• Up to 3 integrated GPUs
• Maximum speed FDR IB FlexibleLOM
• Multi-node 1U density and efficiency
• Enhanced, simple front serviceability
• Rack level power management
• Industry Leading Mgmt with Insight Control*
GPU Direct RDMA (previously known as GPU Direct 3.0)
Enables peer to peer communication directly between HCA and GPU
Dramatically reduces overall latency for GPU to GPU communications by bypassing the host CPU’s memory
“GPUDirect RDMA” for Peer-to-Peer I/O
[Diagram: two nodes, each with CPU, System Memory, GPU with GDDR5 Memory, and a Mellanox HCA (Mellanox VPI), connected over PCI Express 3.0]
Availability: GPUDirect RDMA requires CUDA 5.0 and MLNX_OFED driver changes (beta 9/12 with expected GA by 12/12).
HP/Nvidia Gen 8 GPU Starter Kit V2.0 in Americas
– Configuration:
• 1 DL380 control node w/ E5-2670 8 core 2.6GHz 115W CPUs, 64 GB RAM and 2x 600 GB HDD
• 1 SL6500 enclosure
• 4 SL250s 2U server trays w/ E5-2670 8 core 2.6GHz 115W CPUs, 64 GB RAM, 600 GB HDD, 2 Nvidia M2090 GPU modules
• Mellanox IB 4x QDR 36 port managed switch
• HPN ProCurve 2910 24 port 10/100/1000 Ethernet switch
• RHEL
• CMU
• Linux Value Pack
• Rack and infrastructure
• Hardware/Software Integration
– Development Environment for commercial, enterprise, Higher Ed, ISVs
– CUDA Programming Environment
– Proof-of-concept environment for channel partners
– End-user Price ~$70K
– Contact HP Sales for detailed BOM
Throughput-bound applications pervade the trading lifecycle
• Strategy Development and Testing
  - Back Testing
  - Optimization
• Strategy Execution
  - Search
  - Matching
  - Execution
• Other latency sensitive apps
  - Online Risk Management
• Data Storage and Analysis
  - Historical market data
  - Firm-wide log consolidation
  - Data Publishing for on-demand large analytics
• Post Trade Analysis and Compliance
  - Full trade history logs and analytics
  - Venue latencies
  - Transaction Costs
  - Risk Analytics
A New Era of Extreme Scale Computing
From tens of servers per rack sharing nothing to thousands sharing everything
HP Redstone Server Development Platform
HP Discovery Lab Proof of Concept Lab
HP Pathfinder Program Partner Collaboration
HP Project Moonshot Infrastructure Federated Management, Fabric, Storage Networking, Power/Cooling
HP Converged Infrastructure
HP ‘Redstone’ Server Development Platform
Perfect for development and testing with unparalleled density, flexibility, and simplicity
ProLiant SL 6500s chassis
HP ‘Redstone’ Development Platform Server tray
Up to 72 servers in a single 1U tray
4 trays in a single 4U chassis
Shared SL 6500 scalable system enclosure
• Pooled power—4 common slot power supplies
• Shared cooling—8 shared fans, N+1, rear-serviceable
• Integrated, configurable network fabric with up to 16 10Gb uplinks
Up to 288 servers (18 quad node compute cartridges per server tray)
• Calxeda EnergyCore™ quad-core ARM SoCs w/4MB L2 cache
• Up to 4GB ECC (up to 1333 MHz) memory per server
• Integrated management
Shared and configurable storage
• Diskless or up to 4 SATA drives (1 drive cartridges) per server
• Up to 192 SSD or 96 2.5” SFF HDD per enclosure
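The density figures on this slide follow from simple multiplication, sketched below as a check. The constant names are illustrative; the counts themselves (quad node cartridges, 18 cartridges per tray, 4 trays per chassis) come from the slide.

```python
# Server density of the 'Redstone' platform, from the slide's own numbers.
NODES_PER_CARTRIDGE = 4    # "quad node compute cartridges"
CARTRIDGES_PER_TRAY = 18   # "18 ... cartridges per server tray"
TRAYS_PER_CHASSIS = 4      # "4 trays in a single 4U chassis"

servers_per_tray = NODES_PER_CARTRIDGE * CARTRIDGES_PER_TRAY
servers_per_chassis = servers_per_tray * TRAYS_PER_CHASSIS

if __name__ == "__main__":
    print(f"{servers_per_tray} servers per tray, {servers_per_chassis} per 4U chassis")
```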
Energy, cost and space savings move the industry to new infrastructure
Breakthrough Savings and Simplicity
              Traditional x86    HP ‘Redstone Server’
Cost          $3.3M              $1.2M
Servers       400                1,600
Racks         10                 1/2
Switches      20                 2
Cables        1,600              41
Power         91 kilowatts       9.9 kilowatts

89% less energy, 94% less space, 63% less cost, 97% less complexity
Select hyperscale web, and data analytics applications show tremendous promise
Based on weighted average performance projections for workloads such as web serving, memcached, and Data Analytics. Cost estimates include infrastructure, space, and power and cooling costs over three years.
FSI-HPC™ Solutions for Capital Markets
• Ultra low latency systems for High Frequency Trading
• Low power choices for grid computing
• Open reference architecture for unstructured data
  • scalable Hadoop clusters with CMU
  • analysis with Vertica and Autonomy
• Quality infrastructure for IT cost reduction
Your data is going unstructured
What is Hadoop?
The digital universe will expand by almost half in 2012 - 90% of that data is unstructured
Traditional systems are not designed to analyze unstructured data
Hadoop is designed specifically to extract business value from unstructured data
Risk Modeling Fraud Detection Customer Retention Sentiment Analysis Web Mining
Financial Services Government Retail Telecom Media
Click Stream Analysis using Hadoop, Vertica and Autonomy
How does Hadoop fit into existing BI ecosystems
[Diagram] Platform layers:
• HP Converged Infrastructure
• Operating System
• Hadoop Distributions (Cloudera, MapR, Hortonworks)
• HP Insight CMU
• Consulting Services
[Diagram] Data flow:
• Unstructured Click Stream Data: Navigation paths, Time per page, Products Browsed
• Data Assimilation: Data Consolidation, Aggregation; Transformation into structured data
• Vertica: Ad hoc SQL Compliant Analytics (Multi-dimensional analysis, Predictive analysis, Geographical analysis)
• Autonomy IDOL: Meaning Based Analytics (User segmentation, Software testing, Market research)
• Business Users
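As a small illustration of "transformation into structured data", the sketch below turns raw click events into one structured row per session, holding the navigation path and time per page. It is plain Python on made-up sample data, not Hadoop, Vertica or Autonomy code; in a Hadoop job the grouping by session would roughly correspond to the shuffle, and the per-session logic to the reducer. All names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw click events: (session_id, unix_timestamp, page)
RAW_CLICKS = [
    ("s1", 100, "/home"),
    ("s1", 130, "/products/servers"),
    ("s1", 190, "/cart"),
    ("s2", 105, "/home"),
    ("s2", 115, "/support"),
]


def sessionize(clicks):
    """Group clicks by session and derive path and seconds spent per page."""
    by_session = defaultdict(list)
    for session_id, ts, page in clicks:
        by_session[session_id].append((ts, page))

    rows = {}
    for session_id, events in by_session.items():
        events.sort()  # order each session's clicks by timestamp
        path = [page for _, page in events]
        dwell = defaultdict(int)
        # Time on a page is the gap until the next click; the last page of a
        # session has no following click, so its dwell time is unknown.
        for (ts, page), (next_ts, _) in zip(events, events[1:]):
            dwell[page] += next_ts - ts
        rows[session_id] = {"path": path, "seconds_per_page": dict(dwell)}
    return rows


if __name__ == "__main__":
    for session_id, row in sorted(sessionize(RAW_CLICKS).items()):
        print(session_id, row)
```

Rows of this shape are the kind of structured output that could then be loaded into an analytic database for SQL queries.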
HP offers the shortest route to Hadoop success
Open strategy that combines Hadoop with advanced analytics and management
Seamless analytics
Leading Distributions
Insight cluster management utility
Consulting Services
Choice of solutions
• Deploy in days, not months
• Scale to thousands of nodes with the push of a button
• Manage with single pane of glass
• Optimize with real time and 3-D historical views of compute resources
• Perform end to end analytics
Address the explosion of data permeating the data center
HP HyperStorage Server
ProLiant SL 4500
Shared SL 4500 HyperStorage chassis
• Pooled power: 4 HP common slot power supplies
• Shared cooling: 10 shared fans, N+1, rear-serviceable
• Shared management: Reduced cabling with single iLO port
Most dense storage available in market today
• Up to 60 LFF drives in a single chassis giving a total of 180 TB of available storage
Multiple configurations available
• Single server model gives the most dense storage solution for massive data stores
• Triple server gives users optimal mix of storage and compute for working inside large unstructured datasets
• Dual server provides an optimal mix of high density storage and compute
Single node: 180TB Storage
Dual node: 2 x 75TB Storage
Triple node: 3 x 45TB Storage
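The capacity figures imply a drive size and a per-node drive count that the slide does not state directly. The sketch below derives them, assuming every configuration uses the same drive size as the 60-drive, 180 TB single-node case; the resulting drive counts are an inference, not a published specification.

```python
# Derive drive size and drives per node from the slide's capacity figures.
DRIVES_SINGLE_NODE = 60        # "Up to 60 LFF drives in a single chassis"
CAPACITY_SINGLE_NODE_TB = 180  # "a total of 180 TB"

tb_per_drive = CAPACITY_SINGLE_NODE_TB / DRIVES_SINGLE_NODE

# (nodes per chassis, TB per node) as listed on the slide
CONFIGS = {"single": (1, 180), "dual": (2, 75), "triple": (3, 45)}


def drives_per_node(tb_per_node):
    """Drives behind each node, assuming the same drive size throughout."""
    return round(tb_per_node / tb_per_drive)


if __name__ == "__main__":
    print(f"implied drive size: {tb_per_drive:.0f} TB")
    for name, (nodes, tb) in CONFIGS.items():
        per_node = drives_per_node(tb)
        print(f"{name}: {per_node} drives per node, {nodes * per_node} per chassis")
```

Under that assumption, the dual and triple configurations trade some of the 60 drive bays for the additional server nodes.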
HP ProLiant SL4500 Solution Efficiency
Three Node vs. Traditional Similar Deployment
SL45xx Overview and Features
Designed for Density
First HP ProLiant server, built purely with storage intensive applications in mind
Densest HDD option in HP ProLiant portfolio
Various configurations allow customer selection for optimization for their unique data center needs
FSI-HPC™ Solutions for Capital Markets
• Ultra low latency systems for High Frequency Trading
• Low power choices for grid computing
• Open reference architecture for unstructured data
• Quality infrastructure for IT cost reduction
  • ProActive Insight Architecture
  • Performance Optimized Datacenters
HP ProLiant Gen8: The World’s Most Self-Sufficient Servers
With HP ProActive Insight architecture:
• Integrated Lifecycle Automation: 3X admin productivity improvement
• Dynamic Workload Acceleration: 6X performance increase for the most demanding workloads
• Automated Energy Optimization: 70% more compute per watt
• Proactive Service & Support: 66% faster time to problem resolution
Serviceability with Quality: www.youtube.com/watch?v=AZw-LG-oyDU
Designed to Simplify, Integrate and Automate your Infrastructure
HP ProActive Insight Architecture
HP FlexNet Adapters
HP Smart Storage
Insight Online
iLO4 Management Engine
Datacenter Smart Grid Virtual Connect
Sea of Sensors 3D
“ProLiant” Operating Environment
Integrated Lifecycle Automation / Dynamic Workload Acceleration / Automated Energy Optimization / ProActive Service and Support
Gen8 Smart Array Innovations
Faster access to data
• Up to 2X performance improvement*
• 2X Write Cache (up to 2 GB)
Address explosive data growth
• 2X # of Drives supported (up to 227)
Minimize data loss
• Long term data retention with Flash Backed Write Cache standard
Reduce initial setup time
• 95% reduction in parity initialization from several days to 5 hours**
* 256KiB, Sequential write, RAID 5 with 15K SAS drives, performance will vary based on configuration
** HP R & D, Validation information TBD
External model with SAS cable connectors for extending the RAID set to JBODs
Increased performance, data availability and storage capacity
Thank you