© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

HP Gen8 technologies for low latency, high performance trading and exchanges

Patrick Greene Solution Architect – HP HPC on Wall Street 9/19/12


Experience matters

#1 in x86 server market share 16+ years straight – 65 consecutive quarters in both factory revenue and units

#1 in blade server market share 5 ¾ years straight – 23 consecutive quarters in both factory revenue and units

HP’s leadership in the datacenter has been built over years of innovation, experience and market leadership.

HP ProLiant

Source: IDC Worldwide Quarterly Server Tracker, August 2012. Includes Compaq ProLiant from Q1 1996 through Q2 2002


FSI-HPC™ Solutions for Capital Markets

• Ultra Low Latency Systems for High Frequency Trading

  • Fastest Xeon™ performance

  • Tuning White Paper and HP-TimeTest utility

  • HP/Mellanox™ TCP/UDP kernel bypass

• Low power choices for grid computing

• Open reference architecture for unstructured data

• Quality infrastructure for IT cost reduction


Low Latency Systems Require Optimization at every layer in the Solution Stack

Use Cases

Exchange Matching Engines

Market Data Distribution

High Frequency Algorithmic Trading

Pre/Post Trade Analytics

Real Time Enterprise Risk Management

Definitions:

Solution - includes messaging middleware; in-house apps; design services

System - integrated server/networking/storage infrastructure

Components - specific servers/OS/switches/file system in the “system”

X86-64 Server Architecture

Firmware and Operating System

Integrated Acceleration

High Speed Storage

Low Latency FSI Solution Stack

Server I/O Fabric

Messaging Middleware

Application Environment

Fab. Mgmt

Precision Timing

Use Cases / Lines of Business


Optimized Form Factors to meet a variety of needs

• DL rack-mount servers for expandability

  • All top bin E5-2600 processors offered with 3DPC

  • DL380p option for 25 disks in 2U 2P Gen8

• BL systems with integrated networking

  • Integrated chassis system for redundancy & TCO

  • Gen8 NIC/Switch options leveraging PCIe Gen3

• SL multi-node systems for scale-out grids

  • Optimized for performance, power and price at scale

• ML mini-tower for ultimate expandability

  • ML350 model (rack mount or mini tower) for even more disk, 9 PCI slots!


HP Gen8 Servers (Sandy Bridge E5-2600)

Three top bin Processors circled

• 8c 3.1GHz in HP Z820 workstation (4U with racking kit; no iLO4)

• 8c 2.9GHz and 4c 3.3GHz in DL380p (2U) and DL360p (1U)

• 130 watt 8c & 6c in BL460c (16 in 10U), SL230 (8 in 4U), and SL250 (4 in 4U)

• Turbo Boost deserves a fresh look (e.g. +400 MHz)

DL380p 8SFF Model w/optional 8SFF hot swap drives


Fastest Memory: ProLiant Gen8 DIMMs

Intel E5 (SB) = 4 memory channels, so 2P servers have 8 channels with 2 or 3 DPC

| DIMM Description                          | 1DPC (DDR3-) | 2DPC (DDR3-) | 3DPC (DDR3-) |
|-------------------------------------------|--------------|--------------|--------------|
| 8GB 2Rx4 PC3-12800R 1.5V DDR3-1600 RDIMM  | 1600         | 1600         | 1333¹        |
| 16GB 2Rx4 PC3-12800R 1.5V DDR3-1600 RDIMM | 1600         | 1600         | 1333¹        |
| 4GB 2Rx8 PC3-12800E 1.5V DDR3-1600 UDIMM  | 1600         |              |              |
| 8GB 2Rx8 PC3-12800E 1.5V DDR3-1600 UDIMM  | 1600         |              |              |

8 dual-rank DIMMs are optimum if that meets your memory capacity requirements

Explanation: The memory bus is forced to idle for one clock when switching between ranks on the same DIMM, and for two clocks when switching between ranks on different DIMMs. So 1DPC outperforms 2DPC at the same capacity and the same number of ranks on the channel.

Unregistered DIMMs (UDIMMs) offer a 1 clock latency advantage when there is only 1 DIMM per channel (DPC)

UDIMM failure rates are higher, so use these judiciously

New 4 June, 2012

Why dual rank?

For the same memory speed and DIMM type, more ranks will result in lower loaded latency. We enable rank interleaving when dual-rank DIMMs are installed on a channel. So more ranks give the memory controller a greater capability to parallelize the processing of memory requests. This results in shorter request queues and therefore lower latency.
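The channel count and speed figures on this slide can be checked with a short calculation. This is only a sketch: the function and constant names are illustrative, the speed table simply restates the RDIMM rows from the slide, and the peak bandwidth figure assumes the standard 8-byte (64-bit) DDR3 data bus, so it is a theoretical ceiling and not a measured result.

```python
# RDIMM speeds (MT/s) by DIMMs per channel, restated from the slide's table.
RDIMM_SPEED_BY_DPC = {1: 1600, 2: 1600, 3: 1333}

CHANNELS_PER_SOCKET = 4   # Intel E5 (Sandy Bridge), per the slide
BYTES_PER_TRANSFER = 8    # assumption: standard 64-bit DDR3 data bus


def memory_config(sockets, dpc, dimm_gb):
    """Return DIMM count, capacity, speed and theoretical peak bandwidth."""
    if dpc not in RDIMM_SPEED_BY_DPC:
        raise ValueError("dpc must be 1, 2 or 3")
    channels = sockets * CHANNELS_PER_SOCKET
    speed = RDIMM_SPEED_BY_DPC[dpc]
    return {
        "channels": channels,
        "dimms": channels * dpc,
        "capacity_gb": channels * dpc * dimm_gb,
        "speed_mts": speed,
        # MT/s * bytes per transfer = MB/s per channel; /1000 gives GB/s.
        "peak_gb_per_s": channels * speed * BYTES_PER_TRANSFER / 1000,
    }


if __name__ == "__main__":
    for dpc in (1, 2, 3):
        print(dpc, memory_config(sockets=2, dpc=dpc, dimm_gb=16))
```

For a 2P server with 16GB RDIMMs, 1DPC gives 8 DIMMs and 128 GB at DDR3-1600, while 3DPC triples the capacity but drops the speed to 1333.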


Platform Tuning Advice for Low Latency

Updated White Paper: Configuring and Tuning HP ProLiant Servers for Low-Latency Applications

Posted at http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf

Disable Power and CPU Monitoring SMI

Eliminates the latency spikes of magnitude >200 μsec that this System Management Interrupt (SMI) causes 8 times per second on managed servers

Turns off P-state monitoring so the server always runs at full speed

Consider Disabling Memory Pre-Failure Notification SMI

Eliminates an SMI that occurs once per 5 minutes for Gen8 and once per hour for G7.

Correctable and uncorrectable memory error handling is unaffected by turning off notification of the number of correctable errors

Do this with the new HPRCU, Conrep scripting tool or RBSU Advanced Menu

Conrep now available for Solaris too

See User Guide for ROM-Based Setup Utility (RBSU) for explanation of BIOS settings

Pub #347563-405 June, 2012 at: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00191707/c00191707.pdf

Run HP-TimeTest utility v7.2 for a quick jitter check

Request free utility via e-mail to [email protected] Include your company name, city/country, and HP sales rep/reseller if known so that the right regional person can respond.
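HP-TimeTest itself is not shown in the slides. The general idea of a jitter check can be sketched as a tight loop that reads a clock repeatedly and records any gap between consecutive reads that exceeds a threshold. The function name, parameters and default threshold below are hypothetical, and an interpreted Python loop adds far more overhead than a native tool would, so the numbers it reports are illustrative only.

```python
import time


def find_latency_spikes(duration_s=0.2, threshold_us=50.0):
    """Spin on a monotonic clock; return (elapsed_s, gap_us) for large gaps."""
    spikes = []
    threshold_ns = threshold_us * 1000.0
    start = prev = time.perf_counter_ns()
    end = start + int(duration_s * 1e9)
    while prev < end:
        now = time.perf_counter_ns()
        gap = now - prev
        if gap > threshold_ns:
            # Something (interrupt, SMI, scheduler) stole time from the loop.
            spikes.append(((now - start) / 1e9, gap / 1000.0))
        prev = now
    return spikes


if __name__ == "__main__":
    found = find_latency_spikes()
    worst = max((gap for _, gap in found), default=0.0)
    print(f"{len(found)} spikes over threshold, worst {worst:.1f} usec")
```

Running the loop pinned to an isolated core, before and after a BIOS change, gives a rough before/after comparison of how often the loop is interrupted.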


The Benefit of Low Latency Tuning – minimized jitter

Plots of HP-TimeTest output:

With current LL tuning on SNB, we observe spikes <9 μsec

[Plot: Latency Spikes: Time History, DL380p Gen8, E5-2643 @ 3.300 GHz, RHEL 6.2/2.6.32-220.el6.x86_64, HP-TimeTest 7.2. Axes: latency (μsecs, 0 to 9) and latency (cycles, 0 to 25000) against Elapsed Time (seconds, 0 to 600). Series: spike (usec). Threshold set to 3 μsec. Jitter observed in 7-8 microsecond range.]

[Plot: Latency Spikes: Time History, DL380p Gen8, E5-2643 @ 3.300 GHz, RHEL 6.2/2.6.32-220.el6.x86_64, HP-TimeTest 7.2, bootleg BIOS (06/22/2012). Same axes. Series: spike (cycle). Threshold set to 1.5 μsec. Jitter observed in 1.5 – 2.5 microsecond range!]

with prototype HP BIOS option for SNB memory power refresh, we observe spikes <3 μsec ! (to be released mid-Oct’12)
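The plots carry two vertical axes, cycles and microseconds, which are related by the clock rate quoted in the caption (3.300 GHz). A small conversion, sketched here with illustrative function names, shows how a cycle count on one axis maps to a time on the other.

```python
CLOCK_GHZ = 3.3  # E5-2643 clock rate from the plot caption


def cycles_to_usec(cycles, clock_ghz=CLOCK_GHZ):
    """Convert a cycle count to microseconds (1 GHz = 1000 cycles per usec)."""
    return cycles / (clock_ghz * 1000.0)


def usec_to_cycles(usec, clock_ghz=CLOCK_GHZ):
    """Convert microseconds to a rounded cycle count."""
    return round(usec * clock_ghz * 1000.0)


if __name__ == "__main__":
    for cycles in (5000, 10000, 25000):
        print(cycles, "cycles =", round(cycles_to_usec(cycles), 2), "usec")
    print("3 usec =", usec_to_cycles(3.0), "cycles")
```

At 3.3 GHz the 25,000-cycle top of the axis corresponds to roughly 7.6 μsec, and a 3 μsec spike to about 9,900 cycles.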


Why PCIe Gen3 matters...

ProLiant Gen8 servers with ConnectX-3 based Adapters and VMA acceleration enable a 2 μsec trading advantage!


VMA v6 - TCP – Improved Capability In ConnectX-3

| Feature             | CX-2 | CX-3 | Description |
|---------------------|------|------|-------------|
| Connection Steering | MAC+IP per process in addition to Server MAC+IP | No additional MAC+IP. Use Server’s MAC+IP | ConnectX-3 implements Flow Steering |
| Multithread support | QP per process. Multi-threaded applications will share same CQ | QP per thread/socket | ConnectX-3 Flow Steering enables finer performance tuning and optimizations |
| DHCP                | Not supported | Supported | |
| Bonding & HA        | Not supported | Supported (Q1’12) | |
| VLAN                | Not supported | Supported | |
| IP routing gateway  | Single default GW is supported per process and requires per process configuration | Host stack routing table is supported | ConnectX-3 Flow Steering enables utilizing the host IP stack |

CX-3 Introduces 40GbE!


TCP Latency Improvement (Netperf 10GbE)

[Plot: Latency (usec, 0 to 10) against Message Size (Bytes, 1 to 1024). Series: G7 X5687 3.6GHz ConnectX-2; G7 X5687 3.6GHz ConnectX-3/VMA; Gen8 E5-2690 2.9GHz ConnectX-3/VMA]

Back-to-back configuration (no Switch), ½ Round Trip; Netperf v2.5.0; MTU size = 1470 Bytes

RHEL 6.1; ConnectX-3 FW 2.10.2220; Driver: OFED-VMA 1.5.3-0008; VMA 6.1.6

Command Line: netperf -n 16 -H <peer ip> -c -C -P 0 -t TCP_RR -l 10 -T 2,2 -- -r <message size>
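The command line above can be expanded into one invocation per message size on the chart's horizontal axis. The sketch below only builds the argument lists and does not run anything; the peer address is a placeholder from the documentation address range, the helper name is hypothetical, and the flags are copied unchanged from the slide.

```python
MESSAGE_SIZES = [2 ** n for n in range(11)]  # 1, 2, 4, ... 1024 bytes


def netperf_commands(peer_ip, sizes=MESSAGE_SIZES):
    """Build one netperf TCP_RR argument list per message size."""
    base = ["netperf", "-n", "16", "-H", peer_ip, "-c", "-C",
            "-P", "0", "-t", "TCP_RR", "-l", "10", "-T", "2,2", "--"]
    return [base + ["-r", str(size)] for size in sizes]


if __name__ == "__main__":
    for argv in netperf_commands("192.0.2.10"):  # placeholder peer address
        print(" ".join(argv))
```

Each list could be handed to `subprocess.run` on a host where netperf is installed and the peer is running netserver.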

HP/Mellanox Solution now accelerates TCP as well as UDP protocols

½ RT Latency (μsec)


Application Accelerator Options

FSI customers use accelerators for faster feed handlers, order execution engines, and compute-intensive risk & pricing calculations

ISS/HPC team helps certify accelerators in ProLiant

Computational accelerator partner FPGAs:

• NVIDIA (SL2X0 Gen8 with Tesla cards)

HFT accelerator solution partners:

• ActivFinancial (OEMs DL380)

• Tervela (OEMs DL380)

• Ulink (OEMs DL160)

Gen8 servers enhance our support for accelerators

• DL380p risers now support double wide HL PCIe cards with aux power cable options at PCIe Gen3 speeds!

Rapid changes underway: FPGA vendors adding 10GbE; 10GbE vendors adding FPGAs; switches adding FPGAs…


Application Programming for Low Latency

Determine how many cores your trading strategy requires

Can it run on 8 cores? If so, match up CPU+NIC per strategy

Maximize your application resources by doing the following:

1. Bind threads, interrupts and processes to cores using CPU_ID

/usr/bin/taskset -c 0,1 /usr/bin/numactl --localalloc …. (other command line options)

or use Red Hat “tuna” to do this with a GUI (in RHEL 5.5 MRG and RHEL 6.0 standard)

Beginning with SandyBridge on-chip PCIe controllers, bind NICs to cores for minimum QPI latencies

2. Program memory accesses for NUMA awareness

See: http://bizsupport2.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf

3. Place “communication” function threads on adjacent cores

4. Use PCM to determine L3 Cache misses & keep data in L3 Cache

http://software.intel.com/file/41604

5. Compile with Performance Settings, Use PGO, Evaluate IPP / SSE 4.2 Strings

http://software.intel.com/en-us/articles/using-avx-without-writing-avx-code/
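The core binding in step 1 can also be done from inside the application instead of through `taskset`. The sketch below uses Python's `os.sched_setaffinity`, which is available on Linux only; the helper names are hypothetical. `parse_cpulist` handles the comma-and-range format that Linux uses in files such as `/sys/devices/system/node/node0/cpulist`, which is one way to find the cores local to a given NUMA node.

```python
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def pin_to_cores(cores, pid=0):
    """Restrict a process (0 = caller) to the requested cores it may use."""
    wanted = set(cores) & os.sched_getaffinity(pid)
    if not wanted:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(pid, wanted)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    local = parse_cpulist("0-1")  # e.g. the contents of a node's cpulist file
    available = sorted(os.sched_getaffinity(0))
    # Fall back to the first available core if cores 0-1 are not permitted.
    target = [c for c in local if c in available] or available[:1]
    print("now running on cores:", sorted(pin_to_cores(target)))
```

Pinning from inside the process makes it possible to give different threads different core sets, which a single `taskset` wrapper around the whole program cannot do.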

Implement application-transparent multicast acceleration between nodes: link Mellanox’s VMA v6 library to the application for kernel bypass over Ethernet and IB (HP now resells VMA)


FSI-HPC™ Solutions for Capital Markets

• Ultra low latency systems for High Frequency Trading

• Low power choices for grid computing

  • SL200s servers with GPU options

  • Moonshot program for ARM, Atom, Phi

• Open reference architecture for unstructured data

• Quality infrastructure for IT cost reduction


Built on ProActive Insight Architecture

Demonstrating the value of SL6500 servers

SL230s: HPC optimized for maximum performance, efficiency and density

• Purpose-built for HPC performance at scale

• Up to 1 integrated I/O Accelerator

• Maximum speed FDR IB FlexibleLOM

• Multi-node 1/2U density and efficiency

• Enhanced, simple front serviceability

• Rack level power management

• Industry Leading Mgmt with Insight Control*

SL250s: HPC optimized for efficiency and density, with balanced GPU performance

• Purpose-built for HPC performance at scale

• Up to 3 integrated GPUs

• Maximum speed FDR IB FlexibleLOM

• Multi-node 1U density and efficiency

• Enhanced, simple front serviceability

• Rack level power management

• Industry Leading Mgmt with Insight Control*


GPU Direct RDMA (previously known as GPU Direct 3.0)

Enables peer to peer communication directly between HCA and GPU

Dramatically reduces overall latency for GPU to GPU communications by bypassing the host CPU’s memory

“GPUDirect RDMA” for Peer-to-Peer I/O

[Diagram: two systems, each with a CPU, System Memory, a GPU with GDDR5 Memory, and a Mellanox HCA (Mellanox VPI) connected over PCI Express 3.0]

Availability: GPUDirect RDMA requires CUDA 5.0 and MLNX_OFED driver changes (beta 9/12 with expected GA by 12/12).


HP/Nvidia Gen 8 GPU Starter Kit V2.0 in Americas

– Configuration:

• 1 DL380 control node w/ E5-2670 8 core 2.6GHz 115W CPUs, 64 GB RAM and 2x 600 GB HDD

• 1 SL6500 enclosure

• 4 SL250s 2U server trays w/ E5-2670 8 core 2.6GHz 115W CPUs, 64 GB RAM, 600 GB HDD, 2 Nvidia M2090 GPU modules

• Mellanox IB 4x QDR 36 port managed switch

• HPN ProCurve 2910 24 port 10/100/1000 Ethernet switch

• RHEL

• CMU

• Linux Value Pack

• Rack and infrastructure

• Hardware/Software Integration

– Development Environment for commercial, enterprise, Higher Ed, ISVs

– CUDA Programming Environment

– Proof-of-concept environment for channel partners

– End-user Price ~$70K

– Contact HP Sales for detailed BOM


Throughput-bound applications pervade the trading lifecycle

• Strategy Development and Testing

• Strategy Execution

• Other latency sensitive apps

• Data Storage and Analysis

• Post Trade Analysis and Compliance

- Full trade history logs and analytics

- Venue latencies

- Transaction Costs

- Risk Analytics

- Historical market data

- Firm-wide log consolidation

- Data Publishing for on-demand large analytics

- Back Testing

- Optimization

- Search

- Matching

- Execution

- Online Risk Management


A New Era of Extreme Scale Computing

From tens of servers per rack sharing nothing to thousands sharing everything

HP Redstone Server Development Platform

HP Discovery Lab Proof of Concept Lab

HP Pathfinder Program Partner Collaboration

HP Project Moonshot Infrastructure Federated

Management, Fabric, Storage Networking, Power/Cooling

HP Converged

Infrastructure


HP ‘Redstone’ Server Development Platform

Perfect for development and testing with unparalleled density, flexibility, and simplicity

ProLiant SL 6500s chassis

HP ‘Redstone’ Development Platform Server tray

Up to 72 servers in a single 1U tray

4 trays in a single 4U chassis

Shared SL 6500 scalable system enclosure

• Pooled power—4 common slot power supplies

• Shared cooling—8 shared fans, N+1, rear-serviceable

• Integrated, configurable network fabric with up to 16 10Gb uplinks

Up to 288 servers: 18 quad node compute cartridges per server tray

• Calxeda EnergyCore™ quad-core ARM SoCs w/4MB L2 cache

• Up to 4GB ECC (up to 1333 MHz) memory per server

• Integrated management

Shared and configurable storage

• Diskless or up to 4 SATA drives (1 drive cartridge) per server

• Up to 192 SSD or 96 2.5” SFF HDD per enclosure
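The density figures on this slide follow from simple multiplication, sketched below. The constants restate the slide, and the variable names are illustrative.

```python
NODES_PER_CARTRIDGE = 4      # "quad node compute cartridges"
CARTRIDGES_PER_TRAY = 18
TRAYS_PER_CHASSIS = 4
MAX_GB_PER_SERVER = 4        # "Up to 4GB ECC ... memory per server"

servers_per_tray = NODES_PER_CARTRIDGE * CARTRIDGES_PER_TRAY
servers_per_chassis = servers_per_tray * TRAYS_PER_CHASSIS
max_memory_gb = servers_per_chassis * MAX_GB_PER_SERVER

print(servers_per_tray, "servers per tray")
print(servers_per_chassis, "servers per 4U chassis")
print(max_memory_gb, "GB maximum memory per chassis")
```

This reproduces the 72 servers per tray and 288 servers per chassis quoted on the slide.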


Energy, cost and space savings move the industry to new infrastructure

Breakthrough Savings and Simplicity

|          | Traditional x86 | HP ‘Redstone’ Server |
|----------|-----------------|----------------------|
| Cost     | $3.3M           | $1.2M                |
| Servers  | 400             | 1,600                |
| Racks    | 10              | 1/2                  |
| Switches | 20              | 2                    |
| Cables   | 1,600           | 41                   |
| Power    | 91 kilowatts    | 9.9 kilowatts        |

89% less energy, 94% less space, 63% less cost, 97% less complexity

Select hyperscale web, and data analytics applications show tremendous promise
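The percentage savings can be reproduced from the raw figures on this slide. In the sketch below the input numbers are copied from the slide, while the helper name and the choice to count complexity as switches plus cables are assumptions. Because the inputs are rounded, a result may differ from the quoted percentage by about a point.

```python
TRADITIONAL = {"cost_musd": 3.3, "racks": 10, "switches": 20,
               "cables": 1600, "kilowatts": 91}
REDSTONE = {"cost_musd": 1.2, "racks": 0.5, "switches": 2,
            "cables": 41, "kilowatts": 9.9}


def percent_less(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


def savings():
    """Return the reductions implied by the slide's raw figures."""
    t, r = TRADITIONAL, REDSTONE
    return {
        "energy": percent_less(t["kilowatts"], r["kilowatts"]),
        "space": percent_less(t["racks"], r["racks"]),
        "cost": percent_less(t["cost_musd"], r["cost_musd"]),
        # Assumption: "complexity" counts switches plus cables.
        "complexity": percent_less(t["switches"] + t["cables"],
                                   r["switches"] + r["cables"]),
    }


if __name__ == "__main__":
    for name, pct in savings().items():
        print(f"{name}: {pct:.1f}% less")
```

Energy and complexity come out close to the quoted 89% and 97%; cost works out to roughly 64% and space to 95% from the rounded inputs, against the 63% and 94% on the slide.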

Based on weighted average performance projections for workloads such as web serving, memcached, and Data Analytics. Cost estimates include infrastructure, space, and power and cooling costs over three years.


FSI-HPC™ Solutions for Capital Markets

• Ultra low latency systems for High Frequency Trading

• Low power choices for grid computing

• Open reference architecture for unstructured data

  • Scalable Hadoop clusters with CMU

  • Analysis with Vertica and Autonomy

• Quality infrastructure for IT cost reduction


Your data is going unstructured

What is Hadoop?

The digital universe will expand by almost half in 2012 - 90% of that data is unstructured

Traditional systems are not designed to analyze unstructured data

Hadoop is designed specifically to extract business value from unstructured data

Risk Modeling, Fraud Detection, Customer Retention, Sentiment Analysis, Web Mining

Financial Services, Government, Retail, Telecom, Media


Operating System

Click Stream Analysis using Hadoop, Vertica and Autonomy

How does Hadoop fit into existing BI ecosystems

Hadoop Distributions (Cloudera, MapR, Hortonworks)

HP Converged Infrastructure

HP Insight CMU

Consulting Services

Meaning Based Analytics

User segmentation

Software testing

Market research

Vertica

Autonomy IDOL

Ad hoc SQL Compliant Analytics Business

Users

Multi-dimensional analysis

Predictive analysis

Geographical analysis

Data Assimilation

Data Consolidation, Aggregation

Transformation into structured data

Unstructured Click Stream Data

Navigation paths

Time per page

Products Browsed


HP offers the shortest route to Hadoop success

Open strategy that combines Hadoop with advanced analytics and management

Seamless analytics

Leading Distributions

Insight cluster management utility

Consulting Services

Choice of solutions

• Deploy in days, not months

• Scale to thousands of nodes with the push of a button

• Manage with single pane of glass

• Optimize with real time and 3-D historical views of compute resources

• Perform end to end analytics


Address the explosion of data permeating the data center

HP HyperStorage Server

ProLiant SL 4500

Shared SL 4500 HyperStorage chassis

• Pooled power – 4 HP common slot power supplies

• Shared cooling – 10 shared fans, N+1, rear-serviceable

• Shared management – Reduced cabling with single iLO port

Densest storage available in the market today

• Up to 60 LFF drives in a single chassis giving a total of 180 TB of available storage

Multiple configurations available

• Single server model gives the most dense storage solution for massive data stores

• Triple server gives users an optimal mix of storage and compute for working inside large unstructured datasets

• Dual server provides an optimal mix of high density storage and compute

Single node: 180TB Storage

Dual node: 2 x 75TB Storage

Triple node: 3 x 45TB Storage


HP ProLiant SL4500 Solution Efficiency

Three Node vs. Traditional Similar Deployment


SL45xx Overview and Features

Designed for Density

First HP ProLiant server built purely with storage-intensive applications in mind

Densest HDD option in HP ProLiant portfolio

Various configurations let customers optimize for their unique data center needs


FSI-HPC™ Solutions for Capital Markets

• Ultra low latency systems for High Frequency Trading

• Low power choices for grid computing

• Open reference architecture for unstructured data

• Quality infrastructure for IT cost reduction

  • ProActive Insight Architecture

  • Performance Optimized Datacenters


HP ProLiant Gen8: The World’s Most Self-Sufficient Servers

3X Admin productivity improvement

6X Performance increase for the most demanding workloads

70% More compute per watt

66% Faster time to problem resolution

Integrated Lifecycle Automation

Dynamic Workload Acceleration

Proactive Service & Support

Automated Energy Optimization

With HP ProActive Insight architecture:

Serviceability with Quality: www.youtube.com/watch?v=AZw-LG-oyDU


HP ProActive Insight Architecture

Designed to Simplify, Integrate and Automate your Infrastructure

HP FlexNet Adapters

HP Smart Storage

Insight Online

iLO4 Management Engine

Datacenter Smart Grid Virtual Connect

Sea of Sensors 3D

“ProLiant” Operating Environment

Integrated Lifecycle Automation / Dynamic Workload Acceleration / Automated Energy Optimization / ProActive Service and Support


Gen8 Smart Array Innovations

Faster access to data

• Up to 2X performance improvement*

• 2X Write Cache (up to 2 GB)

Address explosive data growth

• 2X # of Drives supported (up to 227)

Minimize data loss

• Long term data retention with Flash Backed Write Cache standard

Reduce initial setup time

• 95% reduction in parity initialization from several days to 5 hours**

*256KiB, Sequential write, RAID 5 with 15K SAS drives, performance will vary based on configuration

** HP R & D, Validation information TBD

External model with SAS cable connectors for extending the RAID set to JBODs

Increased performance, data availability and storage capacity


Thank you

[email protected]