Applications of Computing in Industry: What is Low Latency All About?

eFX – January 2014

Divyakant Bengani

Undergrad degree in Management and IT from Manchester

Vice President at CS, responsible for eFX Core Technologies

Working in the banking industry since 2003 & CS for ~3 years

EFX - What do we do?

Cash FX Only
−Spot, Forwards and Swaps

Continuous Publication of Prices
−Streaming Executable Rates

Response to Request for Quotes

Acceptance and Booking of Trades

Key Statistics

~200 currency pairs (e.g. EURUSD, GBPJPY)
3 billion prices broadcast a day
60,000 trades a day
>200 client connections

Technologies Used

Java
C# for UIs
GWT for Web UIs
Oracle Coherence
Oracle DB
Derby DB
Azul Zing JVM
Low Latency FIX Engine

Protocols

Socket Connections
Asynchronous JMS
Java RMI
HTTP (JSON, Hessian)

Payloads

Google Protobuf
Fixed Length Byte Arrays
FIX - Industry Standard
JMS Map Messages
Java Serialization

EFX - Overall Architecture

Service Discovery

Zero Conf
Dynamically add and remove services
Applications do not need to know about each other - just pick up what’s advertised

Automated Testing

Code Quality Analysis

Continuous Integration

How to Achieve Low Latency

Daniel Nolan-Neylan

Graduated from UCL in 2004
Started working at Credit Suisse in 2006
−First, networking for 4 years
−Now, Application Developer in FX IT

Different projects:
−Distributed caching system for static data
−Simplified credit checking library
−Pricing and trading gateway (now team lead)

November 2011

Wait a second!

Reminder:

1 second is:
−1,000 milliseconds
−1,000,000 microseconds
−1,000,000,000 nanoseconds

Latency Numbers Every Programmer Should Know

Operation                               Time (ns)   Time (ms)   Comparison
L1 cache reference                            0.5
Branch mispredict                               5
L2 cache reference                              7               14x L1 cache
Mutex lock/unlock                              25
Main memory reference                         100               20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy                3,000
Send 1K bytes over 1 Gbps network          10,000   0.01
Read 4K randomly from SSD*                150,000   0.15
Read 1 MB sequentially from memory        250,000   0.25
Round trip within same datacenter         500,000   0.5
Read 1 MB sequentially from SSD*        1,000,000   1           4x memory
Disk seek                              10,000,000   10          20x datacenter round trip
Read 1 MB sequentially from disk       20,000,000   20          80x memory, 20x SSD
Send packet CA->Netherlands->CA       150,000,000   150

By Jeff Dean: http://research.google.com/people/jeff/

FX Trading – Latency Numbers

250ms – A human responding to price update
30ms – Bank accepting trade
10ms – Credit checking client
9ms – JVM Garbage Collecting
5ms – Persisting a trade to disk
2ms – JMS networking round-trip
1ms – Raw socket networking round-trip
0.5ms – Max wire-to-wire pricing latency
0.05ms – Min pricing latency
0.005ms – Writing price to FIX engine

Optimization Quotes

Michael A. Jackson: “The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.”

Rob Pike: “Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is.”

Where to Optimize? Use Profiler


Measuring Milliseconds and Nanoseconds in Java

Measure time taken for operations and log:
−System.currentTimeMillis()
  Good for taking a time/date that can be compared against other systems
  Accuracy depends on OS, but 1ms accuracy achievable on modern Unix-based OS (Linux)
  Bad if more precise measurements are required
−System.nanoTime()
  Good for sub-millisecond measurements
  Bad if comparable time with other systems required
−Realistically, need to use both
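
As a rough illustration of "use both", the sketch below stamps an operation with the wall clock (comparable across systems) and times it with the monotonic nanosecond counter. The class name TimedCall and its fields are hypothetical and not taken from the eFX code base.

```java
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical helper: pairs a wall-clock timestamp with a nanoTime duration. */
final class TimedCall<T> {
    final T result;
    final long wallClockMillis; // comparable with other systems, subject to clock sync
    final long elapsedNanos;    // only meaningful as a difference within this JVM

    private TimedCall(T result, long wallClockMillis, long elapsedNanos) {
        this.result = result;
        this.wallClockMillis = wallClockMillis;
        this.elapsedNanos = elapsedNanos;
    }

    static <T> TimedCall<T> measure(Supplier<T> operation) {
        long startedAtMillis = System.currentTimeMillis(); // "when did it happen?"
        long startNanos = System.nanoTime();               // "how long did it take?"
        T result = operation.get();
        long elapsedNanos = System.nanoTime() - startNanos;
        return new TimedCall<>(result, startedAtMillis, elapsedNanos);
    }

    public static void main(String[] args) {
        TimedCall<Integer> timed = TimedCall.measure(() -> 6 * 7);
        System.out.println("started at " + Instant.ofEpochMilli(timed.wallClockMillis)
                + ", took " + timed.elapsedNanos + " ns");
    }
}
```

The nanoTime value is never logged on its own, because its origin is arbitrary; only the difference between two readings is useful.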


Quote Journalling – log latency of every price


Our Soak Test Harness


…and the graphs it can produce

Removing Millisecond Delays

Identify the longest-running tasks
−Usually I/O delays
  Disk
  – Database activity
  – Synchronous logging
  – Writing files
  Network
  – Calling network services
  – Remote services far away (e.g. across the Atlantic, ~50ms)

Removing Millisecond Delays (2)

Analyze whether delays can be eliminated
−Disk
  Database activity -> Use a cache
  Synchronous logging -> Use asynchronous logging
  Writing files -> Use buffers and write asynchronously
−Network
  Calling network services -> Cache where possible
  Remote services far away -> Co-locate in same place
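
One way the "synchronous logging -> asynchronous logging" substitution could look is sketched below: the hot path only enqueues a line, and a background thread performs the slow write. AsyncLogger, its sink parameter and the drop-when-full policy are assumptions made for illustration; a production system would more likely rely on an existing logging framework's asynchronous appender.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch: callers enqueue, a background thread does the I/O. */
final class AsyncLogger implements AutoCloseable {
    // Distinct String instance, compared by identity, used as a stop marker.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue;
    private final Thread writer;

    AsyncLogger(int capacity, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.writer = new Thread(() -> {
            try {
                for (String line = queue.take(); line != STOP; line = queue.take()) {
                    sink.accept(line); // slow disk or network write happens here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-logger");
        writer.setDaemon(true);
        writer.start();
    }

    /** Never blocks the caller; returns false if the queue was full and the line was dropped. */
    boolean log(String line) {
        return queue.offer(line);
    }

    /** Lets already-queued lines be written, then stops the writer thread. */
    @Override
    public void close() {
        try {
            queue.put(STOP);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        try (AsyncLogger logger = new AsyncLogger(1024, System.out::println)) {
            logger.log("price published");
            logger.log("trade accepted");
        }
    }
}
```

Dropping lines when the queue is full keeps the pricing thread from ever waiting on the logger; whether losing log lines is acceptable is a design decision the slides do not address.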

FX Trading – RFQ Example

E.g. incoming request for a price, target response time is 10ms
−Need to:
  Validate request parameters
  Internally subscribe for prices
  Obtain a globally unique transaction ID
  Perform a credit check

How to get all this done in just 10ms?

FX Trading – RFQ Example (2)

Credit check
−Old one took 30-200ms
−New one takes 5-10ms
  Using caching and co-location
Parallelize all validation
Pre-cache prices
−by opening up price streams in advance of being required
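
"Parallelize all validation" can be sketched with CompletableFuture: independent steps start together, so the elapsed time is roughly that of the slowest step rather than the sum. The class RfqPreparation, its Result type and the supplier parameters are hypothetical stand-ins for the real validation, credit-check and ID services.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical sketch: run independent RFQ checks concurrently. */
final class RfqPreparation {

    static final class Result {
        final boolean paramsValid;
        final boolean creditOk;
        final String transactionId;

        Result(boolean paramsValid, boolean creditOk, String transactionId) {
            this.paramsValid = paramsValid;
            this.creditOk = creditOk;
            this.transactionId = transactionId;
        }

        boolean accepted() {
            return paramsValid && creditOk;
        }
    }

    static Result prepare(ExecutorService pool,
                          Supplier<Boolean> validateParams,
                          Supplier<Boolean> creditCheck,
                          Supplier<String> nextTransactionId) {
        // All three tasks are submitted before any result is awaited.
        CompletableFuture<Boolean> params = CompletableFuture.supplyAsync(validateParams, pool);
        CompletableFuture<Boolean> credit = CompletableFuture.supplyAsync(creditCheck, pool);
        CompletableFuture<String> txId = CompletableFuture.supplyAsync(nextTransactionId, pool);
        // join() waits for each task; the total wait is about the slowest of the three.
        return new Result(params.join(), credit.join(), txId.join());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            Result result = prepare(pool, () -> true, () -> true, () -> "TX-0001");
            System.out.println("accepted=" + result.accepted() + " id=" + result.transactionId);
        } finally {
            pool.shutdown();
        }
    }
}
```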

Don’t Optimize Too Soon

Remember:
−Only optimize what you need to optimize
−Remove longest delays first
  No point removing micros if you still have delays of millis or worse
−Always measure your operations carefully
  Determine what minimum, maximum, mean, standard deviation, and other percentiles are (99%, 99.9%, etc.)
−Watch for jitter and solve separately
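
The statistics listed above can be computed from recorded samples along the following lines. This is a minimal sketch: LatencyStats is a hypothetical name, percentiles use the nearest-rank definition, and the standard deviation is the population form; other definitions are equally valid.

```java
import java.util.Arrays;

/** Hypothetical sketch: summary statistics over recorded latency samples. */
final class LatencyStats {
    final long min;
    final long max;
    final double mean;
    final double stdDev; // population standard deviation
    private final long[] sorted;

    LatencyStats(long[] samplesNanos) {
        if (samplesNanos.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        sorted = samplesNanos.clone();
        Arrays.sort(sorted);
        min = sorted[0];
        max = sorted[sorted.length - 1];

        double sum = 0;
        for (long sample : sorted) {
            sum += sample;
        }
        mean = sum / sorted.length;

        double squaredDeviations = 0;
        for (long sample : sorted) {
            squaredDeviations += (sample - mean) * (sample - mean);
        }
        stdDev = Math.sqrt(squaredDeviations / sorted.length);
    }

    /** Nearest-rank percentile for 0 < p <= 100, e.g. percentile(99.9). */
    long percentile(double p) {
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        LatencyStats stats = new LatencyStats(new long[] {50_000, 52_000, 48_000, 9_000_000});
        System.out.println("min=" + stats.min + " max=" + stats.max + " mean=" + stats.mean
                + " p99=" + stats.percentile(99));
    }
}
```

In the example in main, a single 9ms sample dominates the mean, which is why the high percentiles and the maximum are reported alongside it.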

Removing Microsecond Delays

Intra-process delays
−Unbalanced / slow queues
−Slow algorithms
  Expensive loops repeated many times
  Poor use of object creation / memory allocation
  Contended memory controlled with locks
  Wasted effort calculating unwanted results

FX Trading – Pricing Example

Achieving wire-to-wire latencies of 50μs
−Google protobuf parsers replaced with versions that create little garbage
  Each GC stops the JVM for 9,000μs (i.e. 9ms)
−LMAX Disruptors used instead of queues
  Busy spin consumer threads / single-writer principle
−“PriceBigDecimal” class to replace Java BigDecimal class
  BigDecimal slow to instantiate and impossible to mutate
−No synchronous logging or network calls
−Pre-cache static data before starting price stream


Disruptor or Blocking Queues?
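
To make the comparison concrete, here is a deliberately simplified single-producer, single-consumer ring buffer with a busy-spinning consumer. It is not the LMAX Disruptor and omits most of what makes the Disruptor fast (padding against false sharing, batching, pre-allocated events); the class name and API are hypothetical. It only illustrates the two ideas named in the talk: one writer per variable, and spinning instead of blocking. Thread.onSpinWait() requires Java 9 or later.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: bounded ring buffer for exactly one producer and one consumer thread. */
final class SpscRingBuffer<T> {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next index to read; written only by the consumer
    private final AtomicLong tail = new AtomicLong(); // next index to write; written only by the producer

    SpscRingBuffer(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        slots = new Object[capacity];
        mask = capacity - 1;
    }

    /** Producer thread only. Returns false when the buffer is full. */
    boolean offer(T value) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false;
        }
        slots[(int) (t & mask)] = value;
        tail.set(t + 1); // volatile write publishes the slot to the consumer
        return true;
    }

    /** Consumer thread only. Returns null when the buffer is empty. */
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        int index = (int) (h & mask);
        T value = (T) slots[index];
        slots[index] = null;
        head.set(h + 1); // volatile write frees the slot for the producer
        return value;
    }

    /** Consumer thread only. Busy-spins until an item arrives instead of blocking. */
    T take() {
        T value;
        while ((value = poll()) == null) {
            Thread.onSpinWait();
        }
        return value;
    }

    public static void main(String[] args) {
        SpscRingBuffer<Integer> ring = new SpscRingBuffer<>(8);
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= 5; i++) {
                while (!ring.offer(i)) {
                    Thread.onSpinWait();
                }
            }
        });
        producer.start();
        int sum = 0;
        for (int i = 0; i < 5; i++) {
            sum += ring.take();
        }
        System.out.println("sum=" + sum);
    }
}
```

A busy-spinning consumer occupies a CPU core permanently, so this trade only makes sense where latency matters more than core count.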


Java BigDecimal or use Low Latency replacement?
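
The talk does not show how PriceBigDecimal is implemented. The sketch below is only one plausible shape for such a replacement, under the assumption that prices fit in a long mantissa with a decimal scale: the object is mutable, so one instance can be reused for every tick instead of allocating a new BigDecimal each time. The name MutablePrice and all of its methods are hypothetical.

```java
import java.math.BigDecimal;

/** Hypothetical sketch: reusable fixed-point price, value = unscaled / 10^scale. */
final class MutablePrice {
    private long unscaled; // e.g. 135791
    private int scale;     // e.g. 5, giving 1.35791

    /** Overwrites this price in place; no allocation. */
    MutablePrice set(long unscaled, int scale) {
        if (scale < 0) {
            throw new IllegalArgumentException("scale must not be negative");
        }
        this.unscaled = unscaled;
        this.scale = scale;
        return this;
    }

    /** Adds a delta expressed in the same scale, e.g. a spread in the smallest price increment. */
    MutablePrice addUnscaled(long delta) {
        unscaled = Math.addExact(unscaled, delta); // throws on overflow rather than wrapping
        return this;
    }

    long unscaled() {
        return unscaled;
    }

    int scale() {
        return scale;
    }

    /** Allocates; intended for cold paths such as logging or tests. */
    BigDecimal toBigDecimal() {
        return BigDecimal.valueOf(unscaled, scale);
    }

    public static void main(String[] args) {
        MutablePrice bid = new MutablePrice().set(135791, 5);
        bid.addUnscaled(2); // same object, now two increments higher
        System.out.println(bid.toBigDecimal());
    }
}
```

Mutability is the point and also the hazard: an instance must not be shared between threads or retained after it has been handed back for reuse.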

Removing Nanoseconds?

Use specialist hardware (such as FPGA)
Understand low-level CPU interconnectivity with memory, and how CPU caching works (including cache-lines)
http://mechanical-sympathy.blogspot.com
eFX – No need to pursue this level of performance at the moment

Latency vs Throughput

Latency - time taken (typically mean, percentile or worst case) to complete a task

Throughput – the number of tasks completed in a given time period (typically, per second)

Throughput is 1/latency (per pipeline)

Increasing Throughput

Identify delays
−Throughput constrained by latency
−Blocking I/O calls delay unprocessed messages

Data bursts
−What’s the peak throughput required?
−What’s the gap typically between bursts?

Techniques to Increase Throughput

Batching
−Sometimes high-latency calls are unavoidable
−Using batching can strip overhead of making call per transaction
−Cost of batching is the delay incurred waiting for new items to add to batch
−More difficult to accurately measure delay per item when multiple items are in a batch

FX Trading – Batching Example

Legacy global server in London
Regional trade acceptance components
Latency between New York and London - 50ms
Per thread: 1/0.05 = 20 trades per second max
How to increase?
−More threads
−Add batching per thread
Now, with batch size of 5, 100 trades per second per thread.
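
The arithmetic is: one blocking round trip of 0.05s allows 1/0.05 = 20 calls per second per thread, and each call carrying 5 trades gives 5/0.05 = 100 trades per second. A size-triggered batcher might be sketched as follows; TradeBatcher and its remoteCall parameter are hypothetical, and a real implementation would also flush on a timer so a partial batch does not wait indefinitely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: group trades so one remote round trip carries several of them. */
final class TradeBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> remoteCall; // stands in for one blocking round trip
    private List<T> pending = new ArrayList<>();

    TradeBatcher(int batchSize, Consumer<List<T>> remoteCall) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.remoteCall = remoteCall;
    }

    /** Queues a trade and sends the batch once it is full. Not thread-safe: one batcher per thread. */
    void submit(T trade) {
        pending.add(trade);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is pending, if anything. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        remoteCall.accept(batch);
    }

    /** Upper bound on items per second for one thread making blocking calls. */
    static double maxPerSecond(double roundTripSeconds, int batchSize) {
        return batchSize / roundTripSeconds;
    }

    public static void main(String[] args) {
        System.out.println("unbatched: " + maxPerSecond(0.05, 1) + " trades/s per thread");
        System.out.println("batch of 5: " + maxPerSecond(0.05, 5) + " trades/s per thread");
    }
}
```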

Techniques to Increase Throughput (2)

Use asynchronous callbacks
−Synchronous calls:
  boolean doCall()
  Wait for response
  Can be delayed for varying time
−Asynchronous calls:
  void doCall(Callback callback)
  Do not wait and keep processing more events
  Can additionally overlay timeouts to improve resilience
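
The asynchronous signature above, with a timeout overlaid, could be sketched as below. Only void doCall(Callback callback) comes from the slide; the Callback methods, the AsyncTradeVerifier class and the remoteCall supplier are assumptions for illustration. CompletableFuture.orTimeout requires Java 9 or later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch: non-blocking call that reports its outcome through a callback. */
final class AsyncTradeVerifier {

    interface Callback {
        void onResult(boolean accepted);

        void onFailure(Throwable cause); // includes TimeoutException when the timeout fires
    }

    private final Supplier<CompletableFuture<Boolean>> remoteCall; // stands in for the network request
    private final long timeoutMillis;

    AsyncTradeVerifier(Supplier<CompletableFuture<Boolean>> remoteCall, long timeoutMillis) {
        this.remoteCall = remoteCall;
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns immediately; the callback runs when the response, an error or the timeout arrives. */
    void doCall(Callback callback) {
        remoteCall.get()
                // Completes the future exceptionally if no response arrives in time.
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                .whenComplete((accepted, error) -> {
                    if (error != null) {
                        callback.onFailure(error);
                    } else {
                        callback.onResult(accepted);
                    }
                });
    }

    public static void main(String[] args) {
        AsyncTradeVerifier verifier =
                new AsyncTradeVerifier(() -> CompletableFuture.completedFuture(true), 50);
        verifier.doCall(new Callback() {
            @Override
            public void onResult(boolean accepted) {
                System.out.println("accepted=" + accepted);
            }

            @Override
            public void onFailure(Throwable cause) {
                System.out.println("failed: " + cause);
            }
        });
    }
}
```

The calling thread is free to process the next event as soon as doCall returns, which is what lifts the per-thread limit imposed by a blocking round trip.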

FX Trading – Asynchronous Callbacks

Submission of trade to price service for verification – was originally synchronous

Call blocks for 50ms – max 20 trades per second per thread
After converting to asynchronous callbacks, the only delay is putting packets on network buffer (μs), so effectively no delay – max number of trades is very high!

Q & A
