Low latency & Mechanical Sympathy: Issues and solutions
Jean-Philippe BEMPEL @jpbempel
Performance Architect
http://jpbempel.blogspot.com
© ULLINK 2015 – Private & Confidential
UL BRIDGE: Low latency order router
December 16, 2014© ULLINK 2014 – Private & Confidential 2
• pure Java SE application
• FIX protocol / Direct Market Access connectivity (TCP)
• transform to an intermediate form (UL Message)
• process & route orders in less than 100us
Low Latency
• Process under 1ms
• No (Disk) I/O involved
• Hardware architecture has a major impact on performance
• Memory is the new disk
• Mechanical Sympathy
GC pauses
• Minor GC & Major GC are Stop the World (STW)
• Major: None during trading hours. Java heap sized accordingly
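Whether a major GC really stays out of trading hours can be checked from inside the JVM. The sketch below uses the standard `java.lang.management` API to print every collector's count and accumulated time. The class name `GcCounters` is an illustrative assumption, and collector names vary with the JVM and the GC algorithm selected, so none are hard-coded.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original deck.
final class GcCounters {

    /** One line per collector: name, number of collections, accumulated time in ms. */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    /** Sum of collection counts; a count of -1 means "undefined" and is skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        snapshot().forEach(System.out::println);
    }
}
```

Sampling `snapshot()` at market open and close, and comparing the old-generation collector's count, would show whether a major collection slipped into the trading session.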
GC pauses
• Minor: some tricks to reduce both frequency & pause time
  Frequency: increase the Young Gen size
  Minor pause time factors are:
  • Refs traversing
  • Object copy (survivor or tenured)
  • Card scanning
• Allocation should be controlled carefully
• Reduce frequency of minor GC occurrence
• Helps to achieve objectives in terms of percentiles above 99%
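The percentile objective can be made concrete with a small calculation over recorded latencies. This is a minimal sketch using the nearest-rank definition. The class name `LatencyPercentiles` and the sample values are illustrative assumptions, not measurements from the talk.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original deck.
final class LatencyPercentiles {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samples.clone();          // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];     // ranks are 1-based
    }

    public static void main(String[] args) {
        // Made-up latencies in microseconds: mostly fast, with two slow outliers
        // standing in for orders that were hit by a minor GC pause.
        long[] micros = new long[200];
        Arrays.fill(micros, 60);
        micros[17] = 3_000;
        micros[142] = 5_000;
        System.out.println("p50 = " + percentile(micros, 50) + "us");
        System.out.println("p99 = " + percentile(micros, 99) + "us");
        System.out.println("max = " + percentile(micros, 100) + "us");
    }
}
```

A rare pause barely moves the median but dominates the high percentiles, which is why the frequency of minor GCs matters for targets above 99%.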
GC pauses
Why not the CMS algorithm?
• Promotion allocation (free list) increases minor pause time
• Major GC fallback is unpredictable (fragmentation, promotion failure, …)
Azul?
• We are partners
• Best GC algorithm in the world
• Our tests show a maximum pause time of 2ms
• Execution performance not as good as HotSpot but ...
Power Management
• CPU embeds technologies to reduce power usage
• Frequency scaling during execution (P states)
  P0 = nominal frequency
• Deep sleep modes during idle phases (C states)
  C0 = Running, C1 = idle
• Latency to wake up (reported in microseconds):
  cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency
  200
• Few orders per day: the CPU is idle most of the time, so an incoming order often pays the wake-up latency
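The `cat` command above reads a single idle state. As a sketch, the same sysfs directory can be walked from Java to list every state with its exit latency. The class name `CpuIdleStates` is an illustrative assumption. The `name` and `latency` files are the ones Linux exposes under `cpuidle/stateN`, and the directory may be absent on machines without cpuidle support.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original deck.
final class CpuIdleStates {

    /** Maps each idle state directory (state0, state1, ...) to "<name> <exit latency>us". */
    static Map<String, String> read(Path cpuidleDir) {
        Map<String, String> states = new TreeMap<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(cpuidleDir, "state*")) {
            for (Path dir : dirs) {
                String name = new String(Files.readAllBytes(dir.resolve("name"))).trim();
                String latency = new String(Files.readAllBytes(dir.resolve("latency"))).trim();
                states.put(dir.getFileName().toString(), name + " " + latency + "us");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return states;
    }

    public static void main(String[] args) {
        Path dir = Paths.get("/sys/devices/system/cpu/cpu0/cpuidle");
        if (!Files.isDirectory(dir)) {
            System.out.println("cpuidle is not exposed on this machine");
            return;
        }
        read(dir).forEach((state, info) -> System.out.println(state + ": " + info));
    }
}
```

Comparing these exit latencies with the application's latency budget shows which idle states are too costly to allow.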
Power Management
• Disable those special states at BIOS level
• Some OS drivers are also very aggressive (intel_idle)
NUMA
• Non-Uniform Memory Access
• Starts from 2 CPUs (sockets)
• 1 memory controller per socket (node)
• Local memory vs remote memory
NUMA
• Avoid remote memory access
• Avoid scheduling and allocation on different nodes
• Bind the process to one CPU (socket) only (numactl)
  Restrictive but effective
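One way to apply the numactl binding is to wrap the JVM launch command. The sketch below only builds the command line. `--cpunodebind` and `--membind` are standard numactl options, while the class name `NumaLauncher`, the node number and `router.jar` are placeholders chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original deck.
final class NumaLauncher {

    /** Prefixes a command with numactl so that threads and memory stay on a single node. */
    static List<String> bindToNode(int node, List<String> command) {
        if (node < 0) {
            throw new IllegalArgumentException("node must be >= 0");
        }
        List<String> full = new ArrayList<>();
        full.add("numactl");
        full.add("--cpunodebind=" + node);   // run only on the cores of this node
        full.add("--membind=" + node);       // allocate only from this node's memory
        full.addAll(command);
        return full;
    }

    public static void main(String[] args) {
        // "router.jar" is a placeholder application name.
        List<String> cmd = bindToNode(0, Arrays.asList("java", "-jar", "router.jar"));
        System.out.println(String.join(" ", cmd));
        // To launch for real: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Binding both CPU and memory to the same node is what makes the approach restrictive: the process can no longer use the other socket's cores or RAM.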
Cache Pollution
• CPU L3 cache is shared across cores
• Non-critical threads can “pollute” the cache
  • eviction of data required by critical threads
  • => adds latency to re-access that data
• Need to isolate critical threads from non-critical ones
Cache Pollution
• OS scheduling cannot work in this case
• Need to manually pin threads to cores
• Thread affinity allows us to do this
• Identify the critical and non-critical threads in the application
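Java SE has no standard API for setting thread affinity, so the pinning itself is typically done with a native call such as `sched_setaffinity` or a tool such as `taskset`. What can be sketched in plain Java is a check of the result. On Linux each thread has a `status` file under `/proc/self/task/<tid>/` with `Name` and `Cpus_allowed_list` fields. The class name `ThreadAffinityReport` is an illustrative assumption, and whether `Name` matches the Java thread name depends on the JVM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original deck.
final class ThreadAffinityReport {

    private static final String NAME = "Name:";
    private static final String CPUS = "Cpus_allowed_list:";

    /** Maps "<tid> <thread name>" to the list of CPUs that thread may run on. */
    static Map<String, String> read(Path taskDir) {
        Map<String, String> report = new TreeMap<>();
        try (DirectoryStream<Path> tasks = Files.newDirectoryStream(taskDir)) {
            for (Path task : tasks) {
                String name = "?";
                String cpus = "?";
                // A thread that exits during the scan makes this read fail;
                // acceptable for a diagnostic sketch.
                for (String line : Files.readAllLines(task.resolve("status"))) {
                    if (line.startsWith(NAME)) {
                        name = line.substring(NAME.length()).trim();
                    } else if (line.startsWith(CPUS)) {
                        cpus = line.substring(CPUS.length()).trim();
                    }
                }
                report.put(task.getFileName() + " " + name, cpus);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return report;
    }

    public static void main(String[] args) {
        Path taskDir = Paths.get("/proc/self/task");
        if (!Files.isDirectory(taskDir)) {
            System.out.println("/proc/self/task is not available on this system");
            return;
        }
        read(taskDir).forEach((thread, cpus) -> System.out.println(thread + " -> " + cpus));
    }
}
```

After pinning, a critical thread should report a single core, and non-critical threads should report a set that excludes it.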
Cache Pollution
• Other processes can also pollute the L3 cache
• Need to isolate all processes away from critical threads
• The whole CPU needs to be dedicated
• No OS scheduling should be performed (isolcpus)
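`isolcpus` is a Linux kernel boot parameter, so whether it is active can be read back from `/proc/cmdline`. The sketch below extracts the isolated CPU ids. The class name `IsolatedCpus` is an illustrative assumption, and the parser handles only plain ids and `N-M` ranges, skipping any non-numeric flag entries; other list syntaxes the kernel accepts are out of scope.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of the original deck.
final class IsolatedCpus {

    private static final String PARAM = "isolcpus=";

    /** Returns the CPU ids listed in isolcpus=, or an empty set if the parameter is absent. */
    static SortedSet<Integer> parse(String cmdline) {
        SortedSet<Integer> cpus = new TreeSet<>();
        for (String token : cmdline.trim().split("\\s+")) {
            if (!token.startsWith(PARAM)) {
                continue;
            }
            for (String part : token.substring(PARAM.length()).split(",")) {
                if (part.isEmpty() || !Character.isDigit(part.charAt(0))) {
                    continue;                       // skip non-numeric entries
                }
                String[] range = part.split("-");   // "8" or "2-5"
                int first = Integer.parseInt(range[0]);
                int last = Integer.parseInt(range[range.length - 1]);
                for (int cpu = first; cpu <= last; cpu++) {
                    cpus.add(cpu);
                }
            }
        }
        return cpus;
    }

    public static void main(String[] args) throws IOException {
        Path cmdline = Paths.get("/proc/cmdline");
        if (!Files.isReadable(cmdline)) {
            System.out.println("/proc/cmdline is not available on this system");
            return;
        }
        System.out.println("isolated CPUs: " + parse(new String(Files.readAllBytes(cmdline))));
    }
}
```

An empty result on a production host would suggest the scheduler is still free to place other processes on the cores intended for critical threads.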
Conclusion
• You can optimize a process without changing your code
• Know your hardware and your OS
• This is Mechanical Sympathy