high performance python - marc garcia

High Performance Python, Marc Garcia, February 19, 2015, Barcelona Python Meetup

Upload: marc-garcia

Post on 15-Jul-2015


TRANSCRIPT

High Performance Python

Marc Garcia

February 19, 2015

Barcelona Python Meetup

1 / 31High Performance Python - Marc Garcia - Barcelona Python Meetup


Overview

1 Warm up example

2 Some theory

3 Profiling

4 Speeders

5 Summary


Warm up example


Can we optimize this?

    def list_numbers(until):
        '''Returns a string representing the sequence of numbers from 1 to 'until'

        >>> list_numbers(10)
        '1, 2, 3, 4, 5, 6, 7, 8, 9, 10'
        '''
        num_list = []
        for i in range(until):
            num_list.append(str(i+1))
        return ', '.join(num_list)

    %timeit _ = list_numbers(int(1e6))
    1 loops, best of 3: 461 ms per loop

First, without using a list comprehension


Some tricks...

    def list_numbers_opt(until):
        '''Returns a string representing the sequence of numbers from 1 to 'until'

        >>> list_numbers_opt(10)
        '1, 2, 3, 4, 5, 6, 7, 8, 9, 10'
        '''
        num_list = []
        local_str = str  # <- first variable lookup is local (avoiding fallback)
        num_list__append = num_list.append  # <- avoiding attribute lookup
        for i in range(1, until+1):
            num_list__append(local_str(i))  # <- avoiding the addition in the loop
        return ', '.join(num_list)

    %timeit _ = list_numbers_opt(int(1e6))
    1 loops, best of 3: 323 ms per loop


With a list comprehension

    def list_numbers_comprehension(until):
        '''Returns a string representing the sequence of numbers from 1 to 'until'

        >>> list_numbers_comprehension(10)
        '1, 2, 3, 4, 5, 6, 7, 8, 9, 10'
        '''
        local_str = str
        return ', '.join([local_str(num) for num in range(1, until+1)])

    %timeit _ = list_numbers_comprehension(int(1e6))
    1 loops, best of 3: 311 ms per loop


With map function

    def list_numbers_map(until):
        '''Returns a string representing the sequence of numbers from 1 to 'until'

        >>> list_numbers_map(10)
        '1, 2, 3, 4, 5, 6, 7, 8, 9, 10'
        '''
        return ', '.join(map(str, range(1, until+1)))

    %timeit _ = list_numbers_map(int(1e6))
    1 loops, best of 3: 274 ms per loop


Comparison

Approach              Time (absolute)   Time (relative)
Not optimized         461 ms            1.68
Optimized             323 ms            1.18
List comprehension    311 ms            1.14
Map function          274 ms            1.00
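Outside IPython, a comparison like this can be reproduced with the standard-library timeit module. The sketch below is not from the talk: it redefines shortened versions of two of the functions, uses a smaller input so it runs quickly, and the helper best_time is a hypothetical name. Absolute numbers will depend on the machine and the Python version.

```python
import timeit


def list_numbers(until):
    """Baseline: explicit loop with append."""
    num_list = []
    for i in range(until):
        num_list.append(str(i + 1))
    return ', '.join(num_list)


def list_numbers_map(until):
    """Variant that pushes the loop into map() and str.join()."""
    return ', '.join(map(str, range(1, until + 1)))


def best_time(func, until=100_000, repeat=3):
    """Best wall-clock time, in seconds, of `repeat` single calls."""
    return min(timeit.repeat(lambda: func(until), number=1, repeat=repeat))


# Both variants must agree before their timings are worth comparing.
assert list_numbers(10) == list_numbers_map(10)

timings = {f.__name__: best_time(f) for f in (list_numbers, list_numbers_map)}
fastest = min(timings.values())
for name, seconds in timings.items():
    # Relative time: 1.00 marks the fastest approach.
    print('%-20s %8.1f ms  %.2f' % (name, seconds * 1e3, seconds / fastest))
```

Taking the minimum of several repeats mirrors the "best of 3" that %timeit reports.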


Some theory


Types of optimizations

CPU bound
    Better algorithms
    Minimization of in-loop tasks†
    Better compilation / low-level optimizations†

I/O bound
    I/O (disk, network, etc.) access optimization
    Compression
    Multithreading†

Memory bound
    Memory access optimization / Use of caches†
    Compression

Programmer bound
    Code readability, styles, etc.
    Use of libraries


Low-level optimizations

In Python, we do not want to implement low-level optimizations ourselves. But we can profit from the ones that already exist in libraries.

Write your program so it can be optimized

Vectorization (avoid loops)

map instead of list comprehensions

Objects or dicts?

Generators instead of lists
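As an illustration of the last item, the sketch below compares a list with an equivalent generator expression. It is not from the talk; the variable names are hypothetical, and sys.getsizeof counts only the container itself, not the integers it refers to.

```python
import sys

n = 1_000_000

squares_list = [i * i for i in range(n)]   # materializes every element up front
squares_gen = (i * i for i in range(n))    # produces elements on demand

# The list stores n references; the generator is a small fixed-size object.
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size, gen_size)

# Both can feed a consumer such as sum() and give the same result.
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
```

A generator can only be consumed once, so this trade-off fits pipelines that pass over the data a single time.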


Multithreading

GIL (Global Interpreter Lock) [1]: threads cannot run Python code on multiple cores at once

It is released only for:

I/O operations

numpy operations [2]

So with threads it is only possible to parallelize these operations

[1] https://wiki.python.org/moin/GlobalInterpreterLock
[2] http://wiki.scipy.org/ParallelProgramming
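A minimal sketch of the I/O case, assuming an I/O-bound task simulated with time.sleep (which releases the GIL while waiting). The function fake_io and its delay are hypothetical and not from the talk.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(task_id, delay=0.05):
    """Hypothetical I/O-bound task; time.sleep releases the GIL while waiting."""
    time.sleep(delay)
    return task_id


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, range(4)))
elapsed = time.perf_counter() - start

# The four waits overlap, so the total should be close to one delay, not four.
print(results, round(elapsed, 2))
```

Replacing the sleep with a pure-Python computation would remove most of the benefit, because the threads would then compete for the GIL.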


Memory performance (I)

[Figure. Source: http://www.edn.com/Home/PrintView?contentItemId=4397051]


Memory performance (II)

[Figure. Source: https://dl.dropboxusercontent.com/u/3967849/sfmu/pub/index.html]


Memory performance (III)

Optimal use of CPU cache when possible (Numexpr [1])

    Reusing data
    Sequential data

Preallocate (Zero Buffer [2])

    while True:
        data = os.read(fd, 1024)  # os.read allocates memory on every call
        if not data:              # empty result: end of file
            break
        print(data.lstrip())

[1] https://github.com/pydata/numexpr
[2] http://zero-buffer.readthedocs.org/en/latest/
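The preallocation idea can also be sketched with the standard library alone. The example below does not use Zero Buffer; it swaps in a bytearray that is allocated once and refilled with readinto, and the temporary file exists only to make the sketch self-contained.

```python
import os
import tempfile

# Hypothetical input file, created only so that the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'x' * 5000)
    path = tmp.name

buf = bytearray(1024)      # allocated once, reused for every read
view = memoryview(buf)     # slicing a memoryview does not copy the data
total = 0

with open(path, 'rb', buffering=0) as f:
    while True:
        n = f.readinto(buf)        # fills buf in place, returns bytes read
        if not n:                  # 0 means end of file
            break
        chunk = view[:n]           # the valid part of the buffer, no copy
        total += len(chunk)

os.remove(path)
print(total)
```

Unlike the os.read loop, no new bytes object is created per iteration; the cost is that each chunk must be processed before the next read overwrites the buffer.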


Profiling


Profiling basics

Use %timeit to compare different implementations, and %lprun to profile the most expensive part of your program line by line.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def foo(n):
     2         1            3      3.0      0.0      phrase = 'repeat me'
     3         1          185    185.0      0.1      pmul = phrase * n
     4    100001        97590      1.0     32.4      pjoi = ''.join([phrase for x in xrange(n)])
     5         1            4      4.0      0.0      pinc = ''
     6    100001        90133      0.9     29.9      for x in xrange(n):
     7    100000       112935      1.1     37.5          pinc += phrase
     8         1          182    182.0      0.1      del pmul, pjoi, pinc

Measurements may vary when CPU or RAM usage is high.
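%timeit and %lprun are IPython magics (%lprun comes from the line_profiler extension). Where they are not available, the standard-library cProfile module gives a coarser, per-function view. The sketch below is an assumption-laden stand-in, not the talk's method: it splits a function similar to foo into two hypothetical helpers so that each strategy shows up as its own row.

```python
import cProfile
import io
import pstats


def build_by_join(phrase, n):
    return ''.join([phrase for _ in range(n)])


def build_by_concat(phrase, n):
    result = ''
    for _ in range(n):
        result += phrase
    return result


def foo(n):
    phrase = 'repeat me'
    return build_by_join(phrase, n), build_by_concat(phrase, n)


profiler = cProfile.Profile()
profiler.enable()
joined, concatenated = foo(100000)
profiler.disable()

# Report the profiled functions sorted by cumulative time spent in them.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
report = stream.getvalue()
print(report)
```

The profiler adds overhead of its own, so the report is better read as a ranking of hot spots than as exact timings.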


Speeders


What are speeders?

Just-in-time (JIT) compilers and others.

Dynamic languages are slower by design; making them somewhat static improves performance.

a + b

1 Get a and b from memory

2 Get the types of a and b from memory

3 Lookup of add method

4 Allocate memory for the result

5 Store the result in memory

Consider the case where a and b are integers inside a loop executed a million times. There is a huge cost that can be avoided.
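The run-time lookup in steps 2 and 3 can be made visible from Python itself. This is a rough illustration rather than a description of the interpreter's internals; the class Meters is hypothetical.

```python
a, b = 2, 3

# Steps 2-3: the operand type is inspected at run time and its addition
# method is looked up; for two ints this is roughly equivalent to:
looked_up = type(a).__add__
result = looked_up(a, b)
assert result == a + b

# The same expression dispatches differently when the types change,
# which is why it cannot be resolved once ahead of time.
assert type('x').__add__('x', 'y') == 'x' + 'y'


class Meters:
    """Hypothetical user-defined type with its own addition."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return Meters(self.value + other.value)   # step 4: a new object is allocated


total = Meters(1) + Meters(2)
print(result, total.value)
```

A JIT compiler that observes a and b are always integers can skip the lookup and emit a plain machine addition instead.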


PyPy

No modification in the code should be required, only changing the interpreter

C extensions need to be recompiled (in some cases modified)

Minor compatibility issues (e.g. a __del__ method can't be added to a class after it has already been created)

python myscript.py

pypy myscript.py


Numba

Uses a decorator

Built using the LLVM compiler framework

Caches compiled code, so later executions are faster

Some limitations (at the time of the talk, February 2015):

    Generators not supported
    Nested functions not supported
    Default arguments not supported
    "is not" operator not supported
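A minimal sketch of the decorator usage, assuming Numba is installed. The fallback jit defined in the except branch is a hypothetical stand-in so that the example still runs (uncompiled) without Numba; sum_of_squares is an illustrative function, not from the talk.

```python
try:
    from numba import jit            # real Numba decorator
except ImportError:                  # fallback so the sketch runs without Numba
    def jit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@jit(nopython=True)
def sum_of_squares(n):
    """Plain numeric loop: the kind of code a JIT compiler handles well."""
    total = 0
    for i in range(n):
        total += i * i
    return total


first = sum_of_squares(1000)    # with Numba, compilation happens on this call
second = sum_of_squares(1000)   # later calls reuse the compiled machine code
print(first, second)
```

When timing such a function, the first call should be measured separately, since it includes the compilation cost.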


C extensions

Implementation is in C, so development time is much higher

Compilation of the extension is required

Overhead due to moving data from Python to C and C to Python

Minimize this by making as few calls with as much data as possible
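The effect of call granularity can be sketched without writing any C. In the example below, math.fsum (implemented in C in CPython) stands in for a hypothetical extension function; the two wrappers and their names are illustrative assumptions.

```python
import math
import timeit

values = [float(i) for i in range(100_000)]


def many_small_calls(data):
    """Calls into C once per element, passing a tiny amount of data each time."""
    total = 0.0
    for x in data:
        total = math.fsum((total, x))
    return total


def one_big_call(data):
    """Calls into C once, passing all the data together."""
    return math.fsum(data)


# Same result either way; only the number of Python-to-C crossings differs.
assert many_small_calls(values) == one_big_call(values)

for func in (many_small_calls, one_big_call):
    seconds = min(timeit.repeat(lambda: func(values), number=1, repeat=3))
    print('%-18s %.2f ms' % (func.__name__, seconds * 1e3))
```

The same reasoning applies to a hand-written extension: an API that accepts a whole buffer or array tends to amortize the conversion overhead far better than one called per item.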


Cython (I)

Cython != CPython

Optimising static compiler: Allows us to write C extensions, but in Python

Main difference with Python code is that types are declared

Cython files need to be compiled:

    from setuptools import setup
    from Cython.Build import cythonize

    setup(
        name='Hello world app',
        ext_modules=cythonize("hello.pyx"),
    )


Cython (II)

    # Pure Python
    def f(x):
        return x**2 - x

    def integrate_f(a, b, N):
        s = 0
        dx = (b - a) / N
        for i in range(N):
            s += f(a + i * dx)
        return s * dx

    # Cython
    def f(double x):
        return x**2 - x

    def integrate_f(double a, double b, int N):
        cdef int i
        cdef double s, dx
        s = 0
        dx = (b - a) / N
        for i in range(N):
            s += f(a + i * dx)
        return s * dx
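As a quick check of what integrate_f computes, the pure-Python version can be compared with the exact integral of x**2 - x over [0, 1], which is 1/3 - 1/2. The interval and N below are arbitrary choices for illustration, not values from the talk.

```python
def f(x):
    return x**2 - x


def integrate_f(a, b, N):
    """Left Riemann sum of f over [a, b] using N rectangles."""
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


approx = integrate_f(0.0, 1.0, 1000)
exact = 1 / 3 - 1 / 2      # integral of x**2 - x over [0, 1]
print(approx, exact)
```

The typed Cython version is expected to return the same value; only the per-iteration cost of the loop changes.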


Numexpr

Executes code optimizing memory and cache usage

Works well with numpy

Numexpr gets the code as a string

Performance may improve one order of magnitude

    >>> import numpy as np
    >>> import numexpr as ne

    >>> a = np.arange(1e6)  # Choose large arrays for better speedups

    >>> ne.evaluate("a + 1")  # a simple expression
    array([  1.00000000e+00,   2.00000000e+00,   3.00000000e+00, ...,
             9.99998000e+05,   9.99999000e+05,   1.00000000e+06])


GPU

Performance can increase by up to two orders of magnitude when using a GPU

Extra hardware is required

Implementation requires the use of parallel programming techniques

Libraries can be used: PyCUDA, NumbaPro


Performance comparison

Do not try and find the winner. That's impossible.
Instead... only try to realize the truth.

There is no winner.
Look for the solution that works best for your problem.


Summary


Takeaways

"Premature optimization is the root of all evil". Donald Knuth

Optimize only when necessary, and only the bottlenecks

Vectorize and avoid loops (and focus on them when they are required)

Write your programs so they can be optimized

Mostly using libraries

In many cases, the bottleneck will be memory or I/O

Static typing improves performance (numpy, Cython, etc.)


Useful links (I)

Talks

Raymond Hettinger talk
https://vimeo.com/114368783 (starting at 21:30)
http://bit.ly/python-sfmu

It's the memory stupid - Francesc Alted
http://www.slideshare.net/BigDataSpain/francesc-alted-how-i-learned-to-stop-worrying-about-cpu-speed

Fast Python, Slow Python - Alex Gaynor
https://www.youtube.com/watch?v=7eeEf_rAJds

Twitter

https://twitter.com/raymondh
https://twitter.com/ContinuumIO
https://twitter.com/FrancescAlted


Useful links (II)

Speeders

http://pypy.org/
http://numba.pydata.org/
http://cython.org/
https://github.com/pydata/numexpr
http://docs.continuum.io/numbapro/

Memory performance

http://queue.acm.org/detail.cfm?id=2513149

Profiling

http://www.huyng.com/posts/python-performance-analysis/
