The High Performance Python Landscape by Ian Ozsvald


TRANSCRIPT

Page 1: The High Performance Python Landscape by Ian Ozsvald

www.morconsulting.com

The High Performance Python Landscape - profiling and fast calculation

Ian Ozsvald @IanOzsvald MorConsulting.com

Page 2: The High Performance Python Landscape by Ian Ozsvald

[email protected] @IanOzsvald PyDataLondon February 2014

What is “high performance”?

● Profiling to understand system behaviour
● We often ignore this step...

● Speeding up the bottleneck
● Keeps you on 1 machine (if possible)

● Keeping team speed high

Page 3: The High Performance Python Landscape by Ian Ozsvald


“High Performance Python”

• “Practical Performant Programming for Humans”
• Please join the mailing list via IanOzsvald.com

Page 4: The High Performance Python Landscape by Ian Ozsvald


cProfile
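A minimal sketch of driving cProfile from Python (the Julia-set entry point and file names below are illustrative, not from the slides); the same table comes from python -m cProfile -s cumulative julia.py on the command line:

import cProfile
import pstats

# Profile a hypothetical entry point and save the raw stats to disk.
cProfile.run("calc_pure_python(desired_width=1000, max_iterations=300)",
             "julia.prof")

# Sort by cumulative time and show the ten most expensive calls.
stats = pstats.Stats("julia.prof")
stats.sort_stats("cumulative").print_stats(10)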

Page 5: The High Performance Python Landscape by Ian Ozsvald


line_profiler

Line #      Hits         Time  Per Hit   % Time  Line Contents

==============================================================

     9                                           @profile

    10                                           def calculate_z_serial_purepython(

                                                      maxiter, zs, cs):

    12         1         6870   6870.0      0.0      output = [0] * len(zs)

    13   1000001       781959      0.8      0.8      for i in range(len(zs)):

    14   1000000       767224      0.8      0.8          n = 0

    15   1000000       843432      0.8      0.8          z = zs[i]

    16   1000000       786013      0.8      0.8          c = cs[i]

    17  34219980     36492596      1.1     36.2          while abs(z) < 2 

                                                               and n < maxiter:

    18  33219980     32869046      1.0     32.6              z = z * z + c

    19  33219980     27371730      0.8     27.2              n += 1

    20   1000000       890837      0.9      0.9          output[i] = n

    21         1            4      4.0      0.0      return output
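The table above is what kernprof -l -v julia.py prints once the function is decorated with @profile; a rough programmatic equivalent (assuming the same maxiter, zs and cs inputs as in the listing) is:

from line_profiler import LineProfiler

lp = LineProfiler()
lp.add_function(calculate_z_serial_purepython)   # the function shown above
# Run the call under the profiler, then print the per-line table.
lp.runcall(calculate_z_serial_purepython, maxiter, zs, cs)
lp.print_stats()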

Page 6: The High Performance Python Landscape by Ian Ozsvald


memory_profiler

Line #    Mem usage    Increment   Line Contents

================================================

     9   89.934 MiB    0.000 MiB   @profile

    10                             def calculate_z_serial_purepython(

                                                     maxiter, zs, cs):                                 

    12   97.566 MiB    7.633 MiB       output = [0] * len(zs)

    13  130.215 MiB   32.648 MiB       for i in range(len(zs)):

    14  130.215 MiB    0.000 MiB           n = 0

    15  130.215 MiB    0.000 MiB           z = zs[i]

    16  130.215 MiB    0.000 MiB           c = cs[i]

    17  130.215 MiB    0.000 MiB           while n < maxiter and abs(z) < 2:

    18  130.215 MiB    0.000 MiB               z = z * z + c

    19  130.215 MiB    0.000 MiB               n += 1

    20  130.215 MiB    0.000 MiB           output[i] = n

     21  122.582 MiB   -7.633 MiB       return output
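The table above comes from running python -m memory_profiler julia.py with the function decorated by @profile; memory_profiler also has a small Python API for one-off measurements (the call arguments below are assumed, not from the slides):

from memory_profiler import memory_usage

# Sample the process RSS every 0.1s while the call runs.
samples = memory_usage((calculate_z_serial_purepython, (maxiter, zs, cs)),
                       interval=0.1)
print("peak-to-start growth: %.1f MiB" % (max(samples) - min(samples)))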

Page 7: The High Performance Python Landscape by Ian Ozsvald


memory_profiler mprof
https://github.com/scikit-learn/scikit-learn/pull/2248
Before & After an improvement

Page 8: The High Performance Python Landscape by Ian Ozsvald


Transforming memory_profiler into a resource profiler?

Page 9: The High Performance Python Landscape by Ian Ozsvald


Profiling possibilities

● CPU (line by line or by function)
● Memory (line by line)
● Disk read/write (with some hacking)
● Network read/write (with some hacking)
● mmaps
● File handles
● Network connections
● Cache utilisation via libperf?
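Several of these are flagged as needing "some hacking"; one possible way to poll disk, file-handle, connection and mmap usage from inside a running process is psutil (an assumption here, not a tool named in the talk):

import psutil

proc = psutil.Process()                  # the current Python process

print(proc.io_counters())                # cumulative disk read/write bytes (Linux/Windows)
print(len(proc.open_files()))            # file handles open right now
print(len(proc.connections()))           # open network connections
print(len(proc.memory_maps()))           # mmap-ed regions (platform dependent)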

Page 10: The High Performance Python Landscape by Ian Ozsvald


Cython 0.20 (pyx annotations)

#cython: boundscheck=False

def calculate_z(int maxiter, zs, cs):

    """Calculate output list using Julia update rule"""

    cdef unsigned int i, n

    cdef double complex z, c

    output = [0] * len(zs)

    for i in range(len(zs)):

        n = 0

        z = zs[i]

        c = cs[i]

        while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:

            z = z * z + c

            n += 1

        output[i] = n

    return output

Pure CPython lists runtime 12s
Cython lists runtime 0.19s
Cython numpy runtime 0.16s
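As a reminder of the build step (file names illustrative, not from the slides): the .pyx above is typically compiled via a small setup.py, and annotate=True (or cython -a) produces the HTML annotation the slide title refers to:

# setup.py -- minimal build script for the Cython kernel above
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("cythonfn.pyx", annotate=True))
# build in place with:  python setup.py build_ext --inplace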

Page 11: The High Performance Python Landscape by Ian Ozsvald


Cython + numpy + OMP nogil

#cython: boundscheck=False

from cython.parallel import parallel, prange

import numpy as np

cimport numpy as np

def calculate_z(int maxiter, double complex[:] zs, double complex[:] cs):

    cdef unsigned int i, length, n

    cdef double complex z, c

    cdef int[:] output = np.empty(len(zs), dtype=np.int32)

    length = len(zs)

    with nogil, parallel():

        for i in prange(length, schedule="guided"):

            z = zs[i]

            c = cs[i]

            n = 0

            while n < maxiter and (z.real * z.real + z.imag * z.imag) < 4:

                z = z * z + c

                n = n + 1

            output[i] = n

    return output

Runtime 0.05s
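prange only runs in parallel when the extension is compiled with OpenMP enabled; a build sketch assuming gcc (flags differ for other compilers, module name illustrative):

# setup.py -- compile the nogil/prange version with OpenMP enabled
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension("cython_omp", ["cython_omp.pyx"],
                include_dirs=[np.get_include()],
                extra_compile_args=["-fopenmp"],
                extra_link_args=["-fopenmp"])

setup(ext_modules=cythonize([ext]))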

Page 12: The High Performance Python Landscape by Ian Ozsvald


ShedSkin 0.9.4 annotations

def calculate_z(maxiter, zs, cs):        # maxiter: [int], zs: [list(complex)], cs: [list(complex)]

    output = [0] * len(zs)               # [list(int)]

    for i in range(len(zs)):             # [__iter(int)]

        n = 0                            # [int]

        z = zs[i]                        # [complex]

        c = cs[i]                        # [complex]

        while n < maxiter and (… <4):    # [complex]

            z = z * z + c                # [complex]

            n += 1                       # [int]

        output[i] = n                    # [int]

    return output                        # [list(int)]

Couldn't we generate Cython pyx? Runtime 0.22s
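For reference, ShedSkin's usual extension-module workflow (file and module names illustrative): shedskin -e shedskin_julia.py generates C++ plus a Makefile, make builds it, and the result imports like any CPython module:

# after: shedskin -e shedskin_julia.py && make
import shedskin_julia                                  # name is illustrative

output = shedskin_julia.calculate_z(maxiter, zs, cs)   # same call as the pure-Python version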

Page 13: The High Performance Python Landscape by Ian Ozsvald


Pythran (0.40)

#pythran export calculate_z_serial_purepython(int, complex list, complex list)

def calculate_z_serial_purepython(maxiter, zs, cs):

 … 

Support for OpenMP on numpy arrays
Author Serge made an overnight fix – superb support!

List Runtime 0.4s

#pythran export calculate_z(int, complex[], complex[], int[])

… 

#omp parallel for schedule(dynamic)

OMP numpy Runtime 0.10s

Page 14: The High Performance Python Landscape by Ian Ozsvald


PyPy nightly (and numpypy)

● “It just works” on Python 2.7 code
● Clever list strategies (e.g. unboxed, uniform)
● Little support for pre-existing C extensions (e.g. the existing numpy)
● multiprocessing, IPython etc. all work fine
● Python list code runtime: 0.3s
● (pypy)numpy support is incomplete, bugs are being tackled (numpy runtime 5s [CPython+numpy 56s])

Page 15: The High Performance Python Landscape by Ian Ozsvald


Numba 0.12

from numba import jit

@jit(nopython=True)

def calculate_z_serial_purepython(maxiter, zs, cs, output):

    # couldn't create output, had to pass it in

    # output = numpy.zeros(len(zs), dtype=np.int32)

    for i in xrange(len(zs)):

        n = 0

        z = zs[i]

        c = cs[i]

        #while n < maxiter and abs(z) < 2:  # abs unrecognised

        while n < maxiter and z.real * z.real + z.imag * z.imag < 4:

            z = z * z + c

            n += 1

        output[i] = n

    #return output

Runtime 0.4s
Some Python 3 support, some GPU support
prange support missing (was in 0.11)?
0.12 introduces temporary limitations
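A caller sketch for the Numba 0.12 workaround above (array construction is illustrative): the int32 output array has to be allocated outside the nopython function and passed in:

import numpy as np

# zs and cs are assumed to already exist as Python lists of complex coordinates.
zs_arr = np.array(zs, dtype=np.complex128)
cs_arr = np.array(cs, dtype=np.complex128)
output = np.zeros(len(zs_arr), dtype=np.int32)   # pre-allocated result buffer

calculate_z_serial_purepython(maxiter, zs_arr, cs_arr, output)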

Page 16: The High Performance Python Landscape by Ian Ozsvald


Tool Tradeoffs

● PyPy: no learning curve (pure Py only), easy win?
● ShedSkin: easy (pure Py only) but fairly rare
● Cython pure Py: hours to learn – team cost low (and lots of online help)
● Cython numpy OMP: days+ to learn – heavy team cost?
● Numba/Pythran: hours to learn, install a bit tricky (Anaconda easiest for Numba)
● Pythran OMP: very impressive result for little effort
● Numba: big toolchain which might hurt productivity?
● (numexpr not covered – great for numpy and easy to use)
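For completeness, a tiny numexpr sketch (array names are illustrative, not from the slides); whole expressions are compiled and evaluated in one multi-threaded pass without Python-level temporaries:

import numexpr as ne
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)
# One fused, multi-threaded evaluation instead of several numpy temporaries.
c = ne.evaluate("a * b + 2.5 * a")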

Page 17: The High Performance Python Landscape by Ian Ozsvald


Wrap up

● Our profiling options should be richer
● 4-12 physical CPU cores are commonplace
● The cost of hand-annotating code is reduced agility
● JITs/AST compilers are getting fairly good; manual intervention still gives the best results

BUT! CONSIDER:
● Automation should (probably) be embraced ($CPUs < $humans) as team velocity is probably higher

Page 18: The High Performance Python Landscape by Ian Ozsvald


Thank You

• [email protected]
• @IanOzsvald

• MorConsulting.com

• Annotate.io

• GitHub/IanOzsvald