Crushing the Head of the Snake, by Robert Brewer. PyData SV 2014.
DESCRIPTION
Big Data brings with it particular challenges in any language, mostly in performance. This talk will explain how to get immediate speedups in your Python code by exploiting both timeless programming techniques and fixes specific to Python. We will cover:

I. Amongst Our Weaponry
   1. How to Time and Profile Python
   2. Extracting Loop Invariants: constants, lookup tables, even methods!
   3. Caching: memoization and heavier things

II. Gunfight at the O.K. Corral in Morse Code
   1. Python functions vs C functions
   2. Vector operations: NumPy
   3. Reducing calls: loops, generators, recursion

III. The Semaphore Version of Wuthering Heights
   1. Using select instead of Queue
   2. Serialization overhead
   3. Parallelizing work

TRANSCRIPT
Crushing the Head of the Snake
Robert Brewer, Chief Architect
Crunch.io
How to Time
from timeit import Timer
>>> range(5)
[0, 1, 2, 3, 4]
>>> t = Timer("range(a)", "a = 1000000")
>>> t.timeit(1)
0.028472900390625
>>> t.timeit(100)
1.8600409030914307
>>> t.timeit(1000)
18.056041955947876
Comparing algorithms
>>> Timer("range(1000)").timeit()    # number defaults to 1,000,000
11.392634868621826
>>> Timer("xrange(1000)").timeit()
0.20040297508239746
>>> Timer("list(xrange(1000))").timeit()
12.207480907440186
Caveat: Overhead
>>> Timer().timeit(1000000)    # an empty Timer times "pass": pure harness overhead
0.029289960861206055
Caveat: Wall time not CPU time
>>> Timer("xrange(1000)").timeit()
0.20040297508239746
>>> Timer("xrange(1000)").repeat(3)
[0.20735883712768555, 0.1968221664428711, 0.18882489204406738]

Take the minimum: the slower runs only measure interference from the OS and other processes.
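The pattern above generalizes to comparing any two candidates. A small Python 3 sketch (the talk uses Python 2; the two list-of-squares candidates here are illustrative) that runs each benchmark three times with repeat() and keeps the minimum:

```python
from timeit import Timer

# Two hypothetical candidates for building a list of squares. repeat()
# runs each full benchmark three times; take the minimum, since slower
# runs only add noise from the OS and other processes.
loop_times = Timer("[i * i for i in range(1000)]").repeat(repeat=3, number=1000)
map_times = Timer("list(map(lambda i: i * i, range(1000)))").repeat(
    repeat=3, number=1000)

best_loop = min(loop_times)
best_map = min(map_times)
```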
How to Profile
>>> import mod
>>> import cProfile
>>> cProfile.run("mod.b()", sort="cumulative")
How to Profile
>>> import mod
>>> import cProfile
>>> cProfile.run("mod.b()", sort="cumulative")

(make changes to the module)

>>> reload(mod)
>>> cProfile.run("mod.b()", sort="cumulative")
How to Profile
>>> cProfile.run("for i in xrange(3000): range(i).sort()",
...              sort="cumulative")
         6002 function calls in 0.093 seconds
Ordered by: cumulative time
 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.019    0.019    0.093    0.093 <string>:1(<module>)
   3000    0.052    0.000    0.052    0.000 {list.sort}
   3000    0.022    0.000    0.022    0.000 {range}
      1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
How to Profile
6002 function calls in 0.093 seconds
ncalls tottime percall cumtime percall filename:lineno(func)
   3000    0.052    0.000    0.052    0.000 {list.sort}
   3000    0.022    0.000    0.022    0.000 {range}
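The same workflow can be driven programmatically in Python 3, which is handy in scripts. A sketch (the workload function noisy is hypothetical) that captures the pstats report in a string instead of printing it:

```python
import cProfile
import io
import pstats

def noisy():
    # Hypothetical workload: build and sort many lists.
    for i in range(300):
        sorted(list(range(i)))

prof = cProfile.Profile()
prof.enable()
noisy()
prof.disable()

# Capture the report instead of writing it to stdout.
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```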
Example: Standard Deviation
>>> import numpy
>>> n = 100
>>> a = numpy.array(xrange(n), dtype=float)
>>> a.std(ddof=1)
29.011491975882016
Example: Standard Deviation
>>> n = 4000000000
>>> a = numpy.array(xrange(n), dtype=float)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
Example: Standard Deviation
>>> n = 4000000000
>>> arr = numpy.zeros(n, dtype=float)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
Example: Standard Deviation
Example: Standard Deviation
Given an array A broken into n parts a_1 ... a_n,
with local mean ā_i and local variance sum

    V(a_i) = Σ_j (a_ij - ā_i)²

the combined standard deviation is

    σ = √( Σ_{i=1..n} [ V(a_i) + 2(Σ_j a_ij)(ā_i - Ā) + |a_i|(Ā² - ā_i²) ] / (|A| - ddof) )

where Ā is the mean of the whole array A and |a_i| is the length of part i.
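The identity can be checked numerically. A Python 3 sketch (NumPy assumed available; the split into 10 parts is arbitrary) that merges per-part results with the formula above and compares against NumPy's single-pass answer:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=1000)
parts = np.array_split(A, 10)   # A broken into n = 10 parts

ddof = 1
g = A.mean()   # grand mean Ā (computable from the per-part totals alone)

final = 0.0
for p in parts:
    m = p.mean()                  # local mean ā_i
    T = p.sum()                   # local total Σ_j a_ij
    V = ((p - m) ** 2).sum()      # local variance sum V(a_i)
    # Per-part term from the formula above.
    final += V + 2 * T * (m - g) + (g ** 2 - m ** 2) * len(p)

merged = math.sqrt(final / (len(A) - ddof))
direct = A.std(ddof=1)   # single-pass answer for comparison
```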
Example: Standard Deviation
def run():
    points = 400000   # (add four zeros for the full 4-billion-row run)
    segments = 100
    part_len = points / segments

    partitions = []
    for p in range(segments):
        part = range(part_len * p,
                     part_len * (p + 1))
        partitions.append(part)

    return stddev(partitions, ddof=1)
Example: Standard Deviation
def stddev(partitions, ddof=0):
    final = 0.0
    for part in partitions:
        m = total(part) / length(part)

        # Find the mean of the entire group.
        gtotal = total([total(p) for p in partitions])
        glength = total([length(p) for p in partitions])
        g = gtotal / glength

        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj

    return math.sqrt(final / (glength - ddof))
Example: Standard Deviation
2052106 function calls in 71.025 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   71.023   71.023 stddev.py:39(run)
      1    0.006    0.006   71.013   71.013 stddev.py:22(stddev)
 410400   63.406    0.000   70.490    0.000 stddev.py:4(total)
    100    0.341    0.003   69.178    0.692 stddev.py:15(varsum)
 410601    7.076    0.000    7.076    0.000 {range}
 410200    0.151    0.000    0.174    0.000 stddev.py:11(length)
 820700    0.042    0.000    0.042    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
Example: Standard Deviation
400,000 points in 71.025 seconds
Assuming no other effects of scale, it will take 197.3 hours (over 8 days) to calculate our 4 billion-row array.
Example: Standard Deviation
Can we calculate our 4 billion-row array in
1 minute?
That’s 400,000 in 6 ms.
All we need is an 11,837.5x speedup.
Optimization
Example: Standard Deviation
2052106 function calls in 71.025 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   71.023   71.023 stddev.py:39(run)
      1    0.006    0.006   71.013   71.013 stddev.py:22(stddev)
 410400   63.406    0.000   70.490    0.000 stddev.py:4(total)
    100    0.341    0.003   69.178    0.692 stddev.py:15(varsum)
 410601    7.076    0.000    7.076    0.000 {range}
 410200    0.151    0.000    0.174    0.000 stddev.py:11(length)
 820700    0.042    0.000    0.042    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
Amongst Our Weaponry
Extracting loop invariants
Extracting Loop Invariants
def varsum(arr):
    vs = 0
    for j in range(len(arr)):
        mean = (total(arr) / length(arr))
        vs += (arr[j] - mean) ** 2
    return vs
Extracting Loop Invariants
def varsum(arr):
    vs = 0
    mean = (total(arr) / length(arr))
    for j in range(len(arr)):
        vs += (arr[j] - mean) ** 2
    return vs
Extracting Loop Invariants
52606 function calls in 1.944 seconds (36x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    1.942    1.942 stddev1.py:41(run)
      1    0.006    0.006    1.932    1.932 stddev1.py:23(stddev)
  10500    1.673    0.000    1.859    0.000 stddev1.py:4(total)
  10701    0.196    0.000    0.196    0.000 {range}
    100    0.062    0.001    0.081    0.001 stddev1.py:15(varsum)
  10300    0.003    0.000    0.003    0.000 stddev1.py:11(length)
  20900    0.001    0.000    0.001    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
still 5.4 hrs
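The effect is easy to reproduce at small scale. A Python 3 micro-benchmark (data and sizes are illustrative) timing the same varsum pattern with the mean recomputed on every pass versus hoisted out of the loop:

```python
from timeit import Timer

# Sum of squared deviations, two ways: with the loop-invariant mean
# recomputed inside the loop, and with it hoisted out.
setup = "data = list(range(1000))"

inside = Timer(
    "vs = 0\n"
    "for x in data:\n"
    "    m = sum(data) / len(data)\n"   # invariant recomputed every pass
    "    vs += (x - m) ** 2",
    setup,
).timeit(number=20)

hoisted = Timer(
    "m = sum(data) / len(data)\n"       # invariant computed once
    "vs = 0\n"
    "for x in data:\n"
    "    vs += (x - m) ** 2",
    setup,
).timeit(number=20)
```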
Extracting Loop Invariants
def stddev(partitions, ddof=0):
    final = 0.0
    for part in partitions:
        m = total(part) / length(part)

        # Find the mean of the entire group.
        gtotal = total([total(p) for p in partitions])
        glength = total([length(p) for p in partitions])
        g = gtotal / glength

        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj

    return math.sqrt(final / (glength - ddof))
Extracting Loop Invariants
def stddev(partitions, ddof=0):
    final = 0.0

    # Find the mean of the entire group.
    gtotal = total([total(p) for p in partitions])
    glength = total([length(p) for p in partitions])
    g = gtotal / glength

    for part in partitions:
        m = total(part) / length(part)
        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj

    return math.sqrt(final / (glength - ddof))
Extracting Loop Invariants
2512 function calls in 0.142 seconds (13x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.140    0.140 stddev1.py:42(run)
      1    0.000    0.000    0.136    0.136 stddev1.py:23(stddev)
    100    0.063    0.001    0.082    0.001 stddev1.py:15(varsum)
    402    0.064    0.000    0.071    0.000 stddev1.py:4(total)
    603    0.013    0.000    0.013    0.000 {range}
    400    0.000    0.000    0.000    0.000 stddev1.py:11(length)
    902    0.000    0.000    0.000    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
still 23 minutes
Amongst Our Weaponry
Use builtin Python functions whenever possible
Use Python Builtins
def total(arr):
    s = 0
    for j in range(len(arr)):
        s += arr[j]
    return s
Use Python Builtins
def total(arr):
    s = 0
    for j in range(len(arr)):
        s += arr[j]
    return s

def total(arr):
    return sum(arr)
Use Python Builtins
2110 function calls in 0.096 seconds (1.47x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.093    0.093 stddev1.py:39(run)
      1    0.000    0.000    0.083    0.083 stddev1.py:20(stddev)
    100    0.065    0.001    0.070    0.001 stddev1.py:12(varsum)
    402    0.000    0.000    0.015    0.000 stddev1.py:4(total)
    402    0.015    0.000    0.015    0.000 {sum}
    201    0.012    0.000    0.012    0.000 {range}
    400    0.000    0.000    0.000    0.000 stddev1.py:8(length)
    500    0.000    0.000    0.000    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
still 16 minutes
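The gap between the hand-written loop and the builtin is easy to measure directly. A Python 3 micro-benchmark (sizes are illustrative) comparing index-by-index accumulation, as in total() above, against the builtin sum():

```python
from timeit import Timer

setup = "data = list(range(10000))"

# Index-by-index accumulation in pure Python, as in the original total().
manual = Timer(
    "s = 0\n"
    "for j in range(len(data)):\n"
    "    s += data[j]",
    setup,
).timeit(number=200)

# The builtin iterates entirely in C.
builtin = Timer("sum(data)", setup).timeit(number=200)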
Use Python Builtins
def varsum(arr):
    vs = 0
    mean = (total(arr) / length(arr))
    for j in range(len(arr)):
        vs += (arr[j] - mean) ** 2
    return vs
Use Python Builtins
def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum((v - mean) ** 2
               for v in arr)
Use Python Builtins
402110 function calls in 0.122 seconds (1.27x slower)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.120    0.120 stddev.py:36(run)
      1    0.000    0.000    0.115    0.115 stddev.py:17(stddev)
    502    0.044    0.000    0.114    0.000 {sum}
    100    0.000    0.000    0.106    0.001 stddev.py:12(varsum)
 400100    0.070    0.000    0.070    0.000 stddev.py:14(genexpr)
    402    0.000    0.000    0.011    0.000 stddev.py:4(total)
…
Amongst Our Weaponry
Reduce function calls
Reduce Function Calls

>>> Timer("sum(a)", "a = range(10)").repeat(3)
[0.15801000595092773, 0.1406857967376709, 0.14577603340148926]
>>> Timer("total(a)",
...       "a = range(10); total = lambda x: sum(x)").repeat(3)
[0.2066800594329834, 0.1998300552368164, 0.21536493301391602]

The extra wrapper costs about 0.000000059 seconds (59 ns) per call.
Reduce Function Calls
def variances_squared(arr):
    mean = (total(arr) / length(arr))
    for v in arr:
        yield (v - mean) ** 2
Reduce Function Calls
def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum((v - mean) ** 2
               for v in arr)

def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum([(v - mean) ** 2 for v in arr])
Reduce Function Calls
2010 function calls in 0.082 seconds (1.17x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.080    0.080 stddev.py:36(run)
      1    0.000    0.000    0.071    0.071 stddev.py:17(stddev)
    100    0.050    0.001    0.056    0.001 stddev.py:12(varsum)
    502    0.020    0.000    0.020    0.000 {sum}
    402    0.000    0.000    0.016    0.000 stddev.py:4(total)
    101    0.009    0.000    0.009    0.000 {range}
    400    0.000    0.000    0.000    0.000 stddev.py:8(length)
    400    0.000    0.000    0.000    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
still 13+ minutes
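The generator-versus-list-comprehension trade can be timed directly. A Python 3 sketch (the squared-deviation expression and sizes are illustrative) comparing the two forms inside sum():

```python
from timeit import Timer

setup = "data = list(range(10000))"

# Generator expression: sum() resumes the generator once per element,
# paying a function-call-like cost each time.
gen = Timer("sum((x - 5) ** 2 for x in data)", setup).timeit(number=100)

# List comprehension: allocates the whole list up front, but iterates in
# a single frame, so the per-element overhead is often lower in CPython.
lst = Timer("sum([(x - 5) ** 2 for x in data])", setup).timeit(number=100)
```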
Amongst Our Weaponry
Vector operations with NumPy
Vector Operations
part = numpy.array(xrange(...), dtype=float)

def total(arr):
    return arr.sum()

def varsum(arr):
    return ((arr - arr.mean()) ** 2).sum()
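A self-contained Python 3 check of the same two vectorized helpers (the 4000-element array is an arbitrary stand-in for one partition), verifying them against NumPy's own var():

```python
import numpy as np

part = np.arange(4000, dtype=float)   # stand-in for one partition

def total(arr):
    return arr.sum()

def varsum(arr):
    # One vectorized pass in C instead of a Python-level loop per element.
    return ((arr - arr.mean()) ** 2).sum()

t = total(part)   # 0 + 1 + ... + 3999
v = varsum(part)  # should equal the population variance times the length
```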
Vector Operations
3408 function calls in 0.057 seconds (1.43x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.057    0.057 stddev1.py:37(run)
    200    0.051    0.000    0.051    0.000 {numpy...array}
      1    0.001    0.001    0.006    0.006 stddev1.py:18(stddev)
    500    0.003    0.000    0.003    0.000 {numpy.ufunc.reduce}
    100    0.001    0.000    0.003    0.000 stddev1.py:14(varsum)
    400    0.000    0.000    0.003    0.000 {numpy.ndarray.sum}
    300    0.000    0.000    0.002    0.000 stddev1.py:6(total)
    100    0.000    0.000    0.001    0.000 {numpy.ndarray.mean}
…
still 9.5 minutes
Vector Operations
3408 function calls in 0.006 seconds (13.6x, excluding array creation)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.001    0.001    0.006    0.006 stddev1.py:18(stddev)
    500    0.003    0.000    0.003    0.000 {numpy.ufunc.reduce}
    100    0.001    0.000    0.003    0.000 stddev1.py:14(varsum)
    400    0.000    0.000    0.003    0.000 {numpy.ndarray.sum}
    300    0.000    0.000    0.002    0.000 stddev1.py:6(total)
    100    0.000    0.000    0.001    0.000 {numpy.ndarray.mean}
…
should be exactly 1 minute
Vector Operations
Let’s try 4 billion!
Bump up that N...
Vector Operations
MemoryError
Oh, yeah...
Amongst Our Weaponry
Parallelization with multiprocessing
Parallelization
from multiprocessing import Pool

def run():
    results = Pool().map(
        run_one, range(segments))
    result = stddev(results)
    return result
Parallelization
def run_one(i):
    p = numpy.memmap(
        'stddev.%d' % i, dtype=float,
        mode='r', shape=(part_len,))

    T, L = p.sum(), float(len(p))
    m = T / L
    V = ((p - m) ** 2).sum()
    return T, L, V
Parallelization
def stddev(TLVs, ddof=0):
    final = 0.0

    totals = [T for T, L, V in TLVs]
    lengths = [L for T, L, V in TLVs]
    glength = sum(lengths)
    g = sum(totals) / glength

    for T, L, V in TLVs:
        m = T / L
        adj = ((2 * T * (m - g)) +
               ((g ** 2 - m ** 2) * L))
        final += V + adj

    return math.sqrt(final / (glength - ddof))
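The merge step is worth verifying on its own, independent of the process pool. A Python 3 sketch (the workers are simulated serially here so the merge logic can be checked in isolation; data and partition count are illustrative) that feeds per-partition (T, L, V) triples to the same stddev() and compares against a single-pass result:

```python
import math
import numpy as np

def stddev(TLVs, ddof=0):
    # Merge the (total, length, varsum) triples the workers return into
    # a single standard deviation, using the adjustment from the talk.
    glength = sum(L for T, L, V in TLVs)
    g = sum(T for T, L, V in TLVs) / glength
    final = 0.0
    for T, L, V in TLVs:
        m = T / L
        final += V + (2 * T * (m - g)) + ((g ** 2 - m ** 2) * L)
    return math.sqrt(final / (glength - ddof))

# Stand-ins for the pool workers: compute each partition's triple serially.
rng = np.random.default_rng(1)
data = rng.normal(size=100_000)
tlvs = [(p.sum(), float(len(p)), ((p - p.mean()) ** 2).sum())
        for p in np.array_split(data, 8)]

merged = stddev(tlvs, ddof=1)
expected = np.std(data, ddof=1)   # single-pass answer for comparison
```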
Parallelization
3734 function calls in 0.024 seconds (6x slower)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.024    0.024 stddev.py:47(run)
      4    0.000    0.000    0.011    0.003 threading.py:234(wait)
     22    0.011    0.000    0.011    0.000 {thread.lock.acquire}
      1    0.000    0.000    0.011    0.011 pool.py:222(map)
      1    0.000    0.000    0.008    0.008 pool.py:113(__init__)
      4    0.001    0.000    0.005    0.001 process.py:116(start)
      1    0.003    0.003    0.005    0.005 stddev.py:11(stddev)
      4    0.000    0.000    0.004    0.001 forking.py:115(init)
      4    0.003    0.001    0.003    0.001 {posix.fork}
...
Parallelization
Could that waiting be insignificant when we scale up to 4 billion?
Let’s try it!
Parallelization
3766 function calls in 67.811 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   67.811   67.811 stddev.py:47(run)
      4    0.000    0.000   67.747   16.930 threading.py:234(wait)
     22   67.747    3.079   67.747    3.079 {thread.lock.acquire}
      1    0.000    0.000   67.747   67.747 pool.py:222(map)
      1    0.000    0.000    0.062    0.060 pool.py:113(__init__)
      4    0.000    0.000    0.058    0.014 process.py:116(start)
      4    0.057    0.014    0.057    0.014 {posix.fork}
      1    0.003    0.003    0.005    0.005 stddev.py:11(stddev)
      2    0.002    0.001    0.002    0.001 {sum}
SO CLOSE! 1.13 minutes
Parallelization
def run_one(i):
    if i == 50:
        cProfile.runctx(..., "prf.50")

>>> import pstats
>>> s = pstats.Stats("prf.50")
>>> s.sort_stats("cumulative")
<pstats.Stats instance at 0x2bddcb0>
>>> _.print_stats()
Parallelization
57 function calls in 2.804 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.431    0.431    2.791    2.791 stddev.py:43(run_one)
      2    0.000    0.000    2.360    1.180 numpy.ndarray.sum
      2    2.360    1.180    2.360    1.180 numpy.ufunc.reduce
      1    0.000    0.000    0.000    0.000 memmap.py:195(__new__)
Parallelization
def run_one(i):
    p = numpy.memmap(
        'stddev.%d' % i, dtype=float,
        mode='r', shape=(part_len,))

    T, L = p.sum(), float(len(p))
    m = T / L
    V = ((p - m) ** 2).sum()
    return T, L, V

Roughly 2 seconds per segment, times 100 segments:
200 seconds / 4 cores = 50 seconds per core.
Parallelization? Serialization!
67.8 seconds for 4 billion rows, but 50 of those are loading data! Only 17.8 seconds go to the actual math.
Serialization
import bloscpack as bp
bargs = bp.args.DEFAULT_BLOSC_ARGS
bargs['clevel'] = 6

bp.pack_ndarray_file(
    part, fname, blosc_args=bargs)

part = bp.unpack_ndarray_file(fname)
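bloscpack is a third-party library, but the underlying trade-off (CPU spent compressing versus bytes moved to and from disk) can be sketched with the standard library alone. A Python 3 stand-in using zlib at the same compression level on a NumPy array:

```python
import zlib
import numpy as np

# Stand-in for the blosc round trip above, using stdlib zlib instead.
part = np.arange(100_000, dtype=float)

raw = part.tobytes()
packed = zlib.compress(raw, level=6)   # analogous to clevel=6 above

# Round trip: the reader reconstructs an identical array.
restored = np.frombuffer(zlib.decompress(packed), dtype=float)
ratio = len(raw) / len(packed)         # >1 means fewer bytes hit the disk
```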
Serialization
Let’s try it!
I Crush Your Head!
I Crush Your Head!
1153 function calls in 26.166 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   26.166   26.166 stddev_bp.py:56(run)
      4    0.000    0.000   26.134     6.53 threading.py:234(wait)
     22   26.134    1.188   26.134    1.188 {thread.lock.acquire}
      1    0.000    0.000   26.133   26.133 pool.py:222(map)
      1    0.000    0.000   26.133   26.133 pool.py:521(get)
      1    0.000    0.000   26.133   26.133 pool.py:513(wait)
      1    0.003    0.003    0.030    0.030 __init__.py:227(Pool)
      1    0.000    0.000    0.021    0.021 pool.py:113(__init__)
I Crush Your Head!
With some time-tested general programming techniques:
Extract loop invariants
Use language builtins
Reduce function calls
I Crush Your Head!
And some Python libraries for architectural improvements:
Use NumPy for vector ops
Use multiprocessing for parallelization
Use bloscpack for compression
I Crush Your Head!
We sped up our calculation so that it runs in:
0.003% of the time
or 27317 times faster
4.4 orders of magnitude