Crushing the Head of the Snake, by Robert Brewer. PyData SV 2014.
DESCRIPTION
Big Data brings with it particular challenges in any language, mostly in performance. This talk will explain how to get immediate speedups in your Python code by exploiting both timeless programming techniques and fixes specific to Python. We will cover:

I. Amongst Our Weaponry
   1. How to Time and Profile Python
   2. Extracting Loop Invariants: constants, lookup tables, even methods!
   3. Caching: memoization and heavier things

II. Gunfight at the O.K. Corral in Morse Code
   1. Python functions vs C functions
   2. Vector operations: NumPy
   3. Reducing calls: loops, generators, recursion

III. The Semaphore Version of Wuthering Heights
   1. Using select instead of Queue
   2. Serialization overhead
   3. Parallelizing work

TRANSCRIPT
Crushing the Head of the Snake
Robert Brewer, Chief Architect
Crunch.io
How to Time
from timeit import Timer
>>> range(5)
[0, 1, 2, 3, 4]
>>> t = Timer("range(a)", "a = 1000000")
>>> t.timeit(1)
0.028472900390625
>>> t.timeit(100)
1.8600409030914307
>>> t.timeit(1000)
18.056041955947876
Comparing algorithms
>>> Timer("range(1000)").timeit()    # number defaults to 1,000,000
11.392634868621826
>>> Timer("xrange(1000)").timeit()
0.20040297508239746
>>> Timer("list(xrange(1000))").timeit()
12.207480907440186
Caveat: Overhead
>>> Timer().timeit(1000000)    # an empty Timer times "pass": pure harness overhead
0.029289960861206055
Caveat: Wall time not CPU time
>>> Timer("xrange(1000)").timeit()
0.20040297508239746
>>> Timer("xrange(1000)").repeat(3)
[0.20735883712768555, 0.1968221664428711, 0.18882489204406738]

Take the minimum: the slower runs only measure interference from the OS and other processes.
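The pattern above generalizes to comparing any two candidates. A small Python 3 sketch (the talk uses Python 2; the two list-of-squares candidates here are illustrative) that runs each benchmark three times with repeat() and keeps the minimum:

```python
from timeit import Timer

# Two hypothetical candidates for building a list of squares. repeat()
# runs each full benchmark three times; take the minimum, since slower
# runs only add noise from the OS and other processes.
loop_times = Timer("[i * i for i in range(1000)]").repeat(repeat=3, number=1000)
map_times = Timer("list(map(lambda i: i * i, range(1000)))").repeat(
    repeat=3, number=1000)

best_loop = min(loop_times)
best_map = min(map_times)
```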
How to Profile
>>> import mod
>>> import cProfile
>>> cProfile.run("mod.b()", sort="cumulative")
How to Profile
>>> import mod
>>> import cProfile
>>> cProfile.run("mod.b()", sort="cumulative")

(make changes to the module)

>>> reload(mod)
>>> cProfile.run("mod.b()", sort="cumulative")
How to Profile
>>> cProfile.run("for i in xrange(3000): range(i).sort()",
...              sort="cumulative")
         6002 function calls in 0.093 seconds
Ordered by: cumulative time
 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.019    0.019    0.093    0.093 <string>:1(<module>)
   3000    0.052    0.000    0.052    0.000 {list.sort}
   3000    0.022    0.000    0.022    0.000 {range}
      1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
How to Profile
6002 function calls in 0.093 seconds
ncalls tottime percall cumtime percall filename:lineno(func)
   3000    0.052    0.000    0.052    0.000 {list.sort}
   3000    0.022    0.000    0.022    0.000 {range}
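The same workflow can be driven programmatically in Python 3, which is handy in scripts. A sketch (the workload function noisy is hypothetical) that captures the pstats report in a string instead of printing it:

```python
import cProfile
import io
import pstats

def noisy():
    # Hypothetical workload: build and sort many lists.
    for i in range(300):
        sorted(list(range(i)))

prof = cProfile.Profile()
prof.enable()
noisy()
prof.disable()

# Capture the report instead of writing it to stdout.
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```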
Example: Standard Deviation
>>> import numpy
>>> n = 100
>>> a = numpy.array(xrange(n), dtype=float)
>>> a.std(ddof=1)
29.011491975882016
Example: Standard Deviation
>>> n = 4000000000
>>> a = numpy.array(xrange(n), dtype=float)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
Example: Standard Deviation
>>> n = 4000000000
>>> arr = numpy.zeros(n, dtype=float)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
Example: Standard Deviation
Example: Standard Deviation
Given an array A broken into n parts a_1 ... a_n,
with local mean ā_i and local variance sum

    V(a_i) = Σ_j (a_ij - ā_i)²

the combined standard deviation is

    σ = √( Σ_{i=1..n} [ V(a_i) + 2(Σ_j a_ij)(ā_i - Ā) + |a_i|(Ā² - ā_i²) ] / (|A| - ddof) )

where Ā is the mean of the whole array A and |a_i| is the length of part i.
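The identity can be checked numerically. A Python 3 sketch (NumPy assumed available; the split into 10 parts is arbitrary) that merges per-part results with the formula above and compares against NumPy's single-pass answer:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=1000)
parts = np.array_split(A, 10)   # A broken into n = 10 parts

ddof = 1
g = A.mean()   # grand mean Ā (computable from the per-part totals alone)

final = 0.0
for p in parts:
    m = p.mean()                  # local mean ā_i
    T = p.sum()                   # local total Σ_j a_ij
    V = ((p - m) ** 2).sum()      # local variance sum V(a_i)
    # Per-part term from the formula above.
    final += V + 2 * T * (m - g) + (g ** 2 - m ** 2) * len(p)

merged = math.sqrt(final / (len(A) - ddof))
direct = A.std(ddof=1)   # single-pass answer for comparison
```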
Example: Standard Deviation
def run():
    points = 400000   # (add four zeros for the full 4-billion-row run)
    segments = 100
    part_len = points / segments

    partitions = []
    for p in range(segments):
        part = range(part_len * p,
                     part_len * (p + 1))
        partitions.append(part)

    return stddev(partitions, ddof=1)
Example: Standard Deviation
def stddev(partitions, ddof=0):
    final = 0.0
    for part in partitions:
        m = total(part) / length(part)

        # Find the mean of the entire group.
        gtotal = total([total(p) for p in partitions])
        glength = total([length(p) for p in partitions])
        g = gtotal / glength

        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj

    return math.sqrt(final / (glength - ddof))
Example: Standard Deviation
2052106 function calls in 71.025 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   71.023   71.023 stddev.py:39(run)
      1    0.006    0.006   71.013   71.013 stddev.py:22(stddev)
 410400   63.406    0.000   70.490    0.000 stddev.py:4(total)
    100    0.341    0.003   69.178    0.692 stddev.py:15(varsum)
 410601    7.076    0.000    7.076    0.000 {range}
 410200    0.151    0.000    0.174    0.000 stddev.py:11(length)
 820700    0.042    0.000    0.042    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
Example: Standard Deviation
400,000 points in 71.025 seconds
Assuming no other effects of scale, it will take 197.3 hours (over 8 days) to calculate our 4 billion-row array.
Example: Standard Deviation
Can we calculate our 4 billion-row array in
1 minute?
That’s 400,000 in 6 ms.
All we need is an 11,837.5x speedup.
Optimization
Example: Standard Deviation
2052106 function calls in 71.025 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   71.023   71.023 stddev.py:39(run)
      1    0.006    0.006   71.013   71.013 stddev.py:22(stddev)
 410400   63.406    0.000   70.490    0.000 stddev.py:4(total)
    100    0.341    0.003   69.178    0.692 stddev.py:15(varsum)
 410601    7.076    0.000    7.076    0.000 {range}
 410200    0.151    0.000    0.174    0.000 stddev.py:11(length)
 820700    0.042    0.000    0.042    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
Amongst Our Weaponry
Extracting loop invariants
Extracting Loop Invariants
def varsum(arr):
    vs = 0
    for j in range(len(arr)):
        mean = (total(arr) / length(arr))
        vs += (arr[j] - mean) ** 2
    return vs
Extracting Loop Invariants
def varsum(arr):
    vs = 0
    mean = (total(arr) / length(arr))
    for j in range(len(arr)):
        vs += (arr[j] - mean) ** 2
    return vs
Extracting Loop Invariants
52606 function calls in 1.944 seconds (36x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    1.942    1.942 stddev1.py:41(run)
      1    0.006    0.006    1.932    1.932 stddev1.py:23(stddev)
  10500    1.673    0.000    1.859    0.000 stddev1.py:4(total)
  10701    0.196    0.000    0.196    0.000 {range}
    100    0.062    0.001    0.081    0.001 stddev1.py:15(varsum)
  10300    0.003    0.000    0.003    0.000 stddev1.py:11(length)
  20900    0.001    0.000    0.001    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
still 5.4 hrs
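The effect is easy to reproduce at small scale. A Python 3 micro-benchmark (data and sizes are illustrative) timing the same varsum pattern with the mean recomputed on every pass versus hoisted out of the loop:

```python
from timeit import Timer

# Sum of squared deviations, two ways: with the loop-invariant mean
# recomputed inside the loop, and with it hoisted out.
setup = "data = list(range(1000))"

inside = Timer(
    "vs = 0\n"
    "for x in data:\n"
    "    m = sum(data) / len(data)\n"   # invariant recomputed every pass
    "    vs += (x - m) ** 2",
    setup,
).timeit(number=20)

hoisted = Timer(
    "m = sum(data) / len(data)\n"       # invariant computed once
    "vs = 0\n"
    "for x in data:\n"
    "    vs += (x - m) ** 2",
    setup,
).timeit(number=20)
```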
Extracting Loop Invariants
def stddev(partitions, ddof=0):
    final = 0.0
    for part in partitions:
        m = total(part) / length(part)

        # Find the mean of the entire group.
        gtotal = total([total(p) for p in partitions])
        glength = total([length(p) for p in partitions])
        g = gtotal / glength

        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj

    return math.sqrt(final / (glength - ddof))
Extracting Loop Invariants
def stddev(partitions, ddof=0):
    final = 0.0

    # Find the mean of the entire group.
    gtotal = total([total(p) for p in partitions])
    glength = total([length(p) for p in partitions])
    g = gtotal / glength

    for part in partitions:
        m = total(part) / length(part)
        adj = ((2 * total(part) * (m - g)) +
               ((g ** 2 - m ** 2) * length(part)))
        final += varsum(part) + adj

    return math.sqrt(final / (glength - ddof))
Extracting Loop Invariants
2512 function calls in 0.142 seconds (13x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.140    0.140 stddev1.py:42(run)
      1    0.000    0.000    0.136    0.136 stddev1.py:23(stddev)
    100    0.063    0.001    0.082    0.001 stddev1.py:15(varsum)
    402    0.064    0.000    0.071    0.000 stddev1.py:4(total)
    603    0.013    0.000    0.013    0.000 {range}
    400    0.000    0.000    0.000    0.000 stddev1.py:11(length)
    902    0.000    0.000    0.000    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
still 23 minutes
Amongst Our Weaponry
Use builtin Python functions whenever possible
Use Python Builtins
def total(arr):
    s = 0
    for j in range(len(arr)):
        s += arr[j]
    return s
Use Python Builtins
def total(arr):
    s = 0
    for j in range(len(arr)):
        s += arr[j]
    return s

def total(arr):
    return sum(arr)
Use Python Builtins
2110 function calls in 0.096 seconds (1.47x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.093    0.093 stddev1.py:39(run)
      1    0.000    0.000    0.083    0.083 stddev1.py:20(stddev)
    100    0.065    0.001    0.070    0.001 stddev1.py:12(varsum)
    402    0.000    0.000    0.015    0.000 stddev1.py:4(total)
    402    0.015    0.000    0.015    0.000 {sum}
    201    0.012    0.000    0.012    0.000 {range}
    400    0.000    0.000    0.000    0.000 stddev1.py:8(length)
    500    0.000    0.000    0.000    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
still 16 minutes
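The gap between the hand-written loop and the builtin is easy to measure directly. A Python 3 micro-benchmark (sizes are illustrative) comparing index-by-index accumulation, as in total() above, against the builtin sum():

```python
from timeit import Timer

setup = "data = list(range(10000))"

# Index-by-index accumulation in pure Python, as in the original total().
manual = Timer(
    "s = 0\n"
    "for j in range(len(data)):\n"
    "    s += data[j]",
    setup,
).timeit(number=200)

# The builtin iterates entirely in C.
builtin = Timer("sum(data)", setup).timeit(number=200)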
Use Python Builtins
def varsum(arr):
    vs = 0
    mean = (total(arr) / length(arr))
    for j in range(len(arr)):
        vs += (arr[j] - mean) ** 2
    return vs
Use Python Builtins
def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum((v - mean) ** 2
               for v in arr)
Use Python Builtins
402110 function calls in 0.122 seconds (1.27x slower)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.120    0.120 stddev.py:36(run)
      1    0.000    0.000    0.115    0.115 stddev.py:17(stddev)
    502    0.044    0.000    0.114    0.000 {sum}
    100    0.000    0.000    0.106    0.001 stddev.py:12(varsum)
 400100    0.070    0.000    0.070    0.000 stddev.py:14(genexpr)
    402    0.000    0.000    0.011    0.000 stddev.py:4(total)
…
Amongst Our Weaponry
Reduce function calls
Reduce Function Calls

>>> Timer("sum(a)", "a = range(10)").repeat(3)
[0.15801000595092773, 0.1406857967376709, 0.14577603340148926]
>>> Timer("total(a)",
...       "a = range(10); total = lambda x: sum(x)").repeat(3)
[0.2066800594329834, 0.1998300552368164, 0.21536493301391602]

The extra wrapper costs about 0.000000059 seconds (59 ns) per call.
Reduce Function Calls
def variances_squared(arr):
    mean = (total(arr) / length(arr))
    for v in arr:
        yield (v - mean) ** 2
Reduce Function Calls
def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum((v - mean) ** 2
               for v in arr)

def varsum(arr):
    mean = (total(arr) / length(arr))
    return sum([(v - mean) ** 2 for v in arr])
Reduce Function Calls
2010 function calls in 0.082 seconds (1.17x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.080    0.080 stddev.py:36(run)
      1    0.000    0.000    0.071    0.071 stddev.py:17(stddev)
    100    0.050    0.001    0.056    0.001 stddev.py:12(varsum)
    502    0.020    0.000    0.020    0.000 {sum}
    402    0.000    0.000    0.016    0.000 stddev.py:4(total)
    101    0.009    0.000    0.009    0.000 {range}
    400    0.000    0.000    0.000    0.000 stddev.py:8(length)
    400    0.000    0.000    0.000    0.000 {len}
    100    0.000    0.000    0.000    0.000 {list.append}
      1    0.000    0.000    0.000    0.000 {math.sqrt}
still 13+ minutes
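The generator-versus-list-comprehension trade can be timed directly. A Python 3 sketch (the squared-deviation expression and sizes are illustrative) comparing the two forms inside sum():

```python
from timeit import Timer

setup = "data = list(range(10000))"

# Generator expression: sum() resumes the generator once per element,
# paying a function-call-like cost each time.
gen = Timer("sum((x - 5) ** 2 for x in data)", setup).timeit(number=100)

# List comprehension: allocates the whole list up front, but iterates in
# a single frame, so the per-element overhead is often lower in CPython.
lst = Timer("sum([(x - 5) ** 2 for x in data])", setup).timeit(number=100)
```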
Amongst Our Weaponry
Vector operations with NumPy
Vector Operations
part = numpy.array(xrange(...), dtype=float)

def total(arr):
    return arr.sum()

def varsum(arr):
    return ((arr - arr.mean()) ** 2).sum()
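A self-contained Python 3 check of the same two vectorized helpers (the 4000-element array is an arbitrary stand-in for one partition), verifying them against NumPy's own var():

```python
import numpy as np

part = np.arange(4000, dtype=float)   # stand-in for one partition

def total(arr):
    return arr.sum()

def varsum(arr):
    # One vectorized pass in C instead of a Python-level loop per element.
    return ((arr - arr.mean()) ** 2).sum()

t = total(part)   # 0 + 1 + ... + 3999
v = varsum(part)  # should equal the population variance times the length
```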
Vector Operations
3408 function calls in 0.057 seconds (1.43x)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.057    0.057 stddev1.py:37(run)
    200    0.051    0.000    0.051    0.000 {numpy...array}
      1    0.001    0.001    0.006    0.006 stddev1.py:18(stddev)
    500    0.003    0.000    0.003    0.000 {numpy.ufunc.reduce}
    100    0.001    0.000    0.003    0.000 stddev1.py:14(varsum)
    400    0.000    0.000    0.003    0.000 {numpy.ndarray.sum}
    300    0.000    0.000    0.002    0.000 stddev1.py:6(total)
    100    0.000    0.000    0.001    0.000 {numpy.ndarray.mean}
…
still 9.5 minutes
Vector Operations
3408 function calls in 0.006 seconds (13.6x, excluding array creation)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.001    0.001    0.006    0.006 stddev1.py:18(stddev)
    500    0.003    0.000    0.003    0.000 {numpy.ufunc.reduce}
    100    0.001    0.000    0.003    0.000 stddev1.py:14(varsum)
    400    0.000    0.000    0.003    0.000 {numpy.ndarray.sum}
    300    0.000    0.000    0.002    0.000 stddev1.py:6(total)
    100    0.000    0.000    0.001    0.000 {numpy.ndarray.mean}
…
should be exactly 1 minute
Vector Operations
Let’s try 4 billion!
Bump up that N...
Vector Operations
MemoryError
Oh, yeah...
Amongst Our Weaponry
Parallelization with multiprocessing
Parallelization
from multiprocessing import Pool

def run():
    results = Pool().map(
        run_one, range(segments))
    result = stddev(results)
    return result
Parallelization
def run_one(i):
    p = numpy.memmap(
        'stddev.%d' % i, dtype=float,
        mode='r', shape=(part_len,))

    T, L = p.sum(), float(len(p))
    m = T / L
    V = ((p - m) ** 2).sum()
    return T, L, V
Parallelization
def stddev(TLVs, ddof=0):
    final = 0.0

    totals = [T for T, L, V in TLVs]
    lengths = [L for T, L, V in TLVs]
    glength = sum(lengths)
    g = sum(totals) / glength

    for T, L, V in TLVs:
        m = T / L
        adj = ((2 * T * (m - g)) +
               ((g ** 2 - m ** 2) * L))
        final += V + adj

    return math.sqrt(final / (glength - ddof))
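The merge step is worth verifying on its own, independent of the process pool. A Python 3 sketch (the workers are simulated serially here so the merge logic can be checked in isolation; data and partition count are illustrative) that feeds per-partition (T, L, V) triples to the same stddev() and compares against a single-pass result:

```python
import math
import numpy as np

def stddev(TLVs, ddof=0):
    # Merge the (total, length, varsum) triples the workers return into
    # a single standard deviation, using the adjustment from the talk.
    glength = sum(L for T, L, V in TLVs)
    g = sum(T for T, L, V in TLVs) / glength
    final = 0.0
    for T, L, V in TLVs:
        m = T / L
        final += V + (2 * T * (m - g)) + ((g ** 2 - m ** 2) * L)
    return math.sqrt(final / (glength - ddof))

# Stand-ins for the pool workers: compute each partition's triple serially.
rng = np.random.default_rng(1)
data = rng.normal(size=100_000)
tlvs = [(p.sum(), float(len(p)), ((p - p.mean()) ** 2).sum())
        for p in np.array_split(data, 8)]

merged = stddev(tlvs, ddof=1)
expected = np.std(data, ddof=1)   # single-pass answer for comparison
```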
Parallelization
3734 function calls in 0.024 seconds (6x slower)

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000    0.024    0.024 stddev.py:47(run)
      4    0.000    0.000    0.011    0.003 threading.py:234(wait)
     22    0.011    0.000    0.011    0.000 {thread.lock.acquire}
      1    0.000    0.000    0.011    0.011 pool.py:222(map)
      1    0.000    0.000    0.008    0.008 pool.py:113(__init__)
      4    0.001    0.000    0.005    0.001 process.py:116(start)
      1    0.003    0.003    0.005    0.005 stddev.py:11(stddev)
      4    0.000    0.000    0.004    0.001 forking.py:115(init)
      4    0.003    0.001    0.003    0.001 {posix.fork}
...
Parallelization
Could that waiting be insignificant when we scale up to 4 billion?
Let’s try it!
Parallelization
3766 function calls in 67.811 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   67.811   67.811 stddev.py:47(run)
      4    0.000    0.000   67.747   16.930 threading.py:234(wait)
     22   67.747    3.079   67.747    3.079 {thread.lock.acquire}
      1    0.000    0.000   67.747   67.747 pool.py:222(map)
      1    0.000    0.000    0.062    0.060 pool.py:113(__init__)
      4    0.000    0.000    0.058    0.014 process.py:116(start)
      4    0.057    0.014    0.057    0.014 {posix.fork}
      1    0.003    0.003    0.005    0.005 stddev.py:11(stddev)
      2    0.002    0.001    0.002    0.001 {sum}
SO CLOSE! 1.13 minutes
Parallelization
def run_one(i):
    if i == 50:
        cProfile.runctx(..., "prf.50")

>>> import pstats
>>> s = pstats.Stats("prf.50")
>>> s.sort_stats("cumulative")
<pstats.Stats instance at 0x2bddcb0>
>>> _.print_stats()
Parallelization
57 function calls in 2.804 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.431    0.431    2.791    2.791 stddev.py:43(run_one)
      2    0.000    0.000    2.360    1.180 numpy.ndarray.sum
      2    2.360    1.180    2.360    1.180 numpy.ufunc.reduce
      1    0.000    0.000    0.000    0.000 memmap.py:195(__new__)
Parallelization
def run_one(i):
    p = numpy.memmap(
        'stddev.%d' % i, dtype=float,
        mode='r', shape=(part_len,))

    T, L = p.sum(), float(len(p))
    m = T / L
    V = ((p - m) ** 2).sum()
    return T, L, V

Roughly 2 seconds per segment, times 100 segments:
200 seconds / 4 cores = 50 seconds per core.
Parallelization? Serialization!
67.8 seconds for 4 billion rows, but 50 of those are loading data! Only 17.8 seconds go to the actual math.
Serialization
import bloscpack as bp
bargs = bp.args.DEFAULT_BLOSC_ARGS
bargs['clevel'] = 6

bp.pack_ndarray_file(
    part, fname, blosc_args=bargs)

part = bp.unpack_ndarray_file(fname)
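bloscpack is a third-party library, but the underlying trade-off (CPU spent compressing versus bytes moved to and from disk) can be sketched with the standard library alone. A Python 3 stand-in using zlib at the same compression level on a NumPy array:

```python
import zlib
import numpy as np

# Stand-in for the blosc round trip above, using stdlib zlib instead.
part = np.arange(100_000, dtype=float)

raw = part.tobytes()
packed = zlib.compress(raw, level=6)   # analogous to clevel=6 above

# Round trip: the reader reconstructs an identical array.
restored = np.frombuffer(zlib.decompress(packed), dtype=float)
ratio = len(raw) / len(packed)         # >1 means fewer bytes hit the disk
```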
Serialization
Let’s try it!
I Crush Your Head!
I Crush Your Head!
1153 function calls in 26.166 seconds

 ncalls  tottime  percall  cumtime  percall filename:lineno(func)
      1    0.000    0.000   26.166   26.166 stddev_bp.py:56(run)
      4    0.000    0.000   26.134     6.53 threading.py:234(wait)
     22   26.134    1.188   26.134    1.188 {thread.lock.acquire}
      1    0.000    0.000   26.133   26.133 pool.py:222(map)
      1    0.000    0.000   26.133   26.133 pool.py:521(get)
      1    0.000    0.000   26.133   26.133 pool.py:513(wait)
      1    0.003    0.003    0.030    0.030 __init__.py:227(Pool)
      1    0.000    0.000    0.021    0.021 pool.py:113(__init__)
I Crush Your Head!
With some time-tested general programming techniques:
Extract loop invariants
Use language builtins
Reduce function calls
I Crush Your Head!
And some Python libraries for architectural improvements:
Use NumPy for vector ops
Use multiprocessing for parallelization
Use bloscpack for compression
I Crush Your Head!
We sped up our calculation so that it runs in:
0.003% of the time
or 27317 times faster
4.4 orders of magnitude