london level39
DESCRIPTION
Talk about Python and "Big Data" at Level39 in Canary Wharf in London on September 30, 2013.
TRANSCRIPT
![Page 1: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/1.jpg)
Python in the future of “Big Data” analytics
Travis Oliphant, PhD
Continuum Analytics, Inc
September 30, 2013
London, UK
![Page 2: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/2.jpg)
Beginnings
After
Before
$\rho_0 (2\pi f)^2 U_i(a, f) = \left[ C_{ijkl}(a, f)\, U_{k,l}(a, f) \right]_{,j}$
![Page 3: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/3.jpg)
Python origins
Version Date
0.9.0 Feb. 1991
0.9.4 Dec. 1991
0.9.6 Apr. 1992
0.9.8 Jan. 1993
1.0.0 Jan. 1994
1.2 Apr. 1995
1.4 Oct. 1996
1.5.2 Apr. 1999
http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html
![Page 4: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/4.jpg)
A sample of users
![Page 5: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/5.jpg)
Why Python
License
• Free and Open Source, Permissive License
Community
• Broad and friendly community
• Over 34,000 packages on PyPI
• Commercial Support
• Many conferences (PyData, SciPy, PyCons...)
Readable Syntax
• Executable pseudo-code
• Can understand and edit code a year later
• Fun to develop
• Use of indentation
IPython
• Interactive prompt on steroids
• Allows less working memory
• Allows failing quickly for exploration
Modern Constructs
• List comprehensions
• Iterator protocol and generators
• Meta-programming
• Introspection
• (JIT Compiler and Concurrency)
Batteries Included
• Internet (FTP, HTTP, SMTP, XMLRPC)
• Compression and Databases
• Logging, unit-tests
• Glue for other languages
• Distribution has much, much more....
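A minimal sketch of three of the "Modern Constructs" listed on this slide, using only the standard language. The names `squares` and `countdown` are illustrative and do not come from the talk.

```python
# List comprehension: build a list from an expression in one line.
squares = [n * n for n in range(5)]

# Generator: a function that yields values lazily and follows
# the iterator protocol (iter() / next()).
def countdown(n):
    while n > 0:
        yield n
        n -= 1

it = iter(countdown(3))
first = next(it)      # pulls a single value from the generator
rest = list(it)       # consumes whatever is left

# Introspection: objects describe themselves at runtime.
name = countdown.__name__

print(squares, first, rest, name)
```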
![Page 6: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/6.jpg)
Python supports a developer spectrum
Occasional
• Cut and paste
• Modify a few variables
• Call some functions
• Typical Quant or Engineer who doesn’t become programmer
Scientist Developer
• Extend frameworks
• Builds new objects
• Wraps code
• Quant / Engineer with decent developer skill
Developer
• Creates frameworks
• Creates compilers
• Typical CS grad
• Knows multiple languages
Unique aspect of Python
![Page 7: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/7.jpg)
1999: Early SciPy emerges
Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.
In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy in 2001.
Gaussian quadrature 5 Jan 1999
cephes 1.0 30 Jan 1999
sigtools 0.40 23 Feb 1999
Numeric docs March 1999
cephes 1.1 9 Mar 1999
multipack 0.3 13 Apr 1999
Helper routines 14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad) 29 Apr 1999
sparse plan described 30 May 1999
multipack 0.7 14 Jun 1999
SparsePy 0.1 5 Nov 1999
cephes 1.2 (vectorize) 29 Dec 1999
Plotting??
Gist, XPLOT, DISLIN, Gnuplot
Helping with f2py
![Page 8: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/8.jpg)
Brief History
Person Package Year
Jim Fulton Matrix Object in Python 1994
Jim Hugunin Numeric 1995
Perry Greenfield, Rick White, Todd Miller Numarray 2001
Travis Oliphant NumPy 2005
![Page 9: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/9.jpg)
Community effort
many, many others!
• Chuck Harris
• Pauli Virtanen
• Nathaniel Smith
• Warren Weckesser
• Ralf Gommers
• Robert Kern
• David Cournapeau
• Stefan van der Walt
• Jake Vanderplas
• Josef Perktold
• Anne Archibald
• Dag Sverre Seljebotn
• Joe Harrington --- Documentation effort
• Andrew Straw --- www.scipy.org
![Page 10: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/10.jpg)
About 2,000,000 users of NumPy!
![Page 11: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/11.jpg)
Scientific Stack
NumPy
SciPy Pandas Matplotlib
scikit-learn scikit-image statsmodels
PyTables
OpenCV
Cython
Numba SymPy NumExpr
astropy BioPython GDAL PySAL
... many many more ...
![Page 12: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/12.jpg)
Now What?
After watching NumPy and SciPy get used all over Science and Technology (including Finance) --- what would I do differently?
Blaze
Numba
Conda (Anaconda)
![Page 13: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/13.jpg)
Continuum began operations in January of 2012
Python
Travis Oliphant Peter Wang
![Page 14: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/14.jpg)
(Most of) Our Team
Scientists Developers Business
NumFOCUS
![Page 15: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/15.jpg)
expertise
Big Picture
![Page 16: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/16.jpg)
We are big backers of NumFOCUS and organizers of PyData
Spyder
![Page 17: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/17.jpg)
How we pay the bills
Enterprise Python
Scientific Computing
Data Processing
Data Analysis
Visualisation
Scalable Computing
• Products
• Training
• Support
• Consulting
![Page 18: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/18.jpg)
“Big Data” and the Hype Cycle
![Page 19: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/19.jpg)
Advanced Analytics and HPC
HPC: Supercomputing
HSC: Fault Tolerance, Erasure Coding, Hadoop / Disco
MPI, Big-Compute
Scalapack, Trilinos, PETSc, GPUs
Python?
![Page 20: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/20.jpg)
Python and Science
Python is the “language of Science”
(Lots of R users might disagree)
IPython notebook is quickly becoming the way scientists communicate about their work
Pandas has recently started converting even R users to Python
![Page 21: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/21.jpg)
The problem of Hadoop
Hadoop wants to be the OS for “big-data”. Advanced analytics and Hadoop don’t blend well.
Many people (led by hype) use Hadoop when they don’t need to --- and it slows them down and costs them $$. Scale up first. Then, scale-out.
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
“Don’t use Hadoop --- your data is not that big”
![Page 22: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/22.jpg)
Options if you do need Hadoop
• Give Disco a try
• Try a non Java-specific emerging alternative to HDFS (OrangeFS, GlusterFS, CephFS, Swift)
• Use Python wrapper to HDFS (snakebite, webHDFS) and interface to map-reduce (luigi, mrjob, MortarData CPython UDF etc.)
![Page 23: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/23.jpg)
“Data Has Mass”
http://blog.mccrory.me/2010/12/07/data-gravity-in-the-clouds/
![Page 24: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/24.jpg)
Workflow Perspective
![Page 25: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/25.jpg)
Workflow Perspective
Data-centric Perspective
![Page 26: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/26.jpg)
The largest data analysis gap is in this man-machine interface. How can we put the scientist back in control of his data? How can we build analysis tools that are intuitive and that augment the scientist’s intellect rather than adding to the intellectual burden with a forest of arcane user tools? The real challenge is building this smart notebook that unlocks the data and makes it easy to capture, organize, analyze, visualize, and publish.
-- Jim Gray et al, 2005
![Page 27: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/27.jpg)
Why Don’t Scientists Use DBs?
• Do not support scientific data types, or access patterns particular to a scientific problem
• Scientists can handle their existing data volumes using programming tools
• Once data was loaded, could not manipulate it with standard/familiar programs
• Poor visualization and plotting integration
• Require an expensive guru to maintain
![Page 28: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/28.jpg)
“If one takes the controversial view that HDF, NetCDF, FITS, and Root are nascent database systems that provide metadata and portability but lack non-procedural query analysis, automatic parallelism, and sophisticated indexing, then one can see a fairly clear path that integrates these communities.”
Convergence
![Page 29: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/29.jpg)
Key Question
How do we move code to data, while avoiding data silos?
![Page 30: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/30.jpg)
Continuum key OS technologies
Conda: Cross-platform package manager (with environments)
Numba: Array-oriented Python compiler for CPUs and GPUs (speed target is Fortran)
Blaze: NumPy and Pandas for out-of-core and distributed data (general database execution engine for data-flow subset of Python)
Bokeh: Browser-based interactive visualization for Python users
CDX: Continuum Data Explorer
Ashiba: New web-app building with only Python and a little HTML
![Page 31: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/31.jpg)
Our Emerging Platform
Rapid App Platform for SMEs
Wakari
Anaconda
Binstar
![Page 32: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/32.jpg)
What is Conda
• Full package management (like yum or apt-get) but cross-platform
• Control over environments (using link farms) --- better than virtualenv. virtualenv today is like distutils and setuptools of several years ago (great at first, but you will end up hating it)
• Architected to be able to manage any packages (R, Scala, Clojure, Haskell, Ruby, JS)
• SAT solver to manage dependencies
• User-definable repositories
![Page 33: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/33.jpg)
Binstar
![Page 34: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/34.jpg)
Packaging and Distribution Solved
• conda and binstar solve most of the problems that we have seen people encounter in managing Python installations (especially in large-scale institutions).
• They are supported solutions that can remove the technology pain of managing Python
• Allow focus on software architecture and separation of components (not just whatever makes packaging convenient)
![Page 35: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/35.jpg)
Anaconda
Free enterprise-ready Python distribution of open-source tools for large-scale data processing, predictive analytics, and scientific computing
![Page 36: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/36.jpg)
Anaconda Add-Ons (paid-for)
• Revolutionary Python to GPU compiler
• Extends Numba to take a subset of Python to the GPU (program CUDA in Python)
• CUDA FFT / BLAS interfaces
Fast, memory-efficient Python interface for SQL databases, NoSQL stores, Amazon S3, and large data files.
NumPy, SciPy, scikit-learn, NumExpr compiled against Intel’s Math Kernel Library (MKL)
![Page 37: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/37.jpg)
Launcher
![Page 38: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/38.jpg)
Why Numba?
• Python is too slow for loops
• Most people are not learning C/C++/Fortran today
• Cython is an improvement (but still verbose and needs a C compiler)
• NVIDIA is using LLVM for the GPU
• Many people work with large typed containers (NumPy arrays)
• We want to take high-level, array-oriented expressions and compile them to fast code
![Page 39: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/39.jpg)
NumPy + Mamba = Numba
LLVM Library
Intel Nvidia Apple AMD
OpenCL ISPC CUDA CLANG OpenMP
LLVMPY
Python Function Machine Code
ARM
![Page 40: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/40.jpg)
Example
Numba
![Page 41: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/41.jpg)
Numba
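    from numba import jit

    # Explicit signature: three 2-D float64 arrays, no return value.
    @jit('void(f8[:,:],f8[:,:],f8[:,:])')
    def filter(image, filt, output):
        M, N = image.shape
        m, n = filt.shape
        # Visit only the pixels where the whole kernel fits inside the image.
        for i in range(m//2, M-m//2):
            for j in range(n//2, N-n//2):
                result = 0.0
                for k in range(m):
                    for l in range(n):
                        result += image[i+k-m//2,j+l-n//2]*filt[k, l]
                output[i,j] = result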
~1500x speed-up
![Page 42: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/42.jpg)
Numba changes the game!
C++, C, Fortran, Python => LLVM IR => x86, ARM, PTX
Numba turns (a subset of) Python into a “compiled language” as fast as C (but much more flexible). You don’t have to reach for C/C++.
![Page 43: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/43.jpg)
Laplace Example
    from numba import jit

    @jit('void(double[:,:], double, double)')
    def numba_update(u, dx2, dy2):
        nx, ny = u.shape
        # Update each interior point in place from its four neighbours.
        for i in range(1, nx-1):
            for j in range(1, ny-1):
                u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                          (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))
Adapted from http://www.scipy.org/PerformancePython originally by Prabhu Ramachandran
    @jit('void(double[:,:], double, double)')
    def numbavec_update(u, dx2, dy2):
        u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
                        (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
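To make the update rule concrete without Numba or NumPy, here is a pure-Python sketch of the same sweep on nested lists. The function name `laplace_sweep` and the 3x3 grid are illustrative assumptions and are not part of the original benchmark.

```python
def laplace_sweep(u, dx2, dy2):
    """One in-place sweep of the Laplace update over the interior points."""
    nx, ny = len(u), len(u[0])
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i][j] = ((u[i + 1][j] + u[i - 1][j]) * dy2 +
                       (u[i][j + 1] + u[i][j - 1]) * dx2) / (2 * (dx2 + dy2))

# Hypothetical 3x3 grid: top boundary held at 1.0, everything else 0.0.
grid = [[1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
laplace_sweep(grid, dx2=1.0, dy2=1.0)
print(grid[1][1])  # the single interior point becomes the mean of its neighbours
```

With equal grid spacing the rule reduces to averaging the four neighbours, so the interior point takes one quarter of the top boundary value.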
![Page 44: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/44.jpg)
Results of Laplace example
Version Time Speed Up
NumPy 3.19 1.0
Numba 2.32 1.38
Vect. Numba 2.33 1.37
Cython 2.38 1.34
Weave 2.47 1.29
Numexpr 2.62 1.22
Fortran Loops 2.30 1.39
Vect. Fortran 1.50 2.13
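The Speed Up column appears to be the NumPy time divided by each version's time. A short sketch of that arithmetic, using the times from the slide (the dictionary name `times` is illustrative):

```python
# Times copied from the results slide; NumPy is the baseline.
times = {
    "NumPy": 3.19,
    "Numba": 2.32,
    "Vect. Numba": 2.33,
    "Cython": 2.38,
    "Weave": 2.47,
    "Numexpr": 2.62,
    "Fortran Loops": 2.30,
    "Vect. Fortran": 1.50,
}

baseline = times["NumPy"]
speedups = {name: round(baseline / t, 2) for name, t in times.items()}

for name, s in speedups.items():
    print(name, s)
```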
https://github.com/teoliphant/speed.git
![Page 45: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/45.jpg)
LLVMPy worth looking at
LLVM (via LLVMPy) has done much heavy lifting
LLVMPy = Compilers for everybody
![Page 46: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/46.jpg)
New Project
Blaze
NumPy
Out of Core, Distributed and Optimized NumPy
![Page 47: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/47.jpg)
Blaze Objectives
• Flexible descriptor for tabular and semi-structured data
• Seamless handling of:
  • On-disk / Out of core
  • Streaming data
  • Distributed data
• Uniform treatment of:
  • “arrays of structures” and “structures of arrays”
  • missing values
  • “ragged” shapes
  • categorical types
  • computed columns
![Page 48: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/48.jpg)
Blaze Deferred Arrays
A + B*C
[Diagram: expression graph with root + whose children are A and *; the * node has children B and C]
• Symbolic objects which build a graph
• Represents deferred computation
Usually what you have when you have a Blaze Array
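The idea of a deferred array can be sketched in plain Python: operators build a graph, and nothing is computed until evaluation is requested. This is only an illustration of the concept. The `Deferred` class and `leaf` helper are hypothetical and are not Blaze's API.

```python
import operator

class Deferred:
    """Node in an expression graph; nothing is computed until evaluate()."""

    def __init__(self, op=None, args=(), value=None, name=None):
        self.op, self.args, self.value, self.name = op, args, value, name

    def __add__(self, other):
        return Deferred(operator.add, (self, other), name="+")

    def __mul__(self, other):
        return Deferred(operator.mul, (self, other), name="*")

    def evaluate(self):
        if self.op is None:              # leaf node: holds concrete data
            return self.value
        left, right = (a.evaluate() for a in self.args)
        return [self.op(x, y) for x, y in zip(left, right)]

    def __repr__(self):
        if self.op is None:
            return self.name
        return "(%r %s %r)" % (self.args[0], self.name, self.args[1])

def leaf(name, data):
    """Wrap concrete data as a leaf of the graph (hypothetical helper)."""
    return Deferred(value=list(data), name=name)

A = leaf("A", [1, 2, 3])
B = leaf("B", [10, 20, 30])
C = leaf("C", [2, 2, 2])

expr = A + B * C            # only builds the graph
print(expr)                 # (A + (B * C))
result = expr.evaluate()    # the computation happens here
print(result)               # [21, 42, 63]
```

Keeping the graph around before evaluating is what would let an engine reorder, fuse, or distribute the work, which is the motivation the slide gives for deferred arrays.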
![Page 49: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/49.jpg)
DataShape Type System
• A data description language
• A super-set of NumPy’s dtype
• Provides more flexibility
• Integration with PADS coming
Shape DType
DataShape
![Page 50: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/50.jpg)
Blaze
[Diagram: Blaze deployment]
Data sources: Database, GPU Node, NFS, each behind an Array Server
URI schemes: array+sql://, array://, file://
Blaze Client: synthesized array/table view
Consumers: Python REPL and scripts, Viz Data Server, C, C++, FORTRAN, JVM languages
![Page 51: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/51.jpg)
Progress
• Basic calculations work out-of-core (via Numba and LLVM)
• Hard dependency on dynd and dynd-python (a dynamic C++-only multi-dimensional library like NumPy but with many improvements)
• Persistent arrays from BLZ
• Basic array-server functionality for layering over CSV files
• 0.2 release in 1-2 weeks. 0.3 within a month after that (first usable release)
![Page 52: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/52.jpg)
Querying BLZ
    In [15]: from blaze import blz
    In [16]: t = blz.open("TWITTER_LOG_Wed_Oct_31_22COLON22COLON28_EDT_2012-lvl9.blz")
    In [17]: t['(latitude>7) & (latitude<10) & (longitude >-10 ) & (longitude < 10) '] # query
    Out[17]:
    array([ (263843037069848576L, u'Cossy set to release album:http://t.co/Nijbe9GgShared via Nigeria News for Android. @', datetime.datetime(2012, 11, 1, 3, 20, 56), 'moses_peleg', u'kaduna', 9.453095, 8.0125194, ''),
    ...
    dtype=[('tid', '<u8'), ('text', '<U140'), ('created_at', '<M8[us]'), ('userid', 'S16'), ('userloc', '<U64'), ('latitude', '<f8'), ('longitude', '<f8'), ('lang', 'S2')])
    In [18]: t[1000:3000] # get a range of tweets
    Out[18]:
    array([ (263829044892692480L, u'boa noite? ;( \ue058\ue41d', datetime.datetime(2012, 11, 1, 2, 25, 20), 'maaribeiro_', u'', nan, nan, ''),
           (263829044875915265L, u"Nah but I'm writing a gym journal... Watch it last 2 days!", datetime.datetime(2012, 11, 1, 2, 25, 20), 'Ryan_Shizzle', u'Shizzlesville', nan, nan, ''),
    ...
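What the query string in `In [17]` expresses can be written out in plain Python: keep the rows for which all four comparisons hold. The sketch below does not use BLZ. The `rows` list and the `in_box` helper are hypothetical stand-ins for the table.

```python
# Hypothetical records with the same latitude/longitude fields as the table.
rows = [
    {"userid": "a", "latitude": 9.45, "longitude": 8.01},
    {"userid": "b", "latitude": 51.5, "longitude": -0.1},
    {"userid": "c", "latitude": 8.0, "longitude": 25.0},
    {"userid": "d", "latitude": float("nan"), "longitude": float("nan")},
]

def in_box(row):
    """Same predicate as the query: 7 < latitude < 10 and -10 < longitude < 10."""
    return (row["latitude"] > 7 and row["latitude"] < 10 and
            row["longitude"] > -10 and row["longitude"] < 10)

selected = [row["userid"] for row in rows if in_box(row)]
print(selected)  # ['a']
```

Rows with missing coordinates (`nan`) drop out because every comparison against `nan` is false.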
![Page 53: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/53.jpg)
Kiva: Array Server
DataShape + Raw JSON = Web Service
    type KivaLoan = {
        id: int64;
        name: string;
        description: {
            languages: var, string(2);
            texts: json # map<string(2), string>;
        };
        status: string; # LoanStatusType;
        funded_amount: float64;
        basket_amount: json; # Option(float64);
        paid_amount: json; # Option(float64);
        image: {
            id: int64;
            template_id: int64;
        };
        video: json;
        activity: string;
        sector: string;
        use: string;
        delinquent: bool;
        location: {
            country_code: string(2);
            country: string;
            town: json; # Option(string);
            geo: {
                level: string; # GeoLevelType
                pairs: string; # latlong
                type: string; # GeoTypeType
            }
        };
        ....
{"id":200533,"name":"Miawand Group","description":{"languages":["en"],"texts":{"en":"Ozer is a member of the Miawand Group. He lives in the 16th district of Kabul, Afghanistan. He lives in a family of eight members. He is single, but is a responsible boy who works hard and supports the whole family. He is a carpenter and is busy working in his shop seven days a week. He needs the loan to purchase wood and needed carpentry tools such as tape measures, rulers and so on.\r\n \r\nHe hopes to make progress through the loan and he is confident that will make his repayments on time and will join for another loan cycle as well. \r\n\r\n"}},"status":"paid","funded_amount":925,"basket_amount":null,"paid_amount":925,"image":{"id":539726,"template_id":1},"video":null,"activity":"Carpentry","sector":"Construction","use":"He wants to buy tools for his carpentry shop","delinquent":null,"location":{"country_code":"AF","country":"Afghanistan","town":"Kabul Afghanistan","geo":{"level":"country","pairs":"33 
65","type":"point"}},"partner_id":34,"posted_date":"2010-05-13T20:30:03Z","planned_expiration_date":null,"loan_amount":925,"currency_exchange_loss_amount":null,"borrowers":[{"first_name":"Ozer","last_name":"","gender":"M","pictured":true},{"first_name":"Rohaniy","last_name":"","gender":"M","pictured":true},{"first_name":"Samem","last_name":"","gender":"M","pictured":true}],"terms":{"disbursal_date":"2010-05-13T07:00:00Z","disbursal_currency":"AFN","disbursal_amount":42000,"loan_amount":925,"local_payments":[{"due_date":"2010-06-13T07:00:00Z","amount":4200},{"due_date":"2010-07-13T07:00:00Z","amount":4200},{"due_date":"2010-08-13T07:00:00Z","amount":4200},{"due_date":"2010-09-13T07:00:00Z","amount":4200},{"due_date":"2010-10-13T07:00:00Z","amount":4200},{"due_date":"2010-11-13T08:00:00Z","amount":4200},{"due_date":"2010-12-13T08:00:00Z","amount":4200},{"due_date":"2011-01-13T08:00:00Z","amount":4200},{"due_date":"2011-02-13T08:00:00Z","amount":4200},{"due_date":"2011-03-13T08:00:00Z","amount":4200}],"scheduled_payments": ...
2.9 GB of JSON => network-queryable array: ~5 minutes
Kiva Array Server Demo
![Page 54: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/54.jpg)
DARPA providing help
DARPA-BAA-12-38: XDATA
TA-1: Scalable analytics and data processing technology
TA-2: Visual user interface technology
![Page 55: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/55.jpg)
Bokeh Plotting Library
• Interactive graphics for the web
• Designed for large datasets
• Designed for streaming data
• Native interface in Python
• Fast JavaScript component
• DARPA funded
• v0.1 release imminent
![Page 56: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/56.jpg)
Reasons for Bokeh
1. Plotting must happen near the data too
2. Quick iteration is essential => interactive visualization
3. Interactive visualization on remote-data => use the browser
4. Almost all web plotting libraries are either:
   1. Designed for JavaScript programmers
   2. Designed to output static graphs
5. We designed Bokeh to be dynamic graphing in the web for Python programmers
6. Will include “Abstract” or “synthetic” rendering (working on Hadoop and Spark compatibility)
![Page 57: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/57.jpg)
Abstract Rendering
Pixels are bins… and always have been
[Figure: geometry (shapes A, B, C, D) in the Z-view is binned into pixels; per-pixel counts: 1 2 2 3 4 4 3 2 2 1]
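The "pixels are bins" idea can be illustrated with a small pure-Python sketch that counts how many data points land in each pixel of a grid. The function `bin_points` and the sample points are hypothetical. This is not Bokeh's abstract rendering implementation.

```python
def bin_points(points, width, height, x_range, y_range):
    """Count points per pixel; points outside the view are skipped."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue
        col = int((x - x0) / (x1 - x0) * width)
        row = int((y - y0) / (y1 - y0) * height)
        counts[row][col] += 1
    return counts

# Hypothetical data: five points, the last one outside the view.
points = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.9), (0.6, 0.1), (1.5, 0.5)]
counts = bin_points(points, width=2, height=2, x_range=(0, 1), y_range=(0, 1))
print(counts)  # [[2, 1], [0, 1]]
```

Once the counts exist, choosing colours or alpha per pixel becomes a separate step applied to the aggregate, however many input points there were.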
![Page 58: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/58.jpg)
Hi-def Alpha
![Page 59: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/59.jpg)
Abstract Rendering
Basic AR can identify trouble spots in standard plots, and also offer automatic tone mapping, taking perception into account.
37 million elements, showing adjacency between entities in Kiva dataset
![Page 60: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/60.jpg)
Wakari
• Browser-based data analysis and visualization platform
• Wordpress / YouTube / Github for data analysis
• Full Linux environment with Anaconda Python
• Can be installed on internal clusters & servers
![Page 61: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/61.jpg)
Why Wakari?
• Data is too big to fit on your desktop
• You need compute power but don’t have easy access to a large cluster (cloud is sitting there with lots of power)
• Configuration of software on a new system stinks (especially a cluster).
• Collaborative Data Analytics --- you want to build a complex technical workflow and then share it with others easily (without requiring they do painful configuration to see your results)
• IPython Notebook is awesome --- let’s share it (but we also need the dependencies and data).
![Page 62: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/62.jpg)
Wakari
• Free account has 512 MB RAM / 2 GB disk and shared multi-core CPU
• Easily spin up map-reduce clusters (Disco and Hadoop)
• Use IPython Parallel on many nodes in the cloud
• Develop GUI apps (possibly in Anaconda) and publish them easily to Wakari, based on the full power of scientific Python for complex technical workflows (IPython notebook for now)
![Page 63: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/63.jpg)
Basic Data Explorer
![Page 64: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/64.jpg)
Continuum Data Explorer (CDX)
• Open Source
• Goal is interactivity
• Combination of IPython REPL, Bokeh, and tables
• Tight integration between GUI elements and REPL
• Current features
  - Namespace viewer (mapped to IPython namespace)
  - DataTable widget with group-by, computed columns, advanced-filters
  - Interactive Plots connected to tables
![Page 65: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/65.jpg)
CDX
![Page 66: London level39](https://reader036.vdocuments.mx/reader036/viewer/2022070304/54c684cf4a7959ab1a8b458a/html5/thumbnails/66.jpg)
Conclusion
Our projects center on giving tools to experts (occasional programmers or domain experts) so they can move their expertise to the data and get insights: keep the data where it is and move high-level but performant code to it.
Join us or ask how we can help you!