python for science and engineering: a presentation to a*star and the singapore computational...
DESCRIPTION
An introduction to Python in science and engineering. The presentation was given by Dr Edward Schofield of Python Charmers (www.pythoncharmers.com) to A*STAR and the Singapore Computational Sciences Club in June 2011.
TRANSCRIPT
Python for Science and Engineering
Dr Edward Schofield
A*STAR / Singapore Computational Sciences Club Seminar
June 14, 2011
Scientific programming in 2011
Most scientists and engineers are:
programming for 50+% of their work time (and rising)
self-taught programmers
using inefficient programming practices
using the wrong programming languages: C++, FORTRAN, C#, PHP, Java, ...
Scientific programming needs
Rapid prototyping
Efficiency for computational kernels
Pre-written packages!
Vectors, matrices, modelling, simulations, visualisation
Extensibility; web front-ends; database backends; ...
Ed's story: How I found Python
PhD in statistical pattern recognition: 2001-2006
Needed good tools for my research!
Discovered Python in 2002 after frustration with C++, Matlab, Java, Perl
Contributed to NumPy and SciPy:
maxent, sparse matrices, optimization, Monte Carlo, etc.
Managed six releases of SciPy in 2005-6
1. Why Python?
Introducing Python
What is it?
What is it good for?
Who uses it?
What is Python?
interpreted
strongly but dynamically typed
object-oriented
intuitive, readable
open source, free
‘batteries included’
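A small sketch of what "strongly but dynamically typed" means in practice. The variable names are illustrative only and are not from the talk.

```python
# Dynamic typing: a name can be rebound to objects of different types.
value = 42
type_before = type(value).__name__
value = "forty-two"
type_after = type(value).__name__

# Strong typing: Python refuses to silently mix incompatible types.
try:
    result = "1" + 1
except TypeError:
    result = None  # no implicit conversion takes place

# Conversions have to be requested explicitly.
explicit_sum = int("1") + 1

print(type_before, type_after, result, explicit_sum)
```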
‘batteries included’
Python’s standard library is:
very large
well-supported
well-documented
Python’s standard library
data types, strings, networking, threads,
operating system, compression, GUI, arguments,
CGI, complex numbers, FTP, cryptography,
testing, multimedia, databases, CSV files,
calendar, email, XML, serialization
What is an efficient programming language?
Native Python code executes 10x more slowly than C and FORTRAN
Would you build a racing car ...... to get to Kuala Lumpur ASAP?
Date        Cost per GFLOPS (US $)   Technology
1961        $1.1 trillion            17 million IBM 1620s
1984        $15,000,000              Cray X-MP
1997        $30,000                  Two 16-CPU clusters of Pentiums
2000, Apr   $1000                    Bunyip Beowulf cluster
2003, Aug   $82                      KASY0
2007, Mar   $0.42                    Ambric AM2045
2009, Sep   $0.13                    ATI Radeon R800
Source: Wikipedia: “FLOPS”
Unit labor cost growth: proxy for cost of programmer time
Efficiency
When FORTRAN was invented, computer time was more expensive than programmer time.
In the 1980s and 1990s that reversed.
Efficient programming
Python code is 10x faster to write than C and FORTRAN
What if ...... you now need to reach Sydney?
Advantages of Python
Easy to write
Easy to maintain
Great standard libraries
Thriving ecosystem of third-party packages
Open source
‘Batteries included’
Python’s standard library is:
very large
well supported
well documented
Python’s standard library
data types, strings, networking, threads,
operating system, compression, GUI, arguments,
CGI, complex numbers, FTP, cryptography,
testing, multimedia, databases, CSV files,
calendar, email, XML, serialization
Question: What is the date 177 days from now?
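One possible answer using only the standard library's datetime module. The helper name days_from is made up for this sketch.

```python
from datetime import date, timedelta

def days_from(start, days):
    """Return the calendar date `days` days after `start`."""
    return start + timedelta(days=days)

# The answer for "now" depends on when this is run.
print(days_from(date.today(), 177))

# A fixed starting point, here the date of the seminar.
print(days_from(date(2011, 6, 14), 177))
```

Date arithmetic, month lengths and leap years are all handled by the library, with no third-party package needed.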
Natural applications of Python
Rapid prototyping
Plotting, visualisation, 3D
Numerical computing
Web and database programming
All-purpose glue
Python vs other languages
Languages used at CSIRO
Python, Fortran, Java,
Matlab, C, VB.net,
IDL, C++, R,
Perl, C#, +5-10 others!
Which language do I choose?
A different language for each task?
A language you know?
A language others in your team are using: support and help?
                             Python     Matlab
Interpreted                  Yes        Yes
Powerful data input/output   Yes        Yes
Great plotting               Yes        Yes
General-purpose language     Powerful   Limited
Cost                         Free       $$$
Open source                  Yes        No
                             Python     C++
Powerful                     Yes        Yes
Portable                     Yes        In theory
Standard libraries           Vast       Limited
Easy to write and maintain   Yes        No
Easy to learn                Yes        No
                                                                  Python   C
Fast to write                                                     Yes      No
Good for embedded systems, device drivers and operating systems   No       Yes
Good for most other high-level tasks                              Yes      No
Standard library                                                  Vast     Limited
                                   Python   Java
Powerful, well-designed language   Yes      Yes
Standard libraries                 Vast     Vast
Easy to learn                      Yes      No
Code brevity                       Short    Verbose
Easy to write and maintain         Yes      Okay
Open source
Python is open source software
Benefits:
No vendor lock-in
Cross-platform
Insurance against bugs in the platform
Free
Python success stories
Computer graphics:
Industrial Light & Magic
Web:
Google: News, Groups, Maps, Gmail
Legacy system integration:
AstraZeneca - collaborative drug discovery
Python success stories (2)
Aerospace:
NASA
Research:
universities worldwide ...
Others:
YouTube, Reddit, BitTorrent, Civilization IV,
Industrial Light & Magic
Python spread from scripting to the entire production pipeline
Numerous reviews since 1996: Python is still the best tool for them
United Space Alliance
A common sentiment:
“We achieve immediate functioning code so much faster in Python than in any other language that it’s staggering.”
- Robin Friedrich, Senior Project Engineer
Case study: air-traffic control
Eric Newton, “Python for Critical Applications”: http://metaslash.com/brochure/recall.html
Metaslash, Inc: 1999 to 2001
Mission-critical system for air-traffic control
Replicated, fault-tolerant data storage
Case study: air-traffic control
Python prototype -> C++ implementation -> Python again
Why?
C++ dependencies were buggy
C++ threads, STL were not portable enough
Python’s advantages over C++
More portable
75% less code: more productivity, fewer bugs
More case studies
See http://www.python.org/about/success/ for lots more case studies and success stories
2. The scientific Python ecosystem
Scientific software development
Small beginnings
Piecemeal growth, quirky interfaces
... Large, cumbersome systems
NumPy: An n-dimensional array/matrix package
NumPy: Centre of Python’s numerical computing ecosystem
NumPy
The most fundamental tool for numerical computing in Python
Fast multi-dimensional array capability
What NumPy defines:
Two fundamental objects:
1. n-dimensional array
2. universal function
a rich set of numerical data types
nearly 400 functions and methods on arrays:
type conversions
mathematical
logical
NumPy's features
Fast. Written in C with BLAS/LAPACK hooks.
Rich set of data types
Linear algebra: matrix inversion, decompositions, …
Discrete Fourier transforms
Random number generation
Trig, hypergeometric functions, etc.
Elementwise array operations
Loops are mostly unnecessary
Operate on entire arrays!
>>> import numpy
>>> a = numpy.array([20, 30, 40, 50])
>>> a < 35
array([True, True, False, False], dtype=bool)
>>> b = numpy.arange(4)
>>> a - b
array([20, 29, 38, 47])
>>> b**2
array([0, 1, 4, 9])
Universal functions
NumPy defines 'ufuncs' that operate on entire arrays and other sequences (hence 'universal')
Example: sin()
>>> a = numpy.array([20, 30, 40, 50])
>>> c = 10 * numpy.sin(a)
>>> c
array([ 9.12945251, -9.88031624, 7.4511316 , -2.62374854])
Array slicing
Arrays can be sliced and indexed powerfully:
>>> a = numpy.arange(10)**3
>>> a
array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729])
>>> a[2:5]
array([ 8, 27, 64])
Fancy indexing
Arrays can be used as indices into other arrays:
>>> a = numpy.arange(12)**2
>>> ind = numpy.array([ 1, 1, 3, 8, 5 ])
>>> a[ind]
array([ 1, 1, 9, 64, 25])
Other linear algebra features
Matrix inversion: mat(A).I
Or: linalg.inv(A)
Linear solvers: linalg.solve(A, b) solves Ax = b for x
Pseudoinverse: linalg.pinv(A)
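A minimal sketch of the three numpy.linalg calls listed above, on a small 2x2 system. The matrix and right-hand side are arbitrary illustrative values, not from the talk.

```python
import numpy

# A small, well-conditioned system A x = b (values chosen for illustration).
A = numpy.array([[4.0, 1.0],
                 [2.0, 3.0]])
b = numpy.array([1.0, 2.0])

A_inv = numpy.linalg.inv(A)     # explicit inverse
x = numpy.linalg.solve(A, b)    # solves A x = b without forming the inverse
A_pinv = numpy.linalg.pinv(A)   # pseudoinverse; matches the inverse for this A

print(x)
print(numpy.allclose(numpy.dot(A, x), b))
print(numpy.allclose(A_inv, A_pinv))
```

For solving a single system, linalg.solve is usually the better choice than multiplying by an explicit inverse, both for speed and for numerical accuracy.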
What is SciPy?
A community
A conference
A package of scientific libraries
Python for scientific software
Back-end: computational work
Front-end: input / output, visualization, GUIs
Dozens of great scientific packages exist
Python in science (2)
NumPy: numerical / array module
Matplotlib: great 2D and 3D plotting library
IPython: nice interactive Python shell
SciPy: set of scientific libraries: sparse matrices, signal processing, …
RPy: integration with the R statistical environment
Python in science (3)
Cython: C language extensions
Mayavi: 3D graphics, volumetric rendering
Nitime, Nipype: Python tools for neuroimaging
SymPy: symbolic mathematics library
Python in science (4)
VPython: easy, real-time 3D programming
UCSF Chimera, PyMOL, VMD: molecular graphics
PyRAF: Hubble Space Telescope interface to IRAF astronomical data
BioPython: computational molecular biology
Natural language toolkit: symbolic + statistical NLP
Physics: PyROOT
The SciPy package: BSD-licensed software for maths, science, engineering
integration, signal processing, sparse matrices,
optimization, linear algebra, maximum entropy,
interpolation, ODEs, statistics,
FFTs, n-dim image processing, scientific constants,
clustering, interpolation, C/C++ and Fortran integration
SciPy optimisation example
Fit a model to noisy data: $y = \frac{a}{x^b} \sin(cx) + \varepsilon$
Example: fitting a model with scipy.optimize
Task: Fit a model of the form $y = \frac{a}{x^b} \sin(cx) + \varepsilon$ to noisy data.
Spec:
1. Generate noisy data
2. Choose parameters (a, b, c) to minimize the sum of squared errors
3. Plot the data and fitted model (next session)
SciPy optimisation example
import numpy
import pylab
from scipy.optimize import leastsq

def myfunc(params, x):
    (a, b, c) = params
    return a / (x**b) * numpy.sin(c * x)

true_params = [1.5, 0.1, 2.]

def f(x):
    return myfunc(true_params, x)

def err(params, x, y):  # error function: residuals between model and data
    return myfunc(params, x) - y

SciPy optimisation example
# Generate noisy data to fit
n = 30; xmin = 0.1; xmax = 5
x = numpy.linspace(xmin, xmax, n)
y = f(x)
# Uniform noise scaled to 20% of the range of y
y += numpy.random.rand(len(x)) * 0.2 * (y.max() - y.min())

v0 = [3., 1., 4.]  # initial param estimate

# Fitting
v, success = leastsq(err, v0, args=(x, y), maxfev=10000)

print('Estimated parameters: ', v)
print('True parameters: ', true_params)
X = numpy.linspace(xmin, xmax, 5 * n)
pylab.plot(x, y, 'ro', X, myfunc(v, X))
pylab.show()
SciPy optimisation example
Fit a model to noisy data: $y = \frac{a}{x^b} \sin(cx) + \varepsilon$
Ingredients for this example
numpy.linspace
numpy.random.rand for the noise model (uniform)
scipy.optimize.leastsq
Sparse matrix example: Construct and solve a sparse linear system
Sparse matrices
Sparse matrices are mostly zeros.
They can be symmetric or asymmetric.
Sparsity patterns vary:
block sparse, band matrices, ...
They can be huge!
Only non-zeros are stored.
Sparse matrices in SciPy
SciPy supports seven sparse storage schemes
... and sparse solvers in Fortran.
Sparse matrix creation
To construct a 1000x1000 lil_matrix and add values:
>>> from scipy.sparse import lil_matrix
>>> from numpy.random import rand
>>> from scipy.sparse.linalg import spsolve
>>> A = lil_matrix((1000, 1000))
>>> A[0, :100] = rand(100)
>>> A[1, 100:200] = A[0, :100]
>>> A.setdiag(rand(1000))
Solving sparse matrix systems
Now convert the matrix to CSR format and solve Ax=b:
>>> A = A.tocsr()
>>> b = rand(1000)
>>> x = spsolve(A, b)
# Convert it to a dense matrix and solve, and check that the result is the same:
>>> from numpy.linalg import solve, norm
>>> x_ = solve(A.todense(), b)
# Compute norm of the error:
>>> err = norm(x - x_)
>>> err < 1e-10
True
Matplotlib
Great plotting package in Python
Matlab-like syntax
Great rendering: anti-aliasing etc.
Many ‘backends’: Cairo, GTK, Cocoa, PDF
Flexible output: to EPS, PS, PDF, TIFF, PNG, ...
Matplotlib: worked examples
Search the web for 'Matplotlib gallery'
Example: NumPy vectorization
1. Use a Monte Carlo algorithm to estimate π:
   1. Generate uniform random variates (x, y) over [0, 1].
   2. Estimate π from the proportion p that land in the unit circle (π ≈ 4p).
2. Time two ways of doing this:
   1. Using for loops
   2. Using array operations (vectorized)
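One way the exercise above might be sketched. The function names, the seeds and the sample size are assumptions made for this illustration, and timings will vary by machine.

```python
import random
import timeit

import numpy

def estimate_pi_loop(n, seed=0):
    """Estimate pi with an explicit Python for loop."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / n

def estimate_pi_vectorized(n, seed=0):
    """Estimate pi with NumPy array operations and no explicit loop."""
    rng = numpy.random.RandomState(seed)
    x = rng.rand(n)
    y = rng.rand(n)
    inside = numpy.count_nonzero(x * x + y * y <= 1.0)
    return 4.0 * inside / n

if __name__ == "__main__":
    n = 200000
    for func in (estimate_pi_loop, estimate_pi_vectorized):
        seconds = timeit.timeit(lambda: func(n), number=3) / 3
        print(func.__name__, func(n), "%.4f s per call" % seconds)
```

Both functions implement the same estimator; only the way the samples are generated and counted differs.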
3. Scaling
HPC: High-performance computing
Aspects to HPC
Supercomputers
Distributed clusters / grids
Parallel programming
Scripting
Caches, shared memory
Job control
Code porting
Specialized hardware
Python for HPC
Advantages                       Disadvantages
Portability                      Global interpreter lock
Easy scripting, glue             Less control than C
Maintainability                  Native loops are slow
Profiling to identify hotspots
Vectorization with NumPy
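A minimal sketch of profiling to find a hotspot, using the standard library's cProfile and pstats modules. The workload functions are invented for this illustration.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately naive hotspot: a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def cheap_setup():
    """A function that does very little work."""
    return list(range(10))

def workload():
    cheap_setup()
    return slow_sum_of_squares(200000)

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

In a report like this, the loop in slow_sum_of_squares is the kind of hotspot that would be a candidate for vectorization with NumPy.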
Large data sets
Useful Python language features:
Generators, iterators
Useful packages:
Great HDF5 support from PyTables!
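A sketch of how generators help with large data sets: values are parsed and consumed one at a time, so the whole data set never has to fit in memory. The function names and the stand-in "file" are assumptions for this example.

```python
def read_values(lines):
    """Lazily parse one float per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)

def running_mean(values):
    """Consume any iterable once, keeping only a count and a total in memory."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    return total / count if count else float("nan")

# Stand-in for a large file: any iterable of lines works, including an open file.
fake_file = ("%d\n" % i for i in range(1, 1000001))
print(running_mean(read_values(fake_file)))
```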
Hierarchical data: Databases without the relational baggage
Great interface for HDF5 data: Efficient support for massive data sets
Applications of PyTables
aeronautics, telecommunications,
drug discovery, data mining,
financial analysis, statistical analysis,
climate prediction, etc.
Breaking news: June 2011
PyTables Pro is now being open sourced.
Indexed searches for speed
Merging with PyTables
Working project name: NewPyTables
PyTables performance
OPSI indexing engine speed:
Querying 10 billion rows can take hundredths of a second!
Target use-case:
mostly read-only or append-only data
Principles for efficient code
Important principles
1. "Premature optimization is the root of all evil"
Don't write cryptic code just to make it more efficient!
2. 1-5% of the code takes up the vast majority of the computing time!
... and it might not be the 1-5% that you think!
Checklist for efficient code
From most to least important:
1. Check: Do you really need to make it more efficient?
2. Check: Are you using the right algorithms and data structures?
3. Check: Are you reusing pre-written libraries wherever possible?
4. Check: Which parts of the code are expensive? Measure, don't guess!
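For the last item, a hedged sketch of measuring rather than guessing, using the standard library's timeit module. The two string-building functions are arbitrary candidates invented for this comparison.

```python
import timeit

def join_with_plus(n):
    """Build a string by repeated concatenation."""
    text = ""
    for i in range(n):
        text += str(i)
    return text

def join_with_join(n):
    """Build the same string with str.join."""
    return "".join(str(i) for i in range(n))

def measure(func, n=10000, repeat=5):
    """Return the best of `repeat` timings, in seconds, for one call of func(n)."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))

if __name__ == "__main__":
    for func in (join_with_plus, join_with_join):
        print("%-16s %.6f s" % (func.__name__, measure(func)))
```

Which variant wins, and by how much, depends on the interpreter and the input size, which is the point of measuring.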
Relative efficiency gains
Exponential-order and polynomial-order speedups are possible by choosing the right algorithm for a task.
These require the right data structures!
These dwarf 10-25x linear-order speedups from:
using lower-level languages
using different language constructs.
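An illustrative sketch of a speedup that comes from the data structure rather than the language: finding common items with a list lookup versus a set lookup. The function names and input sizes are assumptions for this example.

```python
import timeit

def common_items_list(a, b):
    """O(len(a) * len(b)): each membership test scans the list b."""
    return [item for item in a if item in b]

def common_items_set(a, b):
    """O(len(a) + len(b)): set membership tests take constant time on average."""
    lookup = set(b)
    return [item for item in a if item in lookup]

if __name__ == "__main__":
    a = list(range(0, 4000, 2))
    b = list(range(0, 4000, 3))
    for func in (common_items_list, common_items_set):
        seconds = timeit.timeit(lambda: func(a, b), number=5) / 5
        print("%-18s %.5f s per call" % (func.__name__, seconds))
```

The gap between the two grows with the input size, unlike a fixed constant-factor gain from switching language.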
4. About Python Charmers
The largest Python training provider in South-East Asia
Delighted customers include:
Most popular course topics
Python for Programmers                  3 days
Python for Scientists and Engineers     4 days
Python for Geoscientists                4 days
Python for Bioinformaticians            4 days
Python for Financial Engineers          4 days
Python for IT Security Professionals    3 days
New courses:
Python Charmers: Topics of expertise
Python: beginners, advanced
Scientific data processing with Python
Software engineering with Python
Large-scale problems: HPC, huge data sets, grids
Statistics and Monte Carlo problems
Python Charmers: Topics of expertise (2)
Spatial data analysis / GIS
General scripting, job control, glue
GUIs with PyQt
Integrating with other languages: R, C, C++, Fortran, ...
Web development in Django