scientific world in python


Page 1: Scientific world in python

Scientific World in Python

A brief introduction to the SciPy stack

Jiangwei Guo

Data Management Center

January 19, 2017

Jiangwei Guo (Data Management Center) Scientific World in Python January 19, 2017 1 / 32

Page 2: Scientific world in python

Outline

1 SciPy movement

2 Core SciPy libs for ML: NumPy, SciPy, pandas, scikit-learn

3 bonus


Page 3: Scientific world in python

why Python?

hot and strong, widely used in industries

simple and elegant, easy to learn

whole ecosystem and active communities

glue language, standard ML-API language

well suited to, and widely used for, prototyping


Page 4: Scientific world in python

Python’s next steps

“One thing I want to point out are the SciPy and NumPy movements. Those people are introducing Python as a replacement for MatLab. It’s open source, it’s better, they can change it. They are taking it to places where I had never expected Python would travel. They have things like the Jupyter Notebooks that show interactive Python in the browser. There is a lot of incredibly cool work that is happening in that area.”

– Guido van Rossum


Page 5: Scientific world in python

SciPy stack

The SciPy stack is a collection of open source software for scientific computing in Python.

It implements and extends MATLAB-style functionality in Python, an approach later followed by Spark MLlib.

One of the most active Python communities, sponsored by NumFOCUS.


Page 6: Scientific world in python

SciPy stack - continued

NumPy: the fundamental package for numerical computation.

SciPy: a collection of numerical algorithms and domain-specific toolboxes.

Matplotlib: a mature and popular plotting package.

pandas: high-performance, easy-to-use data structures.

Scikits: extra packages for more specific functionality, such as scikit-image, scikit-learn, etc.

IPython: a rich interactive interface; Jupyter adds access through the web browser.


Page 7: Scientific world in python

NumFOCUS as sponsor

The mission of NumFOCUS is to promote sustainable high-level programming languages, open code development, and reproducible scientific research.


Page 8: Scientific world in python

Intro to NumPy

NumPy is short for Numerical Python; the latest released version is 1.12.

The fundamental and standard package for scientific and numerical computing; its key concept is the N-dimensional array.

  - powerful N-dimensional array object
  - a grid of values, all of the same type
  - sophisticated functions and routines
  - indexed by a tuple of nonnegative integers

recommended import style

import numpy as np
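
The array properties listed above can be seen in a minimal sketch. The values are made up for illustration.

```python
import numpy as np

# A 2-D array: a grid of values that all share one dtype.
a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.ndim)   # number of dimensions: 2
print(a.shape)  # length of each dimension: (2, 3)
print(a.dtype)  # one integer dtype shared by every element

# Elements are addressed by a tuple of nonnegative integers.
print(a[1, 2])  # 6
```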


Page 9: Scientific world in python

Universal functions (ufunc)

operate on ndarrays in an element-by-element fashion.

broadcasting
  - used throughout NumPy to decide how to handle disparately shaped arrays when performing arithmetic operations
  - arrays are broadcastable if one of the following holds:
      1 the arrays all have exactly the same shape
      2 the arrays have the same number of dimensions, and the length of each dimension is either a common length or 1
      3 arrays that have too few dimensions can have their shapes prepended with a dimension of length 1 to satisfy property 2

type casting
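
A small sketch of broadcasting and type casting, using illustrative arrays that are not part of the original slides:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)     # shape (2, 3), integer dtype
row = np.array([10, 20, 30])       # shape (3,): treated as (1, 3), then stretched
col = np.array([[100], [200]])     # shape (2, 1): stretched along the last axis

print(x + row)   # [[10 21 32] [13 24 35]]
print(x + col)   # [[100 101 102] [203 204 205]]

# Type casting: mixing integers with floats promotes the result to float.
y = x + 0.5
print(y.dtype)   # float64

# Shapes that violate the rules raise an error.
try:
    x + np.arange(4)
except ValueError as err:
    print("not broadcastable:", err)
```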


Page 10: Scientific world in python

Slicing and Indexing

1 slice object (start:stop:step notation inside of brackets), an integer, or a tuple of slice objects and integers

Example

a = np.arange(30).reshape(10, -1); a[9]; a[2:7:2]; a[:, 1]

2 Integer array indexing

Example

a = np.arange(6).reshape(3, -1); a[[0, 1, 2], [0, 1, 0]]

3 Boolean array indexing

Example

a = np.arange(12).reshape(4, -1); a[a > 2]
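
The three indexing styles can be compared side by side on one array. This is a sketch with an illustrative 4 x 3 array, chosen so that the reshape is valid:

```python
import numpy as np

a = np.arange(12).reshape(4, -1)   # shape (4, 3): rows [0 1 2], [3 4 5], ...

# 1. Basic slicing with integers and start:stop:step slices.
print(a[3])        # last row: [ 9 10 11]
print(a[0:4:2])    # rows 0 and 2
print(a[:, 1])     # second column: [ 1  4  7 10]

# 2. Integer array indexing pairs up row and column indices.
print(a[[0, 1, 2], [0, 1, 0]])   # elements (0,0), (1,1), (2,0): [0 4 6]

# 3. Boolean array indexing keeps elements where the mask is True.
print(a[a > 8])    # [ 9 10 11]
```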


Page 11: Scientific world in python

Routines

array manipulation
  - change dimensions, reshape or flatten
  - transpose and transpose-like
  - join, concatenate, split, etc.

element-level mathematical routine

Example

a = np.arange(12).reshape(3, -1)

np.sum(a); np.prod(a, axis=0); np.cumsum(a, axis=0)

np.log10(a)

ndarray-level mathematical routine

Example

x + y; x - y; x * y; x / y; np.add(x, y) (element-wise)

np.subtract(x, y), np.multiply(x, y), np.divide(x, y)

x.dot(y); np.dot(x, y) (matrix product)
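
A runnable sketch of the routines on this slide. The arrays x and y are not defined in the original, so small illustrative ones are assumed here.

```python
import numpy as np

a = np.arange(1, 7).reshape(2, 3)   # [[1 2 3] [4 5 6]]

# Array manipulation
print(a.T.shape)                              # (3, 2)
print(a.ravel())                              # flattened: [1 2 3 4 5 6]
print(np.concatenate([a, a], axis=0).shape)   # (4, 3)

# Reductions
print(np.sum(a))             # 21
print(np.prod(a, axis=0))    # [ 4 10 18]
print(np.cumsum(a, axis=0))  # [[1 2 3] [5 7 9]]

# Element-wise product versus matrix product.
x = np.array([[1, 2], [3, 4]])
y = np.array([[0, 1], [1, 0]])
print(x * y)         # [[0 2] [3 0]]
print(np.dot(x, y))  # [[2 1] [4 3]]
```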


Page 12: Scientific world in python

Intro to SciPy

SciPy is short for Scientific Python; the latest released version is 0.18.1.

SciPy is a collection of mathematical algorithms and convenience functions built on the NumPy extension of Python.

SciPy is rich in high-level numerical routines and contains various toolboxes dedicated to common issues in scientific computing.

With SciPy an interactive Python session becomes a data-processing and system-prototyping environment rivaling systems such as MATLAB, IDL, Octave, R-Lab, and SciLab.
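
As one illustration of the toolboxes, the sketch below uses scipy.integrate and scipy.optimize. The integrand and the objective function are arbitrary choices made for this example.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) over [0, pi] is 2.
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)
print(area)

# Scalar minimization: (x - 2)^2 has its minimum at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
print(result.x)
```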


Page 13: Scientific world in python

difference between NumPy and SciPy

“... NumPy is meant to be a library for numerical arrays, to be used by anybody needing such an object in Python. SciPy is meant to be a library for scientists/engineers, so it aims for more rigorous theoretical mathematics.”

“In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic element-wise functions, et cetera. All numerical code would reside in SciPy.”

“NumPy contains some linear algebra functions, even though these more properly belong in SciPy. In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms.”

“... all of the Numpy functions have been subsumed into the scipy namespace so that all of those functions are available without additionally importing Numpy, the scipy __init__ method execute a from numpy import *.”


Page 14: Scientific world in python

Sparse matrices (scipy.sparse)

Sparse matrices can be used in arithmetic operations: they support addition, subtraction, multiplication, division, and matrix power.

Advantages of the CSR format
  - efficient arithmetic operations CSR + CSR, CSR * CSR, etc.
  - efficient row slicing
  - fast matrix vector products

Disadvantages of the CSR format
  - slow column slicing operations (consider CSC)
  - changes to the sparsity structure are expensive (consider LIL or DOK)
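
The trade-offs between formats can be sketched as follows. The matrix values are illustrative only.

```python
import numpy as np
from scipy import sparse

dense = np.array([[1, 0, 0],
                  [0, 0, 2],
                  [0, 3, 0]])

A = sparse.csr_matrix(dense)      # CSR: efficient arithmetic and row slicing
print(A.nnz)                      # number of stored entries: 3

B = A + A                         # CSR + CSR
C = A.dot(A)                      # sparse matrix product
v = A.dot(np.array([1, 1, 1]))    # matrix-vector product: [1 2 3]
row = A[1]                        # row slicing is cheap in CSR

# Build incrementally in LIL, then convert to CSR for computation.
L = sparse.lil_matrix((3, 3))
L[0, 2] = 5
L_csr = L.tocsr()

col = A.tocsc()[:, 1]             # column slicing is better served by CSC
```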


Page 15: Scientific world in python

Intro to pandas

The name pandas derives from "panel data"; the latest stable release is 0.19.2.

pandas provides fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive.

pandas is not a reimplementation or extension of NumPy/SciPy; it manipulates data in its own, label-oriented way.

Main data structures include Series, DataFrame (much like data.frame in R), Panel, etc.

recommended import style

import pandas as pd
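
A minimal sketch of the two most common structures, with made-up labels and values:

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# A DataFrame is a two-dimensional table with labeled rows and columns.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["x", "y", "z"]},
    index=["a", "b", "c"],
)

print(s["b"])          # 2.0
print(df.shape)        # (3, 2)
print(df.dtypes)       # one dtype per column
print(df["A"].values)  # the underlying NumPy array of column A
```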


Page 16: Scientific world in python

advantages of using pandas over NumPy

pandas is well suited for tabular data, time series data, and arbitrary matrix data with row and column labels.

skilled in row- and column-oriented operations, especially with labels
  - columns or rows can be inserted into and deleted from a DataFrame
  - automatic and explicit data alignment; objects can be explicitly aligned to a set of labels
  - intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  - intuitive merging and joining of data sets

pandas supports many statistical methods for variable analysis, such as group-by and pivoting

easy handling of missing data (represented as NaN)

seamless integration with python data structures and NumPy

robust IO tools for loading and dumping data
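
Automatic label alignment and missing-data handling, two of the points above, can be sketched with two small illustrative Series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns on labels; labels present on only one side give NaN.
total = s1 + s2
print(total)

# Missing values are easy to detect and fill.
print(total.isna().sum())      # 2
print(total.fillna(0))
```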


Page 17: Scientific world in python

indexing and slicing

standard Python/NumPy expressions for selecting and setting are intuitive and come in handy

attribute access

the optimized pandas data access methods are recommended: loc/at for label-based access and iloc/iat for position-based access; loc and iloc select blocks, at and iat select scalars

boolean indexing

Example

df[0:2]; df[::-1]; df[:3]; df['A']; df.head()

df.A

df.loc['a':'c', ['A', 'D']]; df.at['a', 'A']

df.iloc[::-1, 2:4]; df.iat[0, 0]

df[df.A > 0]; df[df['B'].isin(['x', 'z'])]
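
The access methods can be tried on a small frame. The frame below, with row labels a to d and column labels A to D, is an assumption made for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=list("abcd"),
    columns=list("ABCD"),
)

print(df[0:2])                         # first two rows (positional row slice)
print(df["A"])                         # column selection; same as df.A
print(df.loc["a":"c", ["A", "D"]])     # label-based block; the end label is included
print(df.at["a", "A"])                 # label-based scalar: 0
print(df.iloc[::-1, 2:4])              # position-based block; the end is excluded
print(df.iat[0, 0])                    # position-based scalar: 0
print(df[df.A > 4])                    # boolean indexing: rows c and d
print(df[df["B"].isin([1, 13])])       # rows a and d
```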


Page 18: Scientific world in python

interactive data analysis, as in R

Example

df = pd.read_csv('path/data.csv')

df.head(); df.shape; df.dtypes

df.describe(); df.mean()

df.dropna(how='any'); df.dropna(axis=1) (axis defaults to 0)

df.fillna(value=5); df['d'].fillna(3); df.fillna(df.mean())

df['a'] = df['a'].apply(np.log10)
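
A runnable version of this workflow, as a sketch: a small in-memory frame with made-up values stands in for the CSV file, whose contents are not given in the slides.

```python
import numpy as np
import pandas as pd

# A small in-memory frame stands in for pd.read_csv('path/data.csv').
df = pd.DataFrame({
    "a": [1.0, 10.0, 100.0, np.nan],
    "d": [np.nan, 2.0, 3.0, 4.0],
})

print(df.shape)                 # (4, 2)
print(df.describe())            # summary statistics per column

rows_kept = df.dropna(how="any")    # drop rows containing any NaN (axis=0)
cols_kept = df.dropna(axis=1)       # drop columns containing any NaN
filled = df.fillna(df.mean())       # fill each column with its own mean

# Apply a NumPy function to one column.
filled["a"] = filled["a"].apply(np.log10)
print(filled)
```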


Page 19: Scientific world in python

interactive data analysis, as in R - continued

Example

df.groupby('a').sum()

df.groupby(['a', 'b']).mean()

pd.pivot_table(df, values='d', index=['a', 'b'], columns=['c'])
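
The group-by and pivot calls can be sketched on a tiny frame. The column names a, b, c, d follow the slide; the data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["x", "x", "y", "y"],
    "b": ["p", "q", "p", "q"],
    "c": ["u", "u", "v", "v"],
    "d": [1, 2, 3, 4],
})

# Split by key, apply an aggregation, combine the results.
by_a = df.groupby("a")["d"].sum()            # x -> 3, y -> 7
by_ab = df.groupby(["a", "b"])["d"].mean()   # one value per (a, b) pair
print(by_a)
print(by_ab)

# Pivot: (a, b) pairs become rows, values of c become columns.
table = pd.pivot_table(df, values="d", index=["a", "b"], columns=["c"])
print(table)
```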


Page 20: Scientific world in python

merge dataframes

pd.concat()

Example

assume df1.shape = (3, 4), df2.shape = (3, 4)

pd.concat([df1, df2]), shape of result is (6, 4)

pd.concat([df1, df2], axis=1), shape of result is (3, 8)

df.append(), appends rows to a DataFrame, aligning on column labels by default (deprecated in pandas 1.4 and removed in 2.0 in favor of pd.concat)

pd.merge(left, right, how='inner', on=None, ...), SQL-style merges
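
A sketch of concatenation and SQL-style merging. The frames and the key column name are hypothetical.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=list("ABCD"))
df2 = pd.DataFrame(np.ones((3, 4)), columns=list("ABCD"))

print(pd.concat([df1, df2]).shape)           # (6, 4): rows stacked
print(pd.concat([df1, df2], axis=1).shape)   # (3, 8): columns side by side

left = pd.DataFrame({"key": ["k1", "k2", "k3"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["k2", "k3", "k4"], "r": [20, 30, 40]})

inner = pd.merge(left, right, how="inner", on="key")   # keys in both: k2, k3
outer = pd.merge(left, right, how="outer", on="key")   # all keys, NaN where missing
print(inner)
print(outer)
```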


Page 21: Scientific world in python

Intro to scikit-learn

The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from SciKit (SciPy Toolkit), a separately developed and distributed third-party extension to SciPy. The latest stable version is 0.18.1.

sklearn can be used in two typical ways
  - interactive use in an interactive interpreter, enhanced by IPython
  - classes or functions imported into Python projects

Machine Learning in Python
  - Simple and efficient tools for data mining and data analysis
  - Accessible to everybody, and reusable in various contexts
  - Built on NumPy, SciPy, and matplotlib
  - Open source, commercially usable - BSD license

recommended import style

from sklearn.xxx import xxx


Page 22: Scientific world in python

overview of sklearn and ML

dataset transformation
  - feature extraction, efficient for text encoding
  - preprocessing data, such as normalization and encoding categorical variables
  - pipeline and feature union: every step before the last must implement the transform interface; the last step determines what the whole pipeline can do

modeling
  - almost all ML algorithms
      - supervised, unsupervised, semi-supervised
      - regression, classification, clustering, dimensionality reduction, label propagation
  - feature selection
  - ensemble, bagging and boosting (bias-variance tradeoff)
  - scikit-learn wrapper interface for third-party ML libs

“The selected features decide the limits of the model; the different algorithms just approach those limits of performance.”

– nobody
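
The pipeline point above can be sketched as follows. The choice of StandardScaler and LogisticRegression on the bundled iris data is an assumption for illustration, not something the slides prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Every step except the last must implement transform;
# the last step (here a classifier) decides what the pipeline can do.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X, y)
print(pipe.predict(X[:3]))
print(pipe.score(X, y))    # mean accuracy on the training data
```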


Page 23: Scientific world in python

overview of sklearn and ML - continued

model evaluation and parameter tuning
  - compulsory implementation of score for predictors
  - cross-validation for evaluating estimator performance, such as k-fold, leave-one-out, etc.
  - standard model evaluation functions, such as roc_curve and auc for classifiers
  - tuning the hyper-parameters of an estimator, such as GridSearchCV, RandomizedSearchCV, etc.

model persistence with pickle/cPickle

datasets and examples
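
Cross-validation, grid search, and persistence fit together roughly as in this sketch. The SVC estimator and the parameter grid are hypothetical choices for the example.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores.mean())

# Exhaustive search over a small, hypothetical hyper-parameter grid.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Persist the tuned model and load it back.
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)
print(restored.score(X, y))
```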


Page 24: Scientific world in python

naming policy for scikit-learn

classes
  - CamelCase naming

functions
  - lower case words joined by underscores
  - fit, predict, score, transform, apply, predict_proba, get_params, set_params, etc.

parameters
  - lower case words joined by underscores

learned attributes
  - lower case words joined by underscores, with a trailing underscore (_)

data sets
  - X, y, X_train, X_test
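
The naming conventions show up even in a tiny example. LinearRegression and the toy data are picked only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression   # class: CamelCase

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # upper-case X for the sample matrix
y = np.array([1.0, 3.0, 5.0, 7.0])           # lower-case y for the targets

model = LinearRegression(fit_intercept=True)  # parameter: lower case with underscores
model.fit(X, y)                               # method: lower case

# Learned attributes carry a trailing underscore.
print(model.coef_)        # close to [2.]
print(model.intercept_)   # close to 1.0

print(model.get_params()["fit_intercept"])    # True
```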


Page 25: Scientific world in python

API design philosophy - general principles

1 consistency, all objects share a consistent interface composed of a limited set of methods

2 inspection, constructor parameters and parameter values determined by learning algorithms are stored and exposed as public attributes

3 non-proliferation of classes, learning algorithms are the only objects to be represented using custom classes

4 composition, meta transformers/estimators and ensemble functions

5 sensible defaults, appropriate default values for user-defined parameters are defined as much as possible


Page 26: Scientific world in python

API design philosophy - data representation

as close as possible to the matrix representation, NumPy for dense data and SciPy for sparse data

for efficiency reasons, the public interface is oriented towards processing batches of samples rather than single samples per API call.


Page 27: Scientific world in python

API design philosophy - core interface

estimators
  - define instantiation mechanisms of algorithm objects
  - expose a fit method for learning a model from training data
  - estimator initialization and actual learning are strictly separated

predictors
  - extend the notion of an estimator by adding a predict method that produces predictions for X_test
  - classifiers also provide a predict_proba method, which returns class probabilities
  - predictors also provide a score function to assess the estimator's performance on a batch of input data

transformers
  - modify or filter data before feeding it to a learning algorithm
  - some estimators also implement the transformer interface
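
The three interfaces can be seen in one short sketch. KNeighborsClassifier and PCA are arbitrary examples of a predictor and a transformer, chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Estimator: construction only stores hyper-parameters; fit does the learning.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predictor: predict, predict_proba (classifiers), and score.
labels = clf.predict(X_test)
proba = clf.predict_proba(X_test)     # shape (n_samples, n_classes)
accuracy = clf.score(X_test, y_test)

# Transformer: fit learns the projection, transform applies it.
pca = PCA(n_components=2)
X_2d = pca.fit(X_train).transform(X_test)
print(labels.shape, proba.shape, accuracy, X_2d.shape)
```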


Page 28: Scientific world in python

introduction to matplotlib

matplotlib is a Python 2D plotting library, seamlessly integrated with the SciPy stack

pyplot module provides a MATLAB-like interface

matplotlib can be used in Python scripts, the Python and IPython shell, the Jupyter notebook

R's plot library ggplot2 is an implementation of Leland Wilkinson's Grammar of Graphics: a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers
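
A minimal matplotlib sketch for use in a script. The Agg backend and the output file name are assumptions made so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()

fig.savefig("sine_cosine.png")   # hypothetical output file name
```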


Page 29: Scientific world in python

introduction to IPython

IPython is an enhanced interactive Python shell that has lots of interesting features including named inputs and outputs, access to shell commands, improved debugging and many more.

Jupyter Notebook App (formerly IPython Notebook) is an application running inside the browser.


Page 30: Scientific world in python

introduction to Anaconda

Anaconda is an easy-to-install free package manager, environment manager, Python distribution, and collection of over 720 open source packages offering free community support.

Anaconda is the recommended Python distribution for scientific projects.


Page 31: Scientific world in python

other scientific libs

TensorFlow: an open-source software library for Machine Intelligence

gensim: topic modelling for humans

NetworkX: high-productivity software for complex networks

NLTK: Natural Language Toolkit

XGBoost: eXtreme Gradient Boosting

scikit-learn-contrib: scikit-learn compatible projects, such as imbalanced-learn, etc.


Page 32: Scientific world in python

REFERENCES

1 http://www.scipy-lectures.org/index.html

2 http://scikit-learn.org/stable/index.html

3 https://scipy.org/docs.html

4 https://www.tensorflow.org/

5 https://radimrehurek.com/gensim/

6 https://networkx.github.io/

7 http://www.nltk.org/

8 https://github.com/dmlc/xgboost

9 https://github.com/scikit-learn-contrib

10 API design for machine learning software: experiences from the scikit-learn project (2013)
