stats 700-002 data analysis using pythonpie chart; the only worse design than a pie chart is several...

47
STATS 700-002 Data Analysis using Python Lecture 5: numpy and matplotlib Some examples adapted from A. Tewari

Upload: others

Post on 05-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

STATS 700-002Data Analysis using Python

Lecture 5: numpy and matplotlibSome examples adapted from A. Tewari

Page 2: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Reminder!If you don’t already have a Flux/Fladoop username, request one promptly!

Make sure you have a way to ssh to the Flux clusterUNIX/Linux/MacOS: you’re all set!Windows:

install PuTTY:https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html

and you may also want cygwin https://www.cygwin.com/

You also probably want to set up VPN to access Flux from off-campus:http://its.umich.edu/enterprise/wifi-networks/vpn

Page 3: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Numerical computing in Python: numpyOne of a few increasingly-popular, free competitors to MATLAB

Numpy quickstart guide: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

For MATLAB fans:https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html

Closely related package scipy is for optimizationSee https://docs.scipy.org/doc/

Page 4: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

numpy data types:Five basic numerical data types:

boolean (bool)integer (int)unsigned integer (uint)floating point (float)complex (complex)

Many more complicated data types are availablee.g., each of the numerical types can vary in how many bits it useshttps://docs.scipy.org/doc/numpy/user/basics.types.html

Page 5: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

numpy.array: numpy’s version of Python array (i.e., list)Can be created from a Python list…

...by “shaping” an array…

...by “ranges”...

...or reading directly from a filesee https://docs.scipy.org/doc/numpy/user/basics.creation.html

Page 6: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

numpy allows arrays of arbitrary dimension (tensors)1-dimensional arrays:

2-dimensional arrays (matrices):

3-dimensional arrays (“3-tensor”):

Page 7: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

More on numpy.arange creationnp.arange(x): array version of Python’s range(x), like [0,1,2,...,x-1]

np.arange(x,y): array version of range(x,y), like [x,x+1,...,y-1]

np.arange(x,y,z): array of elements [x,y) in z-size increments.

Related useful functions, that give better/clearer control of start/endpoints and allow for multidimensional arrays:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.htmlhttps://docs.scipy.org/doc/numpy/reference/generated/numpy.ogrid.htmlhttps://docs.scipy.org/doc/numpy/reference/generated/numpy.mgrid.html

Page 8: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

More on numpy.arange creationnp.arange(x): array version of Python’s range(x), like [0,1,2,...,x-1]

np.arange(x,y): array version of range(x,y), like [x,x+1,...,y-1]

np.arange(x,y,z): array of elements [x,y) in z-size increments.

Page 9: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

numpy array indexing is highly expressive...

Not very relevant to us right now…

...but this will come up again in a few weeks when we cover TensorFlow!

Page 10: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

More array indexingNumpy allows MATLAB/R-like indexing by Booleans

This error is by design, believe it or not! The designers of numpy were concerned about ambiguities in Boolean vector operations, so they split the two operations into two separate methods, x.any() and x.all()

Page 11: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Boolean operations: np.any(), np.all()Analogous to and and or, respectively

axis argument picks which axis along which to perform the Boolean operation. If left unspecified, it treats the array as a single vector.

Setting axis to be the first (i.e., 0-th) axis yields the entrywise behavior we wanted.

Page 12: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Boolean operations: np.logical_and()Numpy also has built-in Boolean vector operations, which are simpler/clearer at the cost of the expressiveness of np.any(), np.all().

This is an example of a numpy “universal function” (ufunc), which we’ll discuss more in a few slides.

Page 13: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Random numbers in numpynp.random contains methods for generating random numbers

Lots more distributions:https://docs.scipy.org/doc/numpy/reference/routines.random.html#distributions

Page 14: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

np.random.choice(): random samples from datanp.random.choice(x,[size,replace,p])

Generates a sample of size elements from the array x, drawn with (replace=True) or without (replace=False) replacement, with element probabilities given by vector p.

Page 15: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

shuffle() vs permutation()np.random.shuffle(x)

randomly permutes entries of x in placeso x itself is changed by this operation!

np.random.permutation(x)returns a random permutation of xand x remains unchanged.

Page 16: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Statistics in numpyNumpy implements all the standard statistics functions you’ve come to expect

Page 17: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Statistics in numpy (cont’d)Numpy deals with NaNs more gracefully than MATLAB/R:

For more basic statistical functions, see: https://docs.scipy.org/doc/numpy-1.8.1/reference/routines.statistics.html

Page 18: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Probability and statistics in scipyAll the distributions you could possibly ever want:

https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributionshttps://docs.scipy.org/doc/scipy/reference/stats.html#multivariate-distributionshttps://docs.scipy.org/doc/scipy/reference/stats.html#discrete-distributions

More statistical functions (moments, kurtosis, statistical tests):https://docs.scipy.org/doc/scipy/reference/stats.html#statistical-functions

Second argument is the name of a distribution in scipy.stats

Kolmogorov-Smirnov test

Page 19: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

numpy/scipy universal functions (ufuncs)From the documentation:

A universal function (or ufunc for short) is a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features. That is, a ufunc is a “vectorized” wrapper for a function that takes a fixed number of scalar inputs and produces a fixed number of scalar outputs.https://docs.scipy.org/doc/numpy/reference/ufuncs.html

So ufuncs are vectorized operations, just like in R and MATLAB

Page 20: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

ufuncs in actionlist comprehensions are great, but they’re not well-suited to numerical computing

Page 21: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Sorting with numpy/scipy

Sorting is along the “last” axis by default. Note contrast with np.any() . To treat the array as a single vector, axis must be set to None.

ASCII rears its head-- capital letters are “earlier” than all lower-case by default.

Original array is unchanged by use of np.sort() , like Python’s built-in sorted()

Page 22: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

A cautionary notenumpy/scipy have a number of similarly-named functions with different behaviors!

Example: np.amax, np.ndarray.max, np.maximum

The best way to avoid these confusions is to1) Read the documentation carefully2) Test your code!

Page 23: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Plotting with matplotlibmatplotlib is a plotting library for use in Python

Similar to R’s ggplot2 and MATLAB’s plotting functions

For MATLAB fans, matplotlib.pyplot implements MATLAB-like plotting:http://matplotlib.org/users/pyplot_tutorial.html

Sample plots with code:http://matplotlib.org/tutorials/introductory/sample_plots.html

Page 24: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Basic plotting: matplotlib.pyplot.plotmatplotlib.pyplot.plot(x,y)

plots y as a function of x.

matplotlib.pyplot(t)sets x-axis to np.arange(len(t))

Page 25: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Basic plotting: matplotlib.pyplot.plot

Jupyter “magic” command to make images appear in-line.

Python ‘_’ is a placeholder, similar to MATLAB ‘~’. Tells Python to treat this like variable assignment, but don’t store result anywhere.

Page 26: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Customizing plotsSecond argument to pyplot.plot specifies line type, line color, and marker type. Specify broader array of colors, line weights, markers, etc., using long-hand arguments.

Page 27: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Customizing plots

Long form of the command on the previous slide. Same plot!

A full list of the long-form arguments available to pyplot.plot are available in the table titled“Here are the available Line2D properties.”:http://matplotlib.org/users/pyplot_tutorial.html

Page 28: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Multiple lines in a single plot

Note: more complicated specification of individual lines can be achieved by adding them to the plot one at a time.

Page 29: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Multiple lines in a single plot: long form

Note: same plot as previous slide, but specifying one line at a time so we could, if we wanted, use more complicated line attributes.

Page 30: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Titles and axis labels

Page 31: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Titles and axis labelsChange font sizes

Page 32: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Legends

pyplot.legend generates legend based on label arguments passed to pyplot.plot . loc=‘best’ tells pyplot to place the legend where it thinks is best.

Can use LaTeX in labels, titles, etc.

Page 33: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Gridlines on/off: plt.grid

Page 34: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Annotating figures

Specify text coordinates and coordinates of the arrowhead using the coordinates of the plot itself. This is pleasantly different from many other plotting packages, which require specifying coordinates in pixels!

Page 35: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Plotting histograms: pyplot.hist()

Page 36: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Plotting histograms: pyplot.hist()

Bin counts. Note that if normed=1 , then these will be proportions between 0 and 1 instead of counts.

Page 37: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Bar plotsbar(x, height, *, align='center', **kwargs)

Full set of available arguments to bar(...) can be found at http://matplotlib.org/api/_as_gen/matplotlib.pyplot.bar.html#matplotlib.pyplot.bar

Horizontal analogue given by barh http://matplotlib.org/api/_as_gen/matplotlib.pyplot.barh.html#matplotlib.pyplot.barh

Page 38: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Tick labels

Can specify what the x-axis tic labels should be by using the tick_label argument to plot functions.

Page 39: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Box & whisker plots

plt.boxplot(x,...) : x is the data. Many more optional arguments are available, most to do with how to compute medians, confidence intervals, whiskers, etc. See http://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html#matplotlib.pyplot.boxplot

Page 40: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Don’t use pie charts!

But if you must…pyplot.pie(x, … )http://matplotlib.org/api/_as_gen/matplotlib.pyplot.pie.html#matplotlib.pyplot.pie

Pie Charts

A table is nearly always better than a dumb pie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial disarray both within and between charts [...] Given their low [information] density and failure to order numbers along a visual dimension, pie charts should never be used.

Edward TufteThe Visual Display of Quantitative Information

Page 41: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Subplotssubplot(nrows, ncols, plot_number)

Shorthand: subplot(XYZ)Makes an X-by-Y plotPicks out the Z-th plotCounting in row-major order

tight_layout() automatically tries to clean things up so that subplots don’t overlap. Without this command in this example, the labels “sqrt” and “logarithmic” overlap with the x-axis tick labels in the first row.

Page 42: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Specifying axis rangesplt.ylim([lower,upper]) sets y-axis limits

plt.xlim([lower,upper]) for x-axis

For-loop goes through all of the subplots and sets their y-axis limits

Page 43: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Nonlinear axesScale the axes with plt.xscale and plt.yscale

Built-in scales:Linear (‘linear’)Log (‘log’)Symmetric log (‘symlog’)Logit (‘logit’)

Can also specify customized scales: https://matplotlib.org/devel/add_new_projection.html#adding-new-scales

Page 44: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Saving imagesplt.savefig(filename) will try to automatically figure out what file type you want based on the file extension.

Can make it explicit using plt.savefig(‘filename’,

format=‘fmt’)

Other options for specifying resolution, padding, etc: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.savefig.html

Page 45: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

AnimationsMatplotlib.animate package generates animations

I won’t require you to make any, but they’re fun to play around with (and they can be a great visualization tool)

The details are a bit tricky, so I recommend starting by looking at some of the example animations here: http://matplotlib.org/api/animation_api.html#examples

Page 46: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Readings (this lecture)Required:

Numpy quickstart tutorial:https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

Pyplot tutorial:http://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py

Recommended:The Visual Display of Quantitative Information by Edward TufteVisual and Statistical Thinking: Displays of Evidence for Making Decisions

by Edward TufteThis is a short book, essentially a reprint of Chapter 2 of the book above.

Pyplot API: http://matplotlib.org/api/pyplot_summary.html

Page 47: STATS 700-002 Data Analysis using Pythonpie chart; the only worse design than a pie chart is several of them, for then the viewer is asked to compare quantities located in spatial

Readings (next lecture)Required:

Introduction to Unix commands: https://kb.iu.edu/d/afskIncludes all the commands we discussed today, and a few more that you don’t need to know well, but are worth being aware of.

Recommended:Survival guide for Unix newbies: http://matt.might.net/articles/basic-unix/

More thorough discussion, including advanced commands like grep

“GNU/Linux Command−Line Tools Summary” by Gareth AndersonComprehensive introduction to the command line and the UNIX/Linux design philosophy in general.http://tldp.org/LDP/GNU-Linux-Tools-Summary/GNU-Linux-Tools-Summary.pdf