CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016


CS 111: Program Design I Lecture 15: Objects, Pandas, Modules

Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016


OBJECTS AND DOT NOTATION


Objects

• (Implicit in Chapter 2, Variables, & 9.5, String Methods, of the book, but not explicit anywhere: So pay attention!)
• Everything in Python is an object
• Object combines
  ◦ data (e.g., number, string, list) with
  ◦ methods that can act on that object


Methods

• Methods: like (special case of) function but not globally accessible
• Cannot call method just by giving its name, the way we call print(), open(), abs(), type(), range(), etc.
• Method: function that can only be accessed through an object
  ◦ Using dot notation


Dot notation

• To call method, use dot notation:
  ◦ object_name.method()
• String example:

>>> test = "This is my test string"
>>> test.upper()
'THIS IS MY TEST STRING'


If o is an object of a type having method do_it, where do_it needs an input in addition to o, and x is defined, what is the proper way to call do_it?
A. do_it(x)
B. do_it(o, x)
C. o.do_it(x)
D. o.do_it(o, x)


methods continued

>>> test.find("my")
8
>>> 42.upper()
SyntaxError: invalid syntax
>>> upper(test)
barf (NameError: upper is a method, not a globally accessible function)
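The session above can be replayed as a small script. This is only a sketch: the variable names holding the error types are made up for illustration, and the parenthesized (42).upper() is used instead of 42.upper() so that the file parses at all.

```python
# Methods are reached through an object with dot notation.
test = "This is my test string"

shouted = test.upper()        # string method, no extra input
position = test.find("my")    # string method with one extra input

# Calling a method as if it were a global function fails.
try:
    upper(test)
except NameError as err:
    global_call_error = type(err).__name__

# An int has no upper() method. The parentheses matter: a bare
# 42.upper() is a SyntaxError because "42." starts a float literal.
try:
    (42).upper()
except AttributeError as err:
    int_call_error = type(err).__name__

print(shouted, position, global_call_error, int_call_error)
```

So the two failures are different in kind: upper(test) fails because no global name upper exists, while the int simply has no such method.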


Methods depend on type of object

• scdb.head() gives back the first 5 rows (which the interpreter prints out) because head() is a method of objects of type Pandas dataframe, which is the type of the scdb object
• "test string".head() gives back an error because head is not a method of strings
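One way to check this yourself, assuming pandas is installed. The tiny frame below is a made-up stand-in for scdb, with placeholder values rather than real Supreme Court data.

```python
import pandas

# Hypothetical stand-in for scdb (placeholder values only).
frame = pandas.DataFrame({"justiceName": ["J1", "J2", "J3"],
                          "chief": ["C1", "C1", "C2"]})
text = "test string"

# hasattr() asks whether an object has a given method or attribute.
frame_has_head = hasattr(frame, "head")
text_has_head = hasattr(text, "head")

# Calling a method the type does not have raises AttributeError.
try:
    text.head()
    outcome = "no error"
except AttributeError:
    outcome = "AttributeError"

print(frame_has_head, text_has_head, outcome)
```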


Methods' importance

• Understanding key data types depends on understanding their methods
• We saw many methods for strings
• We have used the append method for lists, and will come back to more list methods
• file reference methods write(), read(), readline(), readlines()
• Pandas dataframe methods head(), tail(), etc.
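A short sketch of the list and file methods named above. The file name demo.txt and its contents are invented for the example, and the file is written to a temporary directory so nothing of yours is touched.

```python
import os
import tempfile

# List method: append adds one item to the end of the list.
values = []
values.append(3)
values.append(5)

# File reference methods, on a throwaway file.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

out_file = open(path, "w")
out_file.write("first line\nsecond line\nthird line\n")
out_file.close()

in_file = open(path, "r")
first = in_file.readline()     # one line, including its newline
rest = in_file.readlines()     # list of the lines not yet read
in_file.close()

in_file = open(path, "r")
whole = in_file.read()         # entire file as one string
in_file.close()

print(values, repr(first), rest, len(whole))
```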


When you get to CS 341 & 342

• Or if you know Java or C++ now
• methods are an Object Oriented (OO) concept
• In our CS 111
  ◦ We do need to know the basics of dot notation and methods
  ◦ We will otherwise be ignoring OO, and taking primarily a procedural approach


PANDAS (FROM ANOTHER ANGLE)


Pandas: What and Why

• High-performance way to work with large dataframes
• Dataframe: The 2-d data structure most familiar from Excel spreadsheets, often with a header row
• Pandas built to play nicely with matplotlib for plotting (and incidentally NumPy and Scikit-Learn for machine learning)


Why Pandas and not Excel

• Excel not designed for working with large datasets
• Chicago Crimes 2008 to present file:
  ◦ 1.04 million rows, 18 columns
  ◦ Open file in Python: Instantaneous
  ◦ pandas.read_csv(): 8 secs (Sloan’s 2013 laptop)
  ◦ Open file in Excel: several minutes
    ▪ Just resize one column for better viewing: 5-30 sec


Why Pandas and not Excel (2)

• Excel allows you to say/do/compute whatever is built into Excel
• Python is a general-purpose programming language: Can say/do/compute anything you want, not limited to the functions Microsoft provides in Excel
• Geeky fine point: Anything that can be done with a computer. There are “uncomputable problems” (theory of computation, CS 301; maybe a special lecture in this class if time at end). Not really an issue in data analytics


Pandas data types

• Most important: dataframe, which we are getting from pandas.read_csv()
  ◦ 2-d array, with column headers
• Series: 1-d array, e.g., one column of a dataframe, is second most important
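To see both types side by side, here is a minimal sketch assuming pandas is installed. The CSV text is made up and read from memory with io.StringIO; in the course you would hand read_csv a real file instead.

```python
import io
import pandas

# Made-up CSV text with a header row (not real data).
csv_text = "name,score\nann,3\nbob,5\ncat,4\n"

df = pandas.read_csv(io.StringIO(csv_text))   # a dataframe
column = df["score"]                          # one column: a series

df_type = type(df).__name__
column_type = type(column).__name__

print(df_type, df.shape)         # 2-d: (rows, columns)
print(column_type, column.shape) # 1-d: (rows,)
```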


Dataframe indexing

• frame[columnname] returns series from column with name columnname
• Giving the []s a list of names selects those columns in list order. E.g.,
  ◦ scdb[["justiceName","chief","docketId"]]
• Other indexing: .iloc, .loc
  ◦ (also others we won't cover)
• Special case: a slice index applied directly to the whole frame (e.g., frame[2:5]) slices by rows. This is a convenience because it's a common operation, but it is inconsistent with overall Pandas syntax
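The three forms of [] indexing can be tried on a small frame. This is a sketch assuming pandas is installed; the frame borrows scdb's column names but its values are placeholders.

```python
import pandas

# Placeholder values under scdb-style column names.
frame = pandas.DataFrame({
    "justiceName": ["J1", "J2", "J3", "J4"],
    "chief": ["C1", "C1", "C2", "C2"],
    "docketId": ["d1", "d2", "d3", "d4"],
})

one_column = frame["chief"]                      # a series
two_columns = frame[["docketId", "justiceName"]] # dataframe, list order
some_rows = frame[1:3]                           # slice: rows at positions 1 and 2

print(type(one_column).__name__)
print(list(two_columns.columns))
print(list(some_rows.index))
```

Note that the same brackets pick columns when given a name or list, but rows when given a slice, which is the inconsistency mentioned above.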


Dataframe positional slicing: iloc

• .iloc for 100% positional indexing and slicing with usual Python 0 to length-1 numbering (stands for "integer location")
• Arguments for both dimensions separated by comma [rows, cols]:
  ◦ frame.iloc[:3,:4] upper left 3 rows/4 cols
  ◦ frame.iloc[:,:3] all rows, first 3 cols
• One argument: rows
  ◦ frame.iloc[3:6] second 3 rows
  ◦ frame.iloc[42] row at position 42 (the 43rd row, since numbering starts at 0)
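A quick check of the positional rules on a made-up 10-row, 5-column frame (assuming pandas is installed). The column names a through e and the numbers are arbitrary.

```python
import pandas

# Arbitrary numbers; column "a" equals the row position.
frame = pandas.DataFrame({
    "a": list(range(10)),
    "b": list(range(10, 20)),
    "c": list(range(20, 30)),
    "d": list(range(30, 40)),
    "e": list(range(40, 50)),
})

corner = frame.iloc[:3, :4]      # upper left 3 rows / 4 cols
first_cols = frame.iloc[:, :3]   # all rows, first 3 cols
second_three = frame.iloc[3:6]   # rows at positions 3, 4, 5
fifth_row = frame.iloc[4]        # single row, returned as a series

print(corner.shape, first_cols.shape)
print(list(second_three["a"]))
print(fifth_row["b"])
```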


Dataframe label indexing: .loc

• Use .loc to access by labels, or mix
  ◦ selection list will put columns in list's order; selection set in {}s original dataframe order
  ◦ scdb.loc[3:6,{'docketId','chief','justiceName'}]
    ▪ Rows 3 through 6 inclusive, columns in scdb's order
    ▪ Newer Pandas releases (2.0 and later) reject a set as an indexer, so prefer the list form
  ◦ scdb.loc[3:6,['docketId','chief','justiceName']]
    ▪ Rows 3 through 6 inclusive, columns in order ['docketId', 'chief', 'justiceName']
• Notice loc uses slices inclusive of both ends, unlike all the rest of Python & Pandas
• .loc with integer position slices on string-labeled columns: error (e.g., foo.loc[3:6, 2:4]), because .loc slices by label; use .iloc for positions
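The inclusive-slice behavior is easy to confirm on a small frame. A sketch assuming pandas is installed; the values are placeholders under scdb-style column names, and the row labels are the default 0, 1, 2, ...

```python
import pandas

frame = pandas.DataFrame({
    "docketId": ["d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7"],
    "chief": ["C1"] * 8,
    "justiceName": ["J0", "J1", "J2", "J3", "J4", "J5", "J6", "J7"],
})

# Label slice 3:6 includes BOTH ends: rows 3, 4, 5, 6.
picked = frame.loc[3:6, ["justiceName", "docketId"]]

# Columns can be sliced by label too, again inclusive of both ends.
by_label = frame.loc[3:6, "docketId":"chief"]

# Compare with .iloc, where 3:6 stops before position 6.
positional = frame.iloc[3:6]

print(len(picked), list(picked.columns))
print(list(by_label.columns))
print(len(positional))
```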


Dataframe and series methods

• head(): returns sub-dataframe (top rows)
  ◦ or for series, first entries
• tail(): same, bottom rows
  ◦ With no argument they default to 5 rows; can give positive integer argument for number of rows
• count(): For series, returns number of values (excluding missing, NaN, etc.)
  ◦ For dataframe, returns series, with count of each column, labeled by column
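These three methods can be tried on a small made-up frame with some missing values (assuming pandas is installed). The column names x and y are arbitrary.

```python
import pandas

# Column y has two missing entries (None becomes NaN).
frame = pandas.DataFrame({
    "x": [0, 1, 2, 3, 4, 5, 6, 7],
    "y": [1.0, None, 3.0, None, 5.0, 6.0, 7.0, 8.0],
})

top = frame.head()            # default: first 5 rows
bottom = frame.tail(2)        # last 2 rows
y_count = frame["y"].count()  # series: number of non-missing values
counts = frame.count()        # dataframe: a series of counts, one per column

print(len(top), list(bottom["x"]), y_count)
print(counts)
```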


Dataframe and series methods (cont.)

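• abs, max, mean, median, min, mode, sum
  ◦ All behave like count, except will give errors if data types don't support the operation
  ◦ E.g., a series of strings does return good answer with .max() method (based on alphabetical order), but cannot take .median()
  ◦ abs and mode are partial exceptions: abs returns a result for every entry, and mode can return more than one value when there is a tie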
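A sketch of the string-versus-number difference, assuming pandas is installed. Both series are made up, and the exact exception type for the failed median may vary between pandas versions, so the code catches either likely one.

```python
import pandas

numbers = pandas.Series([3, -1, 4, -1, 5])
words = pandas.Series(["pear", "apple", "fig"])

biggest = numbers.max()
middle = numbers.median()
total = numbers.sum()
magnitudes = numbers.abs()     # one result per entry, not a summary

last_word = words.max()        # alphabetical order

# A median of strings is not defined, so pandas raises an error.
median_failed = False
try:
    words.median()
except (TypeError, ValueError):
    median_failed = True

print(biggest, middle, total, list(magnitudes), last_word, median_failed)
```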


plotting

• Both DataFrame and Series have a plot() method (as do many other Pandas types)
• Must have loaded Python's plotting module, because Pandas is making use of it:

import matplotlib.pyplot as plt

• Default is Series makes a line graph; DataFrame makes one line graph per column, and labels each line by column labels


100% Optional: Aside for graph geeks

• Optional for fun: To change style of your plot:

import matplotlib
matplotlib.style.use('fivethirtyeight')
# OR
matplotlib.style.use('ggplot')  # R style

• Out of the box, it's Matlab style, which some folks like a lot


.plot() method

• Needs no arguments
• Has optional arguments such as kind:
  ◦ .plot(kind='bar') for bar graphs
  ◦ Many others including
    ▪ 'hist' for histogram
    ▪ 'box' for box with whiskers
    ▪ 'area' for stacked area plots
    ▪ 'scatter' for scatter plots
    ▪ 'pie' for pie charts


.plot() x and y arguments

• If you have dataframe but want one column as x values and one as y values, can use optional argument(s)
• df.plot(x='Year')
  ◦ Plot all columns but 'Year' as line graphs against x being the Year column
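A rough sketch of the x and kind arguments, assuming pandas and matplotlib are installed. The numbers and the column names a and b are invented, and the "Agg" backend is chosen only so the sketch runs without opening a window; in an interactive session you would leave that line out.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen, no display needed
import matplotlib.pyplot as plt
import pandas

# Invented numbers, for illustration only.
df = pandas.DataFrame({
    "Year": [2001, 2002, 2003, 2004],
    "a": [1, 3, 2, 4],
    "b": [4, 2, 3, 1],
})

# Line graph: every column except Year, plotted against Year.
line_axes = df.plot(x="Year")
line_count = len(line_axes.get_lines())

# Same data as a bar graph: one bar per value in a and b.
bar_axes = df.plot(x="Year", kind="bar")
bar_count = len(bar_axes.patches)

plt.close("all")
print(line_count, bar_count)
```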


Brief demo: Chi murder rate by year

import matplotlib.pyplot as plt
from matplotlib import style
import pandas

f = open("Chicago murders to 2012HeaderRows.csv", "r")
df = pandas.read_csv(f)
plt.ion()


groupby(label) method

• Idea: split dataframe into groups that all have same value in column named label. E.g.,
  ◦ grouped = scdb.groupby("justiceName")
  ◦ grouped has many of same methods, indexing options as a dataframe
  ◦ grouped.count() → dataframe with 60 columns (all but justice name) and 1 row per justiceName
  ◦ grouped["docketId"] selects out that column
  ◦ we plotted grouped["docketId"].count()
    ▪ it has a plot() method
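The same steps on a toy frame, assuming pandas is installed. It has only three columns rather than scdb's 61, and all values are placeholders.

```python
import pandas

frame = pandas.DataFrame({
    "justiceName": ["J1", "J2", "J1", "J3", "J1", "J2"],
    "docketId": ["d1", "d2", "d3", "d4", "d5", "d6"],
    "chief": ["C1"] * 6,
})

grouped = frame.groupby("justiceName")

# Dataframe: one row per justiceName, one column per other column.
per_justice = grouped.count()

# Series: number of docketId values in each group.
docket_counts = grouped["docketId"].count()

print(per_justice)
print(docket_counts)
```

Here docket_counts is an ordinary series, so docket_counts.plot(kind='bar') would chart it.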


A series and series groupby method

• nunique() is a method of true series, where it gives the number of distinct values in the series (a number)
• nunique() is also a method of a series-like groupby object, where it gives a true series: How many distinct values were in each group.
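Both uses of nunique() on a toy frame, assuming pandas is installed. Values are placeholders under scdb-style column names.

```python
import pandas

frame = pandas.DataFrame({
    "justiceName": ["J1", "J1", "J1", "J2", "J2"],
    "chief": ["C1", "C1", "C2", "C2", "C2"],
})

# On a true series: a single number.
distinct_chiefs = frame["chief"].nunique()

# On a series-like groupby object: a series, one entry per group.
chiefs_per_justice = frame.groupby("justiceName")["chief"].nunique()

print(distinct_chiefs)
print(chiefs_per_justice)
```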


PYTHON STANDARD LIBRARY & BEYOND: MODULES


Extending Python

• Every modern programming language has way to extend basic functions of language with new ones
• Python: importing a module
• module: Python file with new capabilities defined in it
• Once you import a module, it's as if you typed it in: you get all functions, objects, variables defined in it immediately


Python Standard Library

• Python always comes with big set of modules
• List at https://docs.python.org/3/py-modindex.html
• Examples

  csv       Read/write csv files
  datetime  Basic date & time types
  math      Math stuff (e.g., sin(), cos())
  os        E.g., list files in your operating system
  random    random number generation
  urllib    Open URLs, parse URLs
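A small taste of four of these modules. The particular calls are just one possible selection, and the seed value 111 is arbitrary.

```python
import datetime
import math
import os
import random

# datetime: the date of this lecture, and its day of the week.
lecture_day = datetime.date(2016, 10, 13)
weekday_number = lecture_day.weekday()   # Monday is 0, so Thursday is 3

# math: trigonometric functions.
cosine = math.cos(0.0)

# os: names of the files in the current directory.
names = os.listdir(".")

# random: a simulated die roll (seeded so reruns give the same roll).
random.seed(111)
roll = random.randint(1, 6)

print(weekday_number, cosine, len(names), roll)
```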


Using Modules

• Use import <module_name> to make module's functions available
• Style: Put all import statements at top of file
• After import module_name, access its functions (and variables, etc.) through module_name.function_name
• If module_name is long, can abbreviate in import with as:
  ◦ import module_name as mn
  ◦ mn.function_name
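Both import forms with real standard library modules. The abbreviation dt is a common choice but still just a name picked for this sketch.

```python
# Style: all imports at the top of the file.
import math
import datetime as dt

# Plain import: use the full module name as the prefix.
root = math.sqrt(16.0)

# Import with "as": use the abbreviation as the prefix.
gap = dt.date(2016, 10, 13) - dt.date(2016, 10, 1)

print(root, gap.days)
```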


If you prefer to save typing

• (I mostly do not do this)
• To access function_name without having to type module_name prefix, use:

from module_name import function_name
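For comparison, here are both styles applied to math.sqrt. The variable names short_form and prefixed_form are made up for the example.

```python
from math import sqrt   # sqrt is now usable without a prefix
import math             # the prefixed form still works alongside it

short_form = sqrt(25.0)
prefixed_form = math.sqrt(25.0)

print(short_form, prefixed_form)
```

One likely reason to prefer the prefix: math.sqrt tells the reader where the function came from, while a bare sqrt could be anyone's.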


Common but not standard

• matplotlib and pandas are not part of set of modules that must come with every Python 3
• matplotlib is very, very widely used, and pandas is widely used
• Both are among the many modules that come with the Anaconda distribution of Python 3