cs 111: program design i · cs 111: program design i lecture 15: objects, pandas, modules robert h....
TRANSCRIPT
![Page 1: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/1.jpg)
CS 111: Program Design I Lecture 15: Objects, Pandas, Modules
Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016
![Page 2: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/2.jpg)
OBJECTS AND DOT NOTATION
![Page 3: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/3.jpg)
Objects
n (Implicit in Chapter 2, Variables, & 9.5, String Methods of book, but not explicit anywhere: So pay attention!)
n Everything in Python is an object n Object combines
q data (e.g., number, string, list) with q methods that can act on that object
![Page 4: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/4.jpg)
Methods
n Methods: like (special case of) function but not globally accessible
n Cannot call method just by giving its name, the way we call print(),open(),abs(),type(),range(), etc.
n Method: function that can only be accessed through an object q Using dot notation
![Page 5: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/5.jpg)
Dot notation
n To call method, use dot notation: q object_name.method()
n String example:
>>>test="Thisismyteststring">>>test.upper()'THISISMYTESTSTRING'
![Page 6: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/6.jpg)
If o is object of type having method do_it where do_it needs an input in addition to o, and x is defined, what is the proper way to call do_it? A. do_it(x) B. do_it(o, x) C. o.do_it(x) D. o.do_it(o, x)
![Page 7: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/7.jpg)
methods continued
>>>test.find("my")8>>>42.upper()SyntaxError:invalidsyntaxupper(test)barf
![Page 8: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/8.jpg)
Methods depend on type of object
n scdb.head() prints out 5 rows because head() is a method of objects of type Pandas dataframe, which is type of scdb object
n "test string".head() gives back an error because head is not a method of strings
![Page 9: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/9.jpg)
Methods' importance
n Understanding key data types depends on understanding their methods
n We saw many methods for strings n We have used the append method for lists,
and will come back to more list methods n file reference methods write(), read(),
readline(), readlines() n Pandas dataframe methods head(), tail(), etc.
![Page 10: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/10.jpg)
When you get to CS 341 & 342
n Or if you know Java or C++ now n methods are an Object Oriented (OO)
concept n In our CS 111
q We do need to know the basics of dot notation and methods
q We will otherwise be ignoring OO, and taking primarily a procedural approach
![Page 11: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/11.jpg)
PANDAS (FROM ANOTHER ANGLE)
![Page 12: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/12.jpg)
Pandas: What and Why
n High performance way to work with large dataframes
n Dataframe: The 2-d data structure most familiar from Excel spreadsheets, often with a header row
n Pandas built to play nicely with matplotlib for plotting (and incidentally NumPy and Scikit-Learn for machine learning)
![Page 13: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/13.jpg)
Why Pandas and not Excel
n Excel not designed for working with large datasets
n Chicago Crimes to 2008 to present file: q 1.04 million rows, 18 columns q Open file in Python: Instantaneous q pandas.read_csv(): 8 secs (Sloan’s 2013 laptop) q Open file in Excel: several minutes
n Just resize one column for better viewing: 5-30 sec
![Page 14: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/14.jpg)
Why Pandas and not Excel (2)
n Excel allows you to say/do/compute whatever is built into Excel
n Python is general purpose programming language: Can say/do/compute anything want, not limited to the functions Microsoft provides in Excel
n Geeky fine point: Anything that can be done with a computer. There are “uncomputable problems” (theory of computation CS 301, maybe special lecture in this class if time at end. Not really issue in data analytics)
![Page 15: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/15.jpg)
Pandas data types
n Most important: dataframe, which we are getting from pandas.read_csv() q 2-d array, with column headers
n Series: 1-d array, e.g., one column of a dataframe, is second most important
![Page 16: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/16.jpg)
Dataframe indexing
n frame[columnname] returns series from column with name columnname
n Giving the []s a list of names selects those columns in list order. E.g., q scdb[["justiceName","chief","docketID"]]
n Other indexing: .iloc, .loc q (also others we won't cover)
n Special case is that specifically a slice index to whole frame will slice by rows for convenience because it's a common operation, but inconsistent with overall Pandas syntax
![Page 17: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/17.jpg)
Dataframe positional slicing: iloc
n .iloc for 100% positional indexing and slicing with usual Python 0 to length-1 numbering (stands for "integer location")
n Arguments for both dimensions separated by comma [rows, cols]: q frame.iloc[:3,:4] upper left 3 rows/4 cols q frame.iloc[:,:3] all rows, first 3 cols
n One argument: rows q frame.iloc[3:6] second 3 rows q fame.iloc[42] 42nd row
![Page 18: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/18.jpg)
Dataframe label indexing: .loc
n Use .loc to access by labels, or mix q selection list will put columns in list's order;
selection set in {}s original dataframe order q scdb.loc[3:6,{'docketId','chief','justiceName'}]
n Rows 3 through 6 inclusive, columns in scdb's order q scdb.loc[3:6,['docketId','chief','justiceName']]
n Rows 3 through 6 inclusive, columns in order ['docketId', 'chief', 'justiceName']
n Notice loc uses slices inclusive of both ends, unlike all the rest of Python & Pandas
n .loc with only slices: error (e.g., foo.loc[3:6, 2:4])
![Page 19: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/19.jpg)
Dataframe and series methods
n head(): returns sub-dataframe (top rows) q or for series, first entries
n tail(): same, bottom rows q With no argument they default to 5 rows; can give
positive integer argument for number of rows n count(): For series, returns number of values
(excluding missing, NaN, etc.) q For dataframe, returns series, with count of each
column, labeled by column
![Page 20: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/20.jpg)
Dataframe and series methods (cont.)
n abs, max, mean, median, min, mode, sum q All behave like count, except will give errors if data
types don't support the operation q E.g., a series of strings does return good answer
with .max() method (based on alphabetical order), but cannot take .median()
![Page 21: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/21.jpg)
plotting
n Both DataFrame and Series have a plot() method (as do many other Pandas types)
n Must have loaded Python's plotting module, because Pandas is making use of it:
importmatplotlib.pyplotasplt
n Default is Series makes a line graph; DataFrame makes one line graph per column, and labels each line by column labels
![Page 22: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/22.jpg)
100% Optional: Aside for graph geeks
n Optional for fun: To change style of your plot: importmatplotlibmatplotlib.style.use('fivethirtyeight')#ORmatplotlib.style.use('ggplot') #Rstyle
n Out of the box, it's Matlab style, which some folks like a lot
![Page 23: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/23.jpg)
.plot() method
n Needs no arguments n Has optional arguments such as kind:
q .plot(kind='bar') for bar graphs q Many others including
n 'hist' for histogram n 'box' for box with whiskers n 'area' for stacked area plots n 'scatter' for scatter plots n 'pie' for pie charts
![Page 24: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/24.jpg)
.plot() x and y arguments
n If you have dataframe but want one column as x values and one as y values, can use optional argument(s)
n df.plot(x='Year') q Plot all columns but 'Year" as line graphs against
x being the Year column
![Page 25: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/25.jpg)
Brief demo: Chi murder rate by year
import matplotlib.pyplot as pltfrom matplotlib import styleimport pandas
f = open("Chicago murders to 2012HeaderRows.csv", "r")df = pandas.read_csv(f)plt.ion()
![Page 26: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/26.jpg)
groupby(label) method
n Idea: split dataframe into groups that all have same value in column named label. E.g., q grouped=scdb.groupby("justiceName") q grouped has many of same methods, indexing
options as a dataframe q grouped.count() à dataframe with 60 columns (all
but justice name) and 1 row per justiceName q grouped["docketID"] selects out that column q we plotted grouped["docketId"].count()
n it has a plot() method
![Page 27: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/27.jpg)
A series and series groupby method
n nunique() is a method of true series, where it gives the number of distinct values in the series (a number)
n nunique() is a method of a series-like groupby object, where it gives a true series: How many were in each group.
![Page 28: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/28.jpg)
PYTHON STANDARD LIBRARY & BEYOND: MODULES
![Page 29: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/29.jpg)
Extending Python
n Every modern programming language has way to extend basic functions of language with new ones
n Python: importing a module n module: Python file with new capabilities
defined in it n One you import module, it's as if you typed it
in: you get all functions, objects, variables defined it immediately
![Page 30: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/30.jpg)
Python Standard Library
n Python always comes with big set of modules n List at
https://docs.python.org/3/py-modindex.html n Examples
csv Read/write csv files datetime Basic date & time types math Math stuff (e.g., sin(), cos()) os E.g., list files in your operating system random random number generation urllib Open URLs, parse URLs
![Page 31: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/31.jpg)
Using Modules
n Use import<module_name> to make module's function's available
n Style: Put all import statements at top of file n After importmodule_name, access its
functions (and variables, etc.) through module_name.function_name
n If module_name is long, can abbreviate in import with as: q importmodule_nameasmnq mn.function_name
![Page 32: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/32.jpg)
If you prefer to save typing
n (I mostly do not do this) n To accessfunction_namewithout having
to type module_name prefix, use:
frommodule_nameimportfunction_name
![Page 33: CS 111: Program Design I · CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016](https://reader031.vdocuments.mx/reader031/viewer/2022041202/5d4d38ac88c993dd728b6e48/html5/thumbnails/33.jpg)
Common but not standard
n matplotlib and pandas are not part of set of modules that must come with every Python 3
n matplotlib is very, very widely used, and pandas is widely used
n Both are among the many modules that come with the Anaconda distribution of Python 3