
Upload: judith-atkinson

Post on 17-Dec-2015


Welcome to IST 380!

Data Science Programming

"When the course was over, I knew it was a good thing." - New York Times Review of Courses

"We don't have strong enough words to describe this class." - US News and Course Report

"We give this course two thumbs!" - Ebert and Roeper

an advocate of concrete computing, and HMC's mascot


About myself

Who: Zach Dodds

Where: Harvey Mudd College

What: Research includes robotics and computer vision

When: Mondays 7-10 pm, here in ACB 119

Contact Information

[email protected]

909-607-0867

HMC Beckman B111

Office Hours: Friday mornings, 9-11 am, or set up a time...

TMI?

fan of low-tech games

fan of low-level AI

IST 380 ~ the big picture

What is it? Why me?

Data Science Venn Diagram

Hmmm… where am I on this diagram?

What is it?

Data?!

• Neighbor's name

• A place they consider home

• Are they working at a company now? Where?

• How many U.S. states have they visited? (state reminders…)

• Their favorite unhealthy food… ?

• Do they have any "Data Science" background? (statistics, machine learning, CS)

Data!

• Neighbor's name: Zachary Dodds

• A place they consider home: Pittsburgh, PA

• Are they working at a company now? Where? Harvey Mudd

• How many U.S. states have they visited? 44

• Their favorite unhealthy food… ? M&Ms

• Do they have any "Data Science" background (statistics, machine learning, CS)? mostly CS for me…


be sure to set up your login + profile for the submission site…

This class is truly seminar-style: I'm here, as you are, in order to gain insights into this very new field…

Data Science concerns

Is "Data Science" important or just trendy?

Hmmm…

the companies are expanding as fast as the data!

Data, data everywhere…

There's certainly a lot of it!

Data produced each year (logarithmic scale):

2002: 5 EB

2006: 161 EB

2009: 800 EB

2011: 1.8 ZB

2015: 8.0 ZB

For comparison:

Human brain's capacity: 14 PB

100 years of HD video + audio: 60 PB

120 PB

Scale marks: 1 Petabyte, 1 Exabyte, 1 Zettabyte

1 TB = 1000 GB

1 Petabyte == 1000 TB

References

(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store

(life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound); almost certainly a gross overestimate, as sleep can be compressed significantly!

(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm

(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf

(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf

(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm

(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
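The yearly figures quoted above are easier to compare once they share a unit. The following is a minimal R sketch, not part of the original slides, that converts them to exabytes (taking 1 ZB = 1000 EB) and looks at them on a log scale; the variable names are illustrative.

```r
# Yearly data volumes quoted on the slide, converted to exabytes.
# Conversion used here: 1 ZB = 1000 EB, so 1.8 ZB -> 1800 EB and 8 ZB -> 8000 EB.
year     <- c(2002, 2006, 2009, 2011, 2015)
exabytes <- c(5, 161, 800, 1800, 8000)
names(exabytes) <- year

# A logarithmic scale turns multiplicative growth into roughly even steps.
log_scale <- log10(exabytes)
print(round(log_scale, 2))

# Growth factor from each quoted year to the next one.
growth <- exabytes[-1] / exabytes[-length(exabytes)]
print(growth)

# The brain-capacity estimate (14 PB) expressed in exabytes, for comparison.
brain_eb <- 14 / 1000
print(brain_eb)
```

Plotting `year` against `log_scale` with `plot()` would give a rough version of the slide's chart.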

data

information

knowledge

wisdom

I'd call it data, not information

Big Data?

I agree with this…

Make data easier to use ~ by using it!

It may be true that Data Science isn't a science, but that doesn't mean it's not useful!

IST 380 ~ the big picture

What? Data Science Programming

Why? Data Rules

All of our insights (large and small, permanent and ephemeral, natural and artificial) come about through the integration of lots of data.

Data Science simply recognizes that the rules and skills behind those insights are widely applicable…

A few examples…

Make3d

How is this being done?

Andrew Ng ~ Computers and Thought award, 2009

… Data Science is at the heart of computer science

and how do we succeed?


Stanford's Autonomous Vehicles project (Thrun et al.)

Learning to Powerslide


"my summer was finding that red line"

Learning ground from obstacles


classification segmentation

Insights beyond science

Marketing

Visualization

Motivation

Recommender Systems

Netflix Prize

predicting movie ratings

Bob Bell, winner of the "Netflix prize" (I don't know this guy)

Some films are difficult to predict… and others are easier!

Napoleon Dynamite = 1.22

Batman Begins = .75

Finding Nemo = .67

Lord of the Rings = .42

Why IST 380?

Specific skills:

R statistical environment (and the S programming language)

Experience with several statistical analyses (descriptive statistics)

Experience with predictive statistics (modeling) and machine learning algorithms

Final project ~ open-ended with datasets of your choice

Broad background:

You'll be confident and capable with whatever datasets you encounter in the future, on your own or as part of a team.

About IST 380 …

Details

Web Page: http://www.cs.hmc.edu/~dodds/IST380

Assignments, online text, necessary files, lecture slides are linked

First week's assignment: Getting started with R

Programming: R (www.r-project.org/)

Textbook: An Introduction to Data Science (jsresearch.net/groups/teachdatascience/)

Grab both of these now… freely available online, and many online resources…

Homepage

Go to the course page: http://www.cs.hmc.edu/~dodds/IST380/

Grab R and the text from these two links…

Homework

Assignments: ~ 2-5 problems/week, ~ 100 points; extra credit, often

Due Tuesday of the following week by 11:59 pm.

Assignment 1 due Tuesday, February 5. (1 week + 1 day…)

Working on programs: On your own or in groups of 2.

Divide the work at the keyboard evenly!

Submitting programs: at the submission website

Today's Lab: install software, ensure accounts are working, try out R (the first HW is officially due on 2/5)

Outline

Weeks 1-5

using R

descriptive statistics

predictive statistics

probability distributions

Weeks 6-10

"Data Science"

"Machine Learning"

statistical modeling

support vector machines (SVMs)

random forests

k-means algorithm

nearest neighbors (NN)

Weeks 11-15

approximate!

Final Project

No breaks?!

Grading

Grades

Based on points percentage:

~ 800 points for assignments

~ 400 points for the final project

if score >= 0.95: grade = "A"
elif score >= 0.90: grade = "A-"
elif score >= 0.86: grade = "B+"

see the course syllabus for the full list...

Final project

• the last ~4 weeks will work towards a larger, final project

• there will be a short design phase and a short final presentation

• choose your own problem to study (I'll have some suggestions, too.)

• I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.
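The grade cutoffs in the Grading section are written in a Python-like shorthand. A rough R equivalent, as a sketch only: the function name `letter_grade` and the `earned` value are hypothetical, and only the three cutoffs shown on the slide are encoded, since the full list lives in the syllabus.

```r
# Hypothetical helper: maps a score in [0, 1] to a letter grade.
# Only the three cutoffs from the slide are known; everything lower
# falls through to a placeholder rather than a guessed grade.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    "see syllabus"
  }
}

# Point totals from the slide: ~800 for assignments, ~400 for the project.
assignment_points <- 800
project_points    <- 400

# 'earned' is a made-up example value, not real course data.
earned <- 1100
score  <- earned / (assignment_points + project_points)
print(letter_grade(score))
```

Using `else if` matters here: with three independent `if` statements, a score of 0.96 would pass every test and end up with the last grade assigned.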

Academic Honesty

This course operates under CGU's (and all of Claremont Schools') Academic Honesty policies…

• Your work must be your own. This must be true for the whole team, if you're working in a pair.

• Consulting with others (except team members or myself) is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU's academic honesty policy.

• A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.

Thoughts?

Getting to know… R

http://lang-index.sourceforge.net/#categ

R is the programmer's toolkit for statistics; SAS, Stata, SPSS are preferred by those in business intelligence

Free… and very well supported online…

R is responsive, up-to-date, and flexible: Data Science vs. Statistics

Try it!

1) Find the IST 380 course webpage: www.cs.hmc.edu/~dodds/IST380/

2) Download and install R

3) Run R and try some basic commands at the prompt:

6 * 7

rnorm(10)

x <- 380
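Here is an annotated version of those three commands, as a small sketch of what to expect at the prompt. The `set.seed()` call and the variable names `answer` and `draws` are additions for illustration; without a seed, `rnorm(10)` gives different numbers on every run.

```r
# Arithmetic: R evaluates the expression and prints the result.
answer <- 6 * 7
print(answer)

# rnorm(10) draws 10 random values from a standard normal distribution.
# The seed is an addition here, used only to make the draws repeatable.
set.seed(380)
draws <- rnorm(10)
print(draws)

# Assignment with <- stores a value; typing the name prints it.
x <- 380
x
```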

Getting started!

1) Open Matloff's Why R? notes

2) Skip ahead to page 7, the "5 minute example session"

3) Try out the commands in section 2.2 to get started…

4) When you finish, save your session and submit it!

This is problem 1 this week

Saving your session

This is problem 1 this week

1) Create a folder named hw1, perhaps on your desktop

2) Use the Save to file… (Windows) or Save as… (Mac) in order to save your current console session into hw1

3) Name that file pr1.txt

4) From your operating system, open up that file in order to confirm it contains your whole session!

Submitting your work

1) Zip up hw1 into hw1.zip

2) From the course webpage, click on the submission site link.

3) Choose a submission site login name & let me know!

4) Once your account is made, log in, change your password to something you know, and submit hw1.zip

5) You can submit again; all copies are saved…

This webserver can be spacey -- I should know!

Troubles? Email me!

You've completed Problem 1!

Reflection

Average and standard deviation?

Assignment?

Comments?

Printing?

Comments?

Creating a vector?
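As a quick review of those reflection questions, here is a minimal sketch touching each one. The `scores` values are made up for illustration and are not from the tutorial session.

```r
# Creating a vector: c() combines values into one vector.
scores <- c(88, 92, 79, 95)   # made-up example values

# Assignment: <- binds a name to a value.
avg    <- mean(scores)        # average
spread <- sd(scores)          # sample standard deviation

# Comments: everything after a # on a line is ignored by R.

# Printing: print() shows a value explicitly; at the prompt,
# typing a name on its own line does the same thing.
print(avg)
print(spread)
```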

R types

You can use mode() to view the type of a variable.
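A few calls to `mode()` on different kinds of values, as a short sketch; the variable names are illustrative.

```r
# mode() reports the basic storage type of a value.
number_mode <- mode(380)          # "numeric"
text_mode   <- mode("IST 380")    # "character"
truth_mode  <- mode(TRUE)         # "logical"
vector_mode <- mode(c(1, 2, 3))   # "numeric": a vector has the mode of its elements

print(c(number_mode, text_mode, truth_mode, vector_mode))
```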

Where's the big data?

Vectors are R's ordered collections of a single type of element (R also has a separate list type, which can mix types)

c ~ concatenate

the colon : also creates vectors
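Both ways of building vectors can be sketched briefly; the names `v`, `w`, `both`, and `mixed` are illustrative.

```r
# c() concatenates its arguments into a single vector.
v <- c(3, 8, 0)

# The colon operator builds a sequence of consecutive integers.
w <- 1:5

# c() also joins existing vectors end to end.
both <- c(v, w)
print(both)

# A vector holds one type only: mixing a number with a string
# converts everything to character.
mixed <- c(1, "two")
print(mode(mixed))
```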

Analyzing vectors – try these…

Square brackets [] can "subset" (or "slice") vectors

you can use a boolean vector to subset another vector
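A small sketch of subsetting by position and by a boolean (logical) vector; the values in `v` are made up for illustration.

```r
v <- c(10, 20, 30, 40, 50)

# Subsetting by position (R indexes from 1).
second <- v[2]       # the single element 20
middle <- v[2:4]     # elements 2 through 4
rest   <- v[-1]      # a negative index drops that position

# Subsetting with a boolean vector: keep elements where the test is TRUE.
keep <- v > 25       # FALSE FALSE TRUE TRUE TRUE
big  <- v[keep]

print(middle)
print(big)
```

The comparison `v > 25` is itself a vector, so `v[v > 25]` works in one step as well.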

NA

R uses NA to represent data that is "not available"

What is going on here?

The function is.na( ) tests for NA

This uses subsetting to remove NA values!
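The slide's console output is not in the transcript, so this is a sketch of the kind of session it describes, with made-up values in `x`.

```r
# A vector with two missing entries.
x <- c(4, NA, 7, NA, 1)

# is.na() returns TRUE wherever a value is missing.
missing <- is.na(x)
print(missing)

# Most summaries return NA if any input is NA.
print(mean(x))

# Subsetting with the negated test keeps only the available values.
clean <- x[!is.na(x)]
print(clean)
print(mean(clean))
```

Many functions also accept `na.rm = TRUE`, so `mean(x, na.rm = TRUE)` gives the same result without building `clean`.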

Data frames

R's fundamental data structures are data frames

The next tutorial will introduce them…
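As a preview, the neighbor survey from earlier in the session fits naturally into a data frame: one row per person, one column per question. In this sketch the first row uses the instructor's answers from the slides, the second row is entirely hypothetical, and the column names are illustrative choices.

```r
# Each column is a vector; all columns have the same length.
survey <- data.frame(
  name           = c("Zachary Dodds", "Example Neighbor"),  # 2nd row is hypothetical
  home           = c("Pittsburgh, PA", "Claremont, CA"),
  states_visited = c(44, 12),
  favorite_food  = c("M&Ms", "fries"),
  stringsAsFactors = FALSE
)

# str() summarizes the structure: column names, types, first values.
str(survey)

# The $ operator pulls out one column as a vector.
print(mean(survey$states_visited))
```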

Irises… (setosa, virginica)

data() yields many built-in data files. This is iris

Subsetting iris data

As with vectors, you can "subset" data frames.

df[rows,cols]
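The `df[rows,cols]` pattern can be tried directly on the built-in `iris` data. A minimal sketch, with illustrative variable names:

```r
# Load the built-in iris data frame (150 flowers, 5 columns).
data(iris)
print(dim(iris))

# Rows 1 to 3, all columns (leave the column slot empty).
first_rows <- iris[1:3, ]

# All rows, two columns chosen by name.
petals <- iris[, c("Petal.Length", "Species")]

# Rows chosen by a boolean test, just as with vectors.
setosa <- iris[iris$Species == "setosa", ]

print(first_rows)
print(nrow(setosa))
```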

Lab…

The 2nd part of each class meeting is dedicated to lab work.

I welcome you to stay for the lab, but it is not required.

Today's lab:

Work through Santorico and Shin's Tutorial for the R Statistical Package and submit the console sessions as pr2_1.txt, pr2_2.txt, pr2_3.txt, pr2_4.txt, and pr2_5.txt.

This is a nice reinforcement of vectors, introduction to data frames, and a look at the graphics that R supports.

Homework

Problem 3: Challenge exercises in R

These will reinforce the "subsetting" and data-analysis introduction from pr2's tutorial.

Problem 4: Introduction to Data Science, early chapters

This is a fuller background on R and the field of data science

(submit your console session for both of these…)

Lab !

CS vs. IS and IT ?

www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf

greater integration, system-wide issues

smaller details, machine specifics

Where will IS go?

Where will IT go?

The bigger picture

Weeks 10-12: Objects

Week 10: classes vs. objects

Week 11: methods and data

Week 12: inheritance

Weeks 13-15: Final Projects

Week 13: final projects

Week 14: final projects

Week 15: final exam


This class is truly seminar-style: we're developing expertise in this field together.