ECE 592 Topics in Data Science Dror Baron Associate Professor Dept. of Electrical and Computer Engr. North Carolina State University, NC, USA

Post on 25-Apr-2020


ECE 592: Topics in Data Science

Dror Baron, Associate Professor

Dept. of Electrical and Computer Engr., North Carolina State University, NC, USA

Today’s class

About instructor

Advice for new graduate students

Course structure

Motivation for data science
– What is data science?
– Applications
– Examples

Constantly improving course: please provide feedback

About the instructor

Dr. Dror Baron
– Email: [email protected]
– Office: EB2 2097
– Office hour: after Monday class or by appointment

At NC State since 2010

Also taught
– ECE 308 (control)
– ECE 421 (signal processing)
– ECE 514 (random processes)
– ECE 792 (universal algorithms in communication & signal proc.)

Research interests
– Statistical signal processing & information theory
– Information theoretic approaches to sparse signal processing
– Recent interest in large scale iterative algorithms

Advice for new graduate students

Who here is new?

Welcome!

Many international graduate students in ECE
– Hope you aren’t shocked
– If you aren’t sure, ask!
– Lots of cars; drivers are often unaware of pedestrians, so be careful


Course Structure

Course structure

Course webpage – contains relevant materials

Main course components:
– Providing feedback (might get a message board)
– Prerequisites
– Course purpose
– Outline / main topics
– Textbook(s)
– Matlab and/or Python
– Assignments (homeworks & projects)
– Grade structure

We have a tentative schedule and syllabus

Feedback

Message board (maybe)

Email

Questions?

Prerequisites

Eager to learn about data science

Coursework:
– ECE 421 (signal processing)
– ST 371 (probability)

Comfortable with linear algebra & probability

Comfortable with programming
– Big data means big datasets, so code must be fast
– Matlab and/or Python
– Will cover scientific programming

Course purpose

A big picture idea about data science
– Probabilistic / information theoretic perspective
– Scientific programming

What’s it good for?
– Learning from data
– Big data sets

Core techniques?

Components: math, computers, algorithms, data, …

Outline / main topics

Introduction/motivation

Scientific programming
– Computational complexity, data structures, profiling

Optimization
– Dynamic programming, linear programming, convex optimization, integer programming, EM algorithm

Machine learning basics
– Classification, clustering, regression

Sparse signal processing
– Wavelets, sparse acquisition & reconstruction

Dimensionality reduction
– Principal component analysis

New in 2019!

2016: too much sparse signal processing

2017: less sparsity; started with machine learning (ML); students wanted more (and more, and more…) of that
– End of semester: recommended to start with scientific programming (realized they lacked knowledge there)

2018: started with scientific programming
– Background on probability & information theory (helps understand ML); more optimization & ML; less sparsity

2019: improve projects/homeworks with TA

Textbook(s) and online references

No single textbook

Borrowing from multiple sources:
– Bishop, Pattern Recognition and Machine Learning
– MacKay, Information Theory, Inference, and Learning Algorithms
– Mohri et al., Foundations of Machine Learning
– Hastie et al., The Elements of Statistical Learning

Slides posted online
– Typing details for new stuff as we go along
– Please ask for supplemental material if helpful

Matlab and/or Python

Matlab: good for prototyping

Python:
– Closer to normal programming language
– Increasingly used in industry

Various languages used in data science
– SAS, R, …
– Core implementations often in C/C++, Java, …

Please download Matlab/Python to personal machines
– Links (including tutorial) on webpage

Assignments

Homeworks
– More math
– Less programming
– Every 2-3 weeks

Projects
– 3-4 “homework style” projects
– Integration of math, algorithm development, & programming
– Oriented around application and data
– Final project will focus on a topic of specific interest to students; groups of 2-3 students will submit a report and present to class

Both homeworks and projects may be submitted individually, in pairs, or in triples

Tentative grade structure

Homework: 15%
Projects: 20%
Final project: 20%
Midterm: 20%
Final exam: 25% (schedule determined by university)
– Note: 2 hour final exam

Extra credit: 2-3%

Motivation for course?

Why take ECE 592? [Students suggest reasons for doing so; we discuss]


Motivation & Applications

Keywords: big data, data science

What is data science?

Wikipedia: “Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics”

Extract knowledge from data

Multi-disciplinary (math, statistics, programming, …)

Why is it receiving attention?

Big data
– Petabytes ($10^{15}$ B) now commonplace
– Often requires multiple processors
  • Large amounts of storage
  • Clusters or GPUs

Big data as societal feedback system

Can extract bigger profits from bigger data
– Note: can replace “profit” by utility, societal benefit, etc.

[Diagram: feedback loop connecting improved computing capabilities, process more data, learn more from data, provide better service, profits, and buy more computers / spend more on R&D]

Application #1 – Click prediction

Show users ads online

Paid for clicks

Track various data related to each ad
– Ad topic, user history, geographic location, time of day, …

Better prediction → more ad revenue

Personal anecdote
– Read something about Audi
– Lots of Audi commercials that week (creepy?)

Application #2 – Speech recognition

User speaks into phone

Phone converts audio signal to text

We’re seeing more of this in automated call centers

Technical approach shifting from modeling speech (hidden Markov models) to training on lots of data
– Major trend due to increasing computational power

Application #3 – Mortgage defaults

Consumers have mortgages on homes

Some consumers stop paying (default)

Bank loses $

Want to predict who defaults and how much

Similar to click prediction (binary classification)

Possibly more complex (want default amount)

Similar: credit card payments

Application #4 – Financial prediction

Lots of financial assets (stocks, bonds, …) traded

Data about different assets
– Company sector, profits, growth rate, R&D spending, past prices, …

Want to predict future prices

Want to design portfolio that goes up with low volatility (small fluctuations)

Application #5 – Games

Go – popular game in Asia

Deep learning method trained on millions of games

Beat Korean champion player

Old approach – program computer to play chess

New approach – let computer look at (lots of) games

Application #6 – Identify handwriting

Post office wants to recognize zip codes

Seems “easy”
– Location of zip code on envelope can be identified
– Can partition into individual digits
– Only 10 digits

Typical approach – look at lots of data, compare each individual digit to a database, choose nearest neighbor(s)
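
The nearest-neighbor idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 3x3 binary "images", the three-entry database, and the function name `nearest_neighbor_label` are all hypothetical, and a real system would use far larger images and databases.

```python
import math

def nearest_neighbor_label(query, database):
    """Return the label of the database entry closest to `query`.

    `database` is a list of (feature_vector, label) pairs;
    closeness is measured by Euclidean distance.
    """
    best_label, best_dist = None, math.inf
    for features, label in database:
        dist = math.dist(query, features)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical 3x3 binary "images", flattened row by row (toy data).
database = [
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], 1),  # vertical stroke, labeled 1
    ([1, 1, 1, 1, 0, 1, 1, 1, 1], 0),  # ring, labeled 0
    ([1, 1, 1, 0, 0, 1, 0, 0, 1], 7),  # top bar plus right stroke, labeled 7
]

# A noisy "1": one extra pixel is switched on.
query = [0, 1, 0, 0, 1, 0, 0, 1, 1]
predicted = nearest_neighbor_label(query, database)
print(predicted)
```

Choosing several nearest neighbors and taking a majority vote is the usual way to make this less sensitive to a single noisy database entry.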

Application #7 – Autonomous cars

You heard about this, right?


Example: Polynomial Curve Fitting [Bishop – Sec. 1.1]

Keywords: curve fitting, least squares

Problem setting

Input variables $x = (x_1, \dots, x_N)^T$

Observe noisy target variables $t = (t_1, \dots, t_N)^T$
– Want to predict (future) target variables

Model for noisy observations: $t_n = \sin(2\pi x_n) + z_n$

Measurement noise $z_n$

Want to perform polynomial curve fitting

Find order-M polynomial that best explains t
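
As a rough sketch of this setting, the observation model can be simulated in Python. The slides do not specify how the inputs or the noise are distributed, so the uniform inputs on [0, 1), the Gaussian noise, the standard deviation of 0.3, and the name `make_data` are all assumptions made for illustration.

```python
import math
import random

def make_data(n, noise_std=0.3, seed=0):
    """Draw n inputs x_n and noisy targets t_n = sin(2*pi*x_n) + z_n.

    Assumptions (not from the slides): x_n uniform on [0, 1),
    z_n zero-mean Gaussian with standard deviation noise_std.
    """
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    t = [math.sin(2 * math.pi * xn) + rng.gauss(0.0, noise_std) for xn in x]
    return x, t

x, t = make_data(10)
for xn, tn in zip(x, t):
    print(f"x={xn:.3f}  t={tn:+.3f}")
```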

Why polynomial curve fitting?

Why might polynomial approximation to unknown function work?
– Taylor series – approximate function with polynomial

Maybe Fourier expansion is “better”
– It is in this case

Side information about problem very useful
– True function sparse in Fourier basis
– Sometimes we have side information; sometimes not

What does “best explains” mean?

Suppose polynomial weights w

We predict $y(x,w) = t' = w_0 + w_1 x + \dots + w_M x^M$

Expect $y(x,w) = t' \approx t(x)$

Let’s provide a score for weights w:

$$E(w) = \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2$$

Want w that minimizes E(w)
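
A small Python sketch shows how the score E(w) could be computed for given weights. The function names `poly_predict` and `sum_squared_error` and the three toy data points are hypothetical, chosen so the result is easy to check by hand.

```python
def poly_predict(x, w):
    """Evaluate y(x, w) = w[0] + w[1]*x + ... + w[M]*x**M with Horner's rule."""
    y = 0.0
    for coeff in reversed(w):
        y = y * x + coeff
    return y

def sum_squared_error(w, xs, ts):
    """E(w): sum over n of (y(x_n, w) - t_n)^2."""
    return sum((poly_predict(xn, w) - tn) ** 2 for xn, tn in zip(xs, ts))

# Toy data lying exactly on the line t = 1 + 2x.
xs = [0.0, 0.5, 1.0]
ts = [1.0, 2.0, 3.0]

print(sum_squared_error([1.0, 2.0], xs, ts))  # 0.0: the line fits all points
print(sum_squared_error([2.0], xs, ts))       # 2.0: constant 2 misses by -1, 0, +1
```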

Why squared error?

Our score for w sums over squared errors:

$$E(w) = \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2$$

Absolute error would emphasize “typical” errors, with less emphasis on larger ones
– Higher powers bring out outliers

Error metric may correspond to the statistical distribution of the noise (squared error corresponds to Gaussian noise)
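
The link between squared error and Gaussian noise can be made concrete with a short derivation. It is a sketch under an added assumption: the noise terms $z_n$ are independent, zero-mean Gaussian with a common variance $\sigma^2$ (the symbol $\sigma$ does not appear in the slides).

```latex
p(t \mid x, w) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -\frac{\{ y(x_n, w) - t_n \}^2}{2\sigma^2} \right)

-\ln p(t \mid x, w)
    = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2
      + \frac{N}{2} \ln(2\pi\sigma^2)
    = \frac{E(w)}{2\sigma^2} + \text{const}
```

Under this assumption, maximizing the likelihood over w is the same as minimizing E(w). Repeating the argument with Laplacian noise leads to the sum of absolute errors instead.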

Math analysis (will revisit details later)

Can write

$$\begin{bmatrix} y(x_1, w) \\ \vdots \\ y(x_N, w) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & x_1^M \\ \vdots & \ddots & \vdots \\ 1 & \cdots & x_N^M \end{bmatrix} \begin{bmatrix} w_0 \\ \vdots \\ w_M \end{bmatrix}$$

Shorthand: $y(x,w) = Xw$ (matrix-vector product)

Recall $t_n = \sin(2\pi x_n) + z_n$

Searching for vector w with minimal $\|y(x,w) - t\|_2$
– Recall $\ell_p$ norm: $\|z\|_p = \left[ \sum_n |z_n|^p \right]^{1/p}$
– Squared Euclidean norm: $\|z\|_2^2 = \sum_n (z_n)^2$

Will study “least squares”, which finds w that minimizes $\|Xw - t\|_2^2$
– Closed form: $w^* = (X^T X)^{-1} X^T t = X^+ t$ (assuming $X^T X$ is invertible)
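
The closed form can be tried out with NumPy. This is not the Matlab code from the course webpage; it is an independent sketch that assumes uniform inputs on [0, 1], Gaussian noise with standard deviation 0.3, and N=10, M=3. It compares the normal-equations solution with `np.linalg.lstsq`, which solves the same problem without forming $X^T X$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3  # number of observations and polynomial order (illustrative)

# Assumed data model: x uniform on [0, 1], Gaussian noise with std 0.3.
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

# Design matrix X with columns 1, x, x^2, ..., x^M, so that y = X @ w.
X = np.vander(x, M + 1, increasing=True)

# Closed form via the normal equations: solve (X^T X) w = X^T t.
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# Same minimizer from a least-squares solver (better conditioned in general).
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print("weights:", w_lstsq)
print("solutions agree:", np.allclose(w_normal, w_lstsq))
```

For small M both routes agree. As M grows, $X^T X$ becomes poorly conditioned, which is one reason a least-squares solver is usually preferred over inverting $X^T X$ directly.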

Let’s check it out (Matlab on webpage)

N=10 noisy observations, M=0 order

[Plot: observations, truth, and fitted polynomial for x from 0 to 1]

Higher order?

N=10 noisy observations, M=3 order

[Plot: observations, truth, and fitted polynomial for x from 0 to 1]

Even higher order?

N=50 noisy observations, M=20 order

Overfitting!

More observations?

N=1000 noisy observations, M=3 order

[Plot: observations, truth, and fitted polynomial for x from 0 to 1]

Discussion

More observations → better curve fit (fixed M)

Small M → constant or linear curve (bad fit)

Large M → overfitting (polynomial will go crazy)

Challenge: how to estimate “good” M?

Solution: test data
– Training data – for computing optimal weights w
– Test data – check how well w explains remaining data
– Find M that results in low error on the test data

Detailed discussion in book
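
Choosing M with test data can be sketched as follows. Everything specific here is an assumption made for illustration: the helper names `make_data`, `fit`, and `mse`, the noise level 0.3, the 20 training and 200 test points, and the candidate orders 0 through 9.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Assumed model: x uniform on [0, 1], t = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
    return x, t

def fit(x, t, M):
    """Least-squares weights of an order-M polynomial."""
    X = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

def mse(w, x, t):
    """Mean squared error of the polynomial with weights w on data (x, t)."""
    X = np.vander(x, len(w), increasing=True)
    return float(np.mean((X @ w - t) ** 2))

x_train, t_train = make_data(20)   # used only to compute the weights
x_test, t_test = make_data(200)    # held out to score each order M

orders = range(10)
train_err, test_err = {}, {}
for M in orders:
    w = fit(x_train, t_train, M)
    train_err[M] = mse(w, x_train, t_train)
    test_err[M] = mse(w, x_test, t_test)
    print(f"M={M}: train MSE={train_err[M]:.3f}, test MSE={test_err[M]:.3f}")

best_M = min(test_err, key=test_err.get)
print("order with lowest test error:", best_M)
```

Training error can only shrink as M grows, because each higher-order polynomial contains the lower-order ones, so it cannot be used to pick M. The held-out error is what reveals overfitting.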