cs535 big data 1/23/2019 week 1-a sangmi lee pallickaracs535/slides/week1-a.pdf1/23/2019 cs535 big...



Page 1: CS535 Big Data 1/23/2019 Week 1-A Sangmi Lee Pallickaracs535/slides/week1-A.pdf1/23/2019 CS535 Big Data -Spring 2019 Computer Science Department, Colorado State University Week 1-A31

CS535 Big Data 1/23/2019 Week 1-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 1

Week 1-A-0

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
1. INTRODUCTION TO BIG DATA

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

What is Big Data?

1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-1

Big Data
• Things one can do at a large scale that cannot be done at a smaller one
  • To extract new insights
  • To create new forms of value
• Big Data is about analytics of huge quantities of data in order to infer probabilities
• Big Data is NOT about trying to “teach” a computer to “think” like humans
  • Providing a quantitative dimension it never had before


The three (or four) Vs in Big Data
• Volume
  • Voluminous
  • It does not have to be a certain number of petabytes or quantity
• Velocity
  • How fast is the data coming in?
  • How fast do you need to be able to analyze and utilize it?
• Variety
  • Number of sources or incoming vectors
• Veracity
  • Can you trust the data itself, the source of the data, or the process?
  • User entry errors, redundancy, corruption of the values
  • Data cleaning


Related research areas
• Storage systems
  • How can we efficiently resolve queries on massive amounts of input data?
  • The input dataset may be presented in the form of a distributed data stream
• Machine learning
  • How can we efficiently solve large-scale machine learning problems?
  • The input data may be massive; stored in a distributed cluster of machines
• Distributed computing
  • How can we efficiently solve large-scale optimization problems in distributed computing environments?
  • For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large-scale graphs?


Who is using Big Data?


Photo Credit:https://datafloq.com/read/car-manufacturers-are-using-big-data/1204


Connected cars
• A single hybrid plug-in car generates up to 25 gigabytes per hour
• Connected cars
  • $130 billion by 2019
  • Traffic problem: re-routing based on the volume of traffic
  • Alerts the driver when road conditions are hazardous by automatically activating the anti-lock brakes
    • This information is shared by the vehicles that are nearby


The Artemis project: Saving “preemies” using Big Data
• The Artemis project
  • Dr. Carolyn McGregor
  • Toronto’s Hospital for Sick Children, University of Ontario Institute of Technology, and IBM
• Captures and processes the patients’ data in real time
  • 16 different data streams
    • Heart rate, respiration rate, temperature, blood pressure, and blood oxygen level
  • Around 1,260 data points per second
• System detects subtle changes that may signal the onset of infection 24 hours before overt symptoms appear
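
To give a sense of scale, the rate quoted on this slide can be turned into a daily count. This is only a back-of-the-envelope sketch: the constant and function names are illustrative, and it assumes the rate of around 1,260 data points per second is sustained around the clock, which the slide does not state.

```python
# Back-of-the-envelope arithmetic for the data rate quoted on the slide.
# Assumption (not from the slide): the rate is sustained 24 hours a day.
POINTS_PER_SECOND = 1_260          # "around 1,260 data points per second"
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds


def points_per_day(rate_per_second: int = POINTS_PER_SECOND) -> int:
    """Return the number of data points produced in one day at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{points_per_day():,} data points per day")
```

Under that assumption the monitoring setup produces on the order of a hundred million data points per day, which is why the data is processed in real time rather than reviewed by hand.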


Look Who’s Peeking at Your Paycheck
• Experian’s Income Insight
  • Estimates people’s income level
  • Based on their credit history
  • Trains the estimation model using selected credit history and tax information from the IRS

KAREN BLUMENTHAL, “Look Who’s Peeking at Your Paycheck”, The Wall Street Journal, Jan. 13, 2010, http://www.wsj.com/articles/SB10001424052748703672104574654211904801106


Week 1-A-14

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
2. COURSE INTRODUCTION

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

My Big Data Lab at Colorado State University
• Algorithmic and systems design
  • Scalable analytics over voluminous datasets on complex distributed architectures
• Research has been deployed in the following domains
  • Precision agriculture, atmospheric science, environmental biology, ecology, civil engineering, bioinformatics, and public health
• Awards
  • Cochran Family Professorship 2018-2021
  • IEEE TCSC Award for Excellence in Scalable Computing (Mid-Career Researcher) 2018
  • National Science Foundation CAREER Award 2016
• Funded by
  • The National Science Foundation
  • The Advanced Research Projects Agency-Energy (Department of Energy)
  • Department of Homeland Security
  • The Environmental Defense Fund
  • Google, Amazon, and Hewlett Packard


People

Saptashwa Mitra

Bibek Shrestha, Aaron Pereira, Kartik Khurana, Paahuni Khandelwal

Sangmi Pallickara

Undergraduate researchers, Spring 2018

Walid Budgaga, Dan Rammer, Ryan Becwar


Communications
• Course Website
  • http://www.cs.colostate.edu/~cs535
  • Announcements: Check the course website at least twice a week.
  • Schedule (course materials, readings, assignments)
  • Policies
• Canvas
  • Assignment submission
  • Discussion board
  • Grades
• Contact Me
  • [email protected]
  • Office hours: Friday 10:00AM ~ 11:00AM and by appointment
  • Office: CSB456
  • URL: http://www.cs.colostate.edu/~sangmi


Goal of this course
• Understanding fundamental concepts in Big Data Analytics
• Learn about existing technologies and how to apply them

[Diagram labels: Computing systems; Algorithms and models; Analytics; Predictive models; Graph models; Storage systems and middleware; Computing frameworks; Specialized modeling tools]


Course Structure
• Big Data Technology (Week 1 ~ Week 6)
• Kernel I: Big Data Analytics Advanced Case Study (Week 7, 9)
• Kernel II: Scalable Analytics Algorithms (Week 10, 11)
• Kernel III: Large Scale Graph Analysis (Week 12, 13)
• Kernel IV: Big Data Storage and Analytics (Week 14, 15)


Course Structure | Part A: Big Data Technology
• Big Data Technology
  • Purposes
    • Understand concepts of Big Data computing environment
    • Hands-on experience
  • Topics
    • Introduction to Big Data
    • Lambda Model
    • Distributed file system
    • Quick view of MapReduce
    • Introduction to Apache Spark
    • Analytics with Apache Storm


Course Structure | Part B: Kernel I ~ IV
• Purpose
  • Understanding different aspects of Big Data research with lectures and workshops
• Kernels
  • Kernel I: Big Data Advanced Analytics Case Study
  • Kernel II: Scalable Analytics Algorithms
  • Kernel III: Large Scale Graph Analysis
  • Kernel IV: Big Data Storage with Analytics
• Duration: 2 weeks
  • 3 (x 75 minutes) lectures
  • 1 (x 75 minutes) workshop


Course Structure | Part B: Kernel I ~ IV
• Workshop
  • Students present leading-edge research papers
  • 3-4 papers per workshop
  • We will have 4 workshops
• Team Requirements
  • Presenter
    • 1-2 papers
    • Report and presentation
  • Reader
    • 3-4 papers
    • Report
  • Participation


Course Component | Programming Assignments
• Programming Assignment 1
  • Implementing a link analysis algorithm over the Wikipedia pages using Apache Spark
  • Due on 2/26 5:00PM
  • The description of PA1 is available now.
• Programming Assignment 2
  • Implementing real-time Twitter stream analysis using Apache Storm
  • Due on 3/26 5:00PM
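
The slide does not name the link analysis algorithm used in Programming Assignment 1. As one well-known example of link analysis, a PageRank-style power iteration can be sketched in plain Python (not Spark) on a toy graph. The page names, the damping factor of 0.85, and the iteration count are illustrative assumptions, not the assignment's specification.

```python
# A minimal PageRank-style sketch on a toy link graph.
# Assumptions: plain Python instead of Spark, a damping factor of 0.85,
# and a fixed number of iterations; none of these come from the slide.
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Return a rank per page; the ranks sum to 1."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1 - damping) / n + damping * dangling / n
                     for p in pages}
        # Each page passes an equal share of its rank to every page it links to.
        for page, targets in links.items():
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, rank in sorted(pagerank(toy_graph).items()):
        print(page, round(rank, 4))
```

In a Spark version, the per-page contributions would typically be computed over a distributed collection of (page, links) pairs rather than Python dictionaries; the iteration structure stays the same.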


Course Component | Programming Assignments
• All of the programming assignments are group submissions
• Submission should be via Canvas
• Late policy for the assignment submissions
  • Up to a maximum of 2 days past the deadline
  • A 10% penalty per day will be applied
• Each group will provide a demo of the programming assignment in CSB120
• Each assignment will count for 10% of the total score of this course
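
The late policy above can be read as a small calculation. The sketch below is one possible interpretation, not the official rule: it assumes that any started 24-hour period counts as a full day, that the penalty is a fraction of the earned score, and that a submission more than 2 days late receives no credit. The function and constant names are hypothetical.

```python
# One possible reading of the late policy; the slide leaves details open.
import math

MAX_LATE_DAYS = 2        # from the slide: up to 2 days past the deadline
PENALTY_PER_DAY = 0.10   # from the slide: 10% penalty per day


def late_adjusted_score(raw_score: float, hours_late: float) -> float:
    """Return the score after applying the late penalty.

    Assumptions (not stated on the slide): a started day counts as a full
    day, the penalty is taken from the earned score, and submissions more
    than MAX_LATE_DAYS late receive zero.
    """
    if hours_late <= 0:
        return raw_score
    days_late = math.ceil(hours_late / 24)
    if days_late > MAX_LATE_DAYS:
        return 0.0
    return raw_score * (1 - PENALTY_PER_DAY * days_late)


if __name__ == "__main__":
    for hours in (0, 5, 30, 49):
        print(hours, "hours late ->", round(late_adjusted_score(90.0, hours), 2))
```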


Course Component | Quizzes
• 7-8 quizzes
• Two lowest scores will be eliminated
• Quizzes will count for 20% of the total score of this course
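
As an illustration of the quiz rule, the sketch below drops the two lowest scores and scales the rest to the 20% quiz share. It assumes each quiz is scored from 0 to 100 and that the remaining quizzes count equally; the slide does not say how quizzes are weighted against each other, and the function name is hypothetical.

```python
# Sketch of "two lowest scores will be eliminated" for the 20% quiz share.
# Assumptions: quizzes are scored 0-100 and the kept quizzes count equally.
from typing import Sequence


def quiz_component(scores: Sequence[float],
                   dropped: int = 2,
                   weight: float = 20.0) -> float:
    """Return the points (out of `weight`) earned from quizzes."""
    if len(scores) <= dropped:
        raise ValueError("need more quiz scores than the number dropped")
    kept = sorted(scores)[dropped:]      # discard the lowest `dropped` scores
    average = sum(kept) / len(kept)      # mean of the remaining quizzes
    return weight * average / 100.0


if __name__ == "__main__":
    print(quiz_component([100, 80, 60, 90, 70, 50, 85]))
```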


Course Component | Term Project
• Objectives
  • Students identify their topics for the term project
  • Students provide methodology to solve their problem
  • Students implement software solution
  • Students provide evaluation scheme for their software


Course Component | Term Project
• Term project grading (40% of this course)
  • Term project planning: 1% (e.g. team members, a tentative title of your project)
  • Proposal document: 4%
  • Proposal presentation (peer review): 2%
  • Final paper: 24%
  • Final demonstration: 2%
  • Final presentation (peer review): 3%
  • Participation: 4%
• Late policy for the deliverable submissions
  • Up to a maximum of 2 days past the deadline
  • A 10% penalty per day will be applied


Course Component | Term Project
• Highlights of the Previous Term Projects
  • Supporting Emergency Response During Natural Disasters with Twitter Data
  • Winning Words in the Supreme Court
  • Mendel: A Distributed Storage System for Efficient Sequence Alignment and Similarity Searching (published in IEEE IPDPS 2016)
  • Processing Smart Grid Data In Real Time (DEBS grand challenge 2014)


Course Component | Term Project
• Highlights of the Previous Term Projects
  • “Time to Answer” for Questions on stackoverflow.com using MapReduce
  • Analysis of words for spell checking in search queries using digitized books and articles
  • Efficient Boolean Symmetric Searchable Encryption
  • Big Data I/O Performance Improvement Using Buffered B-Tree Algorithm
  • Node and Metadata Visualization in a Distributed Hash Table
  • Who is Building Wikipedia?


Grading | Overview

Category                    Percentage of Final Grade
Programming Assignments     20%
  Assignment 1: 10% (1% participation included)
  Assignment 2: 10% (1% participation included)
Term Project                40%
  D0: Term project planning: 1%
  D1: Proposal document: 4%
  D1: Proposal presentation: 2%
  D2: Final paper: 24%
  D2: Final demonstration: 2%
  D3: Final presentation: 3%
  Participation: 4%
Quizzes                     20%
Workshop                    20% (2% participation included)


Grading | Participation Scores


10% of these scores will be team peer-reviewed participation score

Category                    Percentage of Final Grade
Programming Assignments     20%
  Assignment 1: 10% (1% participation included)
  Assignment 2: 10% (1% participation included)
Term Project                40%
  D0: Term project planning: 1%
  D1: Proposal document: 4%
  D1: Proposal presentation: 2%
  D2: Final paper: 24%
  D2: Final demonstration: 2%
  D3: Final presentation: 3%
  Participation: 4%
Quizzes                     20%
Workshop                    20% (2% participation included)

Grading | Final Letter Grade

Letter Grade    Total Percentage
A               90.00 % and higher
B               80.00 ~ 89.99 %
C               70.00 ~ 79.99 %
D               60.00 ~ 69.99 %
F               Below 60.00 %
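
The category weights from the grading overview and the letter-grade cutoffs can be combined into a short calculation. This is a sketch only: the dictionary keys and function names are hypothetical, each category score is assumed to be a percentage from 0 to 100, and no rounding is applied, so a total just under a cutoff (for example 89.995) falls in the lower band.

```python
# Sketch: weighted course total and letter grade from the slide's tables.
# Assumptions: category scores are percentages (0-100); no rounding is applied.
from typing import Dict

# Category weights in percent of the final grade (from the grading overview).
WEIGHTS: Dict[str, float] = {
    "programming_assignments": 20.0,
    "term_project": 40.0,
    "quizzes": 20.0,
    "workshop": 20.0,
}

# Lower bounds for each letter grade (from the letter-grade table).
CUTOFFS = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]


def total_percentage(category_scores: Dict[str, float]) -> float:
    """Combine per-category percentages into a course total (0-100)."""
    return sum(weight * category_scores[name] / 100.0
               for name, weight in WEIGHTS.items())


def letter_grade(total: float) -> str:
    """Map a course total to a letter grade using the cutoffs above."""
    for cutoff, letter in CUTOFFS:
        if total >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    scores = {"programming_assignments": 95.0, "term_project": 88.0,
              "quizzes": 80.0, "workshop": 90.0}
    total = total_percentage(scores)
    print(round(total, 2), letter_grade(total))
```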


Course Component | Course Policy
• No make-up for missed quizzes and exams
  • Except for the case where the student provided advance written notice to the instructor based on an emergency
  • Supporting paperwork will be requested
• Two lowest quiz scores will be eliminated at the end of the semester


Course Component | Course Policy
• No cell phones in the class.
• No laptops in the class.
  • If you need to use a laptop during lectures, please sit in the back row.
  • I will ask you to turn off your laptop if needed.


More importantly,
• Attend the class, ask questions, and discuss
• Check the course web page and Canvas regularly
• Try new technologies and apply them
• Share your experiences with other students in class


Questions?
