Transcript
CS535 BIG DATA
Week 1-A, 1/23/2019 (Spring 2019)

PART A. BIG DATA TECHNOLOGY
1. INTRODUCTION TO BIG DATA

Sangmi Lee Pallickara
Computer Science Department, Colorado State University
http://www.cs.colostate.edu/~cs535

What is Big Data?


Big Data
• Things one can do at a large scale that cannot be done at a smaller one
  • To extract new insights
  • To create new forms of value
• Big Data is about analytics of huge quantities of data in order to infer probabilities
• Big Data is NOT about trying to “teach” a computer to “think” like humans
  • Providing a quantitative dimension it never had before


The three (or four) Vs in Big Data
• Volume
  • Voluminous
  • It does not have to be a certain number of petabytes or a specific quantity.
• Velocity
  • How fast is the data coming in?
  • How fast do you need to be able to analyze and utilize it?
• Variety
  • Number of sources or incoming vectors
• Veracity
  • Can you trust the data itself, the source of the data, or the process?
  • User entry errors, redundancy, corruption of the values
  • Data cleaning
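The veracity issues listed above (entry errors, redundancy, corrupted values) can be illustrated with a minimal data-cleaning sketch. The record layout `(sensor_id, timestamp, value)`, the function name `clean_readings`, and the valid range are all assumptions made for this illustration; they do not come from the course material.

```python
import math


def clean_readings(readings, low, high):
    """Return readings with corrupted, out-of-range, and duplicate entries removed.

    Each reading is a hypothetical (sensor_id, timestamp, value) tuple.
    """
    seen = set()
    cleaned = []
    for sensor_id, timestamp, value in readings:
        # Corruption: the value is missing, non-numeric, NaN, or infinite.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        if not math.isfinite(value):
            continue
        # Entry errors: the value falls outside the plausible range.
        if not (low <= value <= high):
            continue
        # Redundancy: keep only the first reading per (sensor, timestamp).
        key = (sensor_id, timestamp)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((sensor_id, timestamp, value))
    return cleaned


readings = [
    ("s1", 0, 20.5),
    ("s1", 0, 20.5),          # duplicate
    ("s1", 1, float("nan")),  # corrupted
    ("s2", 0, 999.0),         # implausible entry
    ("s2", 1, None),          # missing
    ("s2", 2, 21.0),
]
print(clean_readings(readings, low=-40.0, high=60.0))
```

Real pipelines usually repair or flag bad records rather than silently dropping them; dropping keeps the sketch short.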


Related research areas
• Storage systems
  • How can we efficiently resolve queries on massive amounts of input data?
  • The input dataset may be presented in the form of a distributed data stream
• Machine learning
  • How can we efficiently solve large-scale machine learning problems?
  • The input data may be massive and stored in a distributed cluster of machines
• Distributed computing
  • How can we efficiently solve large-scale optimization problems in distributed computing environments?
  • For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large-scale graphs?


Who is using Big Data?


Photo Credit:https://datafloq.com/read/car-manufacturers-are-using-big-data/1204


Connected cars
• A single plug-in hybrid car generates up to 25 gigabytes of data per hour
• Connected cars
  • $130 billion by 2019
  • Traffic problems: re-routing based on the volume of traffic
  • Alerts the driver when road conditions are hazardous by automatically activating the anti-lock brakes
  • This information is shared with nearby vehicles


The Artemis project: Saving “preemies” using Big Data
• The Artemis project
  • Dr. Carolyn McGregor
  • Toronto’s Hospital for Sick Children, University of Ontario Institute of Technology, and IBM
• Captures and processes the patients’ data in real time
  • 16 different data streams
  • Heart rate, respiration rate, temperature, blood pressure, and blood oxygen level
  • Around 1,260 data points per second
• The system detects subtle changes that may signal the onset of infection 24 hours before overt symptoms appear
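To give a sense of scale for the figures above, the short sketch below works out how many data points accumulate at 1,260 points per second and the average rate per stream. It assumes the rate is constant and evenly spread over the 16 streams, which the slide does not state; the function names are illustrative only.

```python
POINTS_PER_SECOND = 1260        # from the slide
STREAMS = 16                    # from the slide
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def points_per_day(points_per_second):
    """Data points accumulated over 24 hours at a constant rate (assumption)."""
    return points_per_second * SECONDS_PER_DAY


def average_points_per_stream(points_per_second, streams):
    """Average per-stream rate, assuming an even split across streams."""
    return points_per_second / streams


print(points_per_day(POINTS_PER_SECOND))                      # 108864000
print(average_points_per_stream(POINTS_PER_SECOND, STREAMS))  # 78.75
```

Under these assumptions, roughly 109 million data points per patient arrive in the 24-hour window before overt symptoms appear.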


Look Who’s Peeking at Your Paycheck
• Experian’s Income Insight
  • Estimates people’s income level
  • Based on their credit history
  • Trains the estimation model using selected credit history and tax information from the IRS

Karen Blumenthal, “Look Who’s Peeking at Your Paycheck”, The Wall Street Journal, Jan. 13, 2010, http://www.wsj.com/articles/SB10001424052748703672104574654211904801106


CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
2. COURSE INTRODUCTION

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

My Big Data Lab at Colorado State University
• Algorithmic and systems design
  • Scalable analytics over voluminous datasets on complex distributed architectures
• Research has been deployed in the following domains
  • Precision agriculture, atmospheric science, environmental biology, ecology, civil engineering, bioinformatics, and public health
• Awards
  • Cochran Family Professorship 2018-2021
  • IEEE TCSC Award for Excellence in Scalable Computing (Mid-Career Researcher) 2018
  • National Science Foundation CAREER Award 2016
• Funded by
  • The National Science Foundation
  • The Advanced Research Projects Agency-Energy (Department of Energy)
  • Department of Homeland Security
  • The Environmental Defense Fund
  • Google, Amazon, and Hewlett Packard


People

Saptashwa Mitra

Bibek Shrestha, Aaron Pereira, Kartik Khurana, Paahuni Khandelwal

Sangmi Pallickara

Undergraduate researchers, Spring 2018

Walid Budgaga, Dan Rammer, Ryan Becwar


Communications
• Course Website
  • http://www.cs.colostate.edu/~cs535
  • Announcements: Check the course website at least twice a week.
  • Schedule (course materials, readings, assignments)
  • Policies
• Canvas
  • Assignment submission
  • Discussion board
  • Grades
• Contact Me
  • [email protected]
  • Office hours: Friday 10:00 AM to 11:00 AM and by appointment
  • Office: CSB456
  • URL: http://www.cs.colostate.edu/~sangmi


Goal of this course
• Understand fundamental concepts in Big Data analytics
• Learn about existing technologies and how to apply them

[Figure: course topic areas. Labels: Computing systems; Algorithms and models; Analytics; Predictive models; Graph models; Storage systems and middleware; Computing frameworks; Specialized modeling tools]


Course Structure
• Big Data Technology: Week 1 ~ Week 6
• Kernel I: Big Data Analytics Advanced Case Study: Week 7, 9
• Kernel II: Scalable Analytics Algorithms: Week 10, 11
• Kernel III: Large Scale Graph Analysis: Week 12, 13
• Kernel IV: Big Data Storage and Analytics: Week 14, 15


Course Structure | Part A: Big Data Technology
• Big Data Technology
  • Purposes
    • Understand concepts of the Big Data computing environment
    • Hands-on experience
  • Topics
    • Introduction to Big Data
    • Lambda Model
    • Distributed file system
    • Quick view of MapReduce
    • Introduction to Apache Spark
    • Analytics with Apache Storm


Course Structure | Part B: Kernel I ~ IV
• Purpose
  • Understanding different aspects of Big Data research with lectures and workshops
• Kernels
  • Kernel I: Big Data Advanced Analytics Case Study
  • Kernel II: Scalable Analytics Algorithms
  • Kernel III: Large Scale Graph Analysis
  • Kernel IV: Big Data Storage with Analytics
• Duration: 2 weeks
  • 3 lectures (75 minutes each)
  • 1 workshop (75 minutes)


Course Structure | Part B: Kernel I ~ IV
• Workshop
  • Students present leading-edge research papers
  • 3-4 papers per workshop
  • We will have 4 workshops
• Team Requirements
  • Presenter
    • 1-2 papers
    • Report and presentation
  • Reader
    • 3-4 papers
    • Report
  • Participation


Course Component | Programming Assignments
• Programming Assignment 1
  • Implementing a link analysis algorithm over the Wikipedia pages using Apache Spark
  • Due on 2/26 at 5:00 PM
  • The description of PA1 is available now.
• Programming Assignment 2
  • Implementing real-time Twitter stream analysis using Apache Storm
  • Due on 3/26 at 5:00 PM


Course Component | Programming Assignments
• All programming assignments are group submissions
• Submissions should be made via Canvas
• Late policy for assignment submissions
  • Accepted up to a maximum of 2 days past the deadline
  • A 10% penalty per day will be applied
• Each group will demo its programming assignment in CSB120
• Each assignment counts for 10% of the total score of this course
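The late policy above can be written as a small function. This is only a sketch of one reading of the policy: it assumes a partial day counts as a full day, that the 10% is taken from the earned score for each late day, and that anything more than 2 days late scores zero. None of these details are spelled out on the slide, and `late_adjusted_score` is a hypothetical name.

```python
import math


def late_adjusted_score(raw_score, hours_late, penalty_per_day=0.10, max_late_days=2):
    """Apply a per-day late penalty under the assumptions stated above."""
    if hours_late <= 0:
        return raw_score                     # on time: no penalty
    days_late = math.ceil(hours_late / 24)   # assumption: partial days round up
    if days_late > max_late_days:
        return 0.0                           # assumption: not accepted after 2 days
    return raw_score * (1 - penalty_per_day * days_late)


print(late_adjusted_score(90, 0))    # on time
print(late_adjusted_score(90, 5))    # one day late
print(late_adjusted_score(90, 30))   # two days late
print(late_adjusted_score(90, 49))   # past the 2-day window
```

The same function would cover the term project deliverables, which use the same late policy.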


Course Component | Quizzes
• 7-8 quizzes
• The two lowest scores will be eliminated
• Quizzes count for 20% of the total score of this course
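As a worked example of the quiz rule, the sketch below drops the two lowest quiz scores and scales the average of the rest to the 20% quiz weight. It assumes every quiz is recorded as a percentage (0 to 100) and that the remaining quizzes are weighted equally; the slide does not specify either, and `quiz_component` is a hypothetical name.

```python
def quiz_component(quiz_percentages, drop_lowest=2, weight=20.0):
    """Points (out of `weight`) contributed by quizzes after dropping the lowest scores."""
    if len(quiz_percentages) <= drop_lowest:
        raise ValueError("need more quizzes than the number dropped")
    kept = sorted(quiz_percentages)[drop_lowest:]  # discard the lowest scores
    average = sum(kept) / len(kept)                # assumption: equal weighting
    return weight * average / 100.0


scores = [100, 80, 60, 90, 70, 0, 50]   # seven hypothetical quiz percentages
print(quiz_component(scores))           # 16.0: the 0 and the 50 are dropped
```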


Course Component | Term Project
• Objectives
  • Students identify their topics for the term project
  • Students provide a methodology to solve their problem
  • Students implement a software solution
  • Students provide an evaluation scheme for their software


Course Component | Term Project
• Term project grading (40% of this course)
  • Term project planning: 1% (e.g. team members, a tentative title of your project)
  • Proposal document: 4%
  • Proposal presentation (peer review): 2%
  • Final paper: 24%
  • Final demonstration: 2%
  • Final presentation (peer review): 3%
  • Participation: 4%
• Late policy for deliverable submissions
  • Accepted up to a maximum of 2 days past the deadline
  • A 10% penalty per day will be applied


Course Component | Term Project
• Highlights of the Previous Term Projects
  • Supporting Emergency Response During Natural Disasters with Twitter Data
  • Winning Words in the Supreme Court
  • Mendel: A Distributed Storage System for Efficient Sequence Alignment and Similarity Searching (published in IEEE IPDPS 2016)
  • Processing Smart Grid Data In Real Time (DEBS grand challenge 2014)


Course Component | Term Project
• Highlights of the Previous Term Projects

• “Time to Answer” for Questions on stackoverflow.com using MapReduce

• Analysis of words for spell checking in search queries using digitized books and articles

• Efficient Boolean Symmetric Searchable Encryption

• Big Data I/O Performance Improvement Using Buffered B-Tree Algorithm

• Node and Metadata Visualization in a Distributed Hash Table

• Who is Building Wikipedia?


Grading | Overview

Category                          Percentage of Final Grade
Programming Assignments           20%
  Assignment 1                    10% (1% participation included)
  Assignment 2                    10% (1% participation included)
Term Project                      40%
  D0: Term project planning       1%
  D1: Proposal document           4%
  D1: Proposal presentation       2%
  D2: Final paper                 24%
  D2: Final demonstration         2%
  D3: Final presentation          3%
  Participation                   4%
Quizzes                           20%
Workshop                          20% (2% participation included)
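The category weights above combine into a final percentage as a weighted sum. The sketch below is one way to compute it, assuming each category score is already expressed as a percentage (0 to 100) of that category's points; the dictionary keys and the function name `final_percentage` are illustrative, not part of the course material.

```python
# Category weights from the grading overview (percent of the final grade).
WEIGHTS = {
    "programming_assignments": 20,
    "term_project": 40,
    "quizzes": 20,
    "workshop": 20,
}


def final_percentage(category_percentages):
    """Weighted sum of per-category percentages, returned on a 0-100 scale."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("category weights must sum to 100")
    return sum(
        WEIGHTS[name] * category_percentages[name] for name in WEIGHTS
    ) / 100


example = {
    "programming_assignments": 90,
    "term_project": 85,
    "quizzes": 80,
    "workshop": 95,
}
print(final_percentage(example))  # 87.0
```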


Grading | Participation Scores

10% of these scores will be a team peer-reviewed participation score.

Category                          Percentage of Final Grade
Programming Assignments           20%
  Assignment 1                    10% (1% participation included)
  Assignment 2                    10% (1% participation included)
Term Project                      40%
  D0: Term project planning       1%
  D1: Proposal document           4%
  D1: Proposal presentation       2%
  D2: Final paper                 24%
  D2: Final demonstration         2%
  D3: Final presentation          3%
  Participation                   4%
Quizzes                           20%
Workshop                          20% (2% participation included)

Grading | Final Letter Grade

Letter Grade    Total Percentage
A               90.00 % and higher
B               80.00 ~ 89.99 %
C               70.00 ~ 79.99 %
D               60.00 ~ 69.99 %
F               Below 60.00 %
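The letter-grade cutoffs above map directly onto a threshold lookup. The sketch below assumes no rounding is applied before the comparison (so 89.99 is a B), which the table implies but does not state outright; `letter_grade` is a hypothetical helper name.

```python
def letter_grade(total_percentage):
    """Map a total percentage to a letter grade using the cutoffs in the table."""
    cutoffs = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]
    for minimum, letter in cutoffs:
        if total_percentage >= minimum:
            return letter
    return "F"


for total in (95.0, 89.99, 70.0, 60.0, 59.99):
    print(total, letter_grade(total))
```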


Course Component | Course Policy
• No make-ups for missed quizzes and exams
  • Exception: the student provided advance written notice to the instructor because of an emergency
  • Supporting paperwork will be requested
• The two lowest quiz scores will be eliminated at the end of the semester


Course Component | Course Policy• No Cell-phones in the class.

• No Laptops in the class.

• If you need to use a laptop during lectures, please sit in the back row.

• I will ask you to turn off your laptop if needed.


More importantly,
• Attend the class, ask questions, and discuss
• Check the course web page and Canvas regularly
• Try new technologies and apply them
• Share your experiences with other students in class


Questions?
