CS535 Big Data 1/23/2019 Week 1-A Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 1
Week 1-A-0
CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY1. INTRODUCTION TO BIG DATA
Sangmi Lee PallickaraComputer Science, Colorado State Universityhttp://www.cs.colostate.edu/~cs535
What is Big Data?
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-1
Big Data• Things one can do at a large scale that cannot be done at a smaller one
• To extract new insights• Create new forms of values
• Big Data is about analytics of huge quantities of data in order to infer probabilities• Big Data is NOT about trying to “teach” a computer to “think” like humans
• Providing a quantitative dimension it never had before
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-2
The three(or four) Vs in Big Data• Volume
• Voluminous• It does not have to be certain number of petabytes or quantity.
• Velocity• How fast the data is coming in?• How fast you need to be able to analyze and utilize it
• Variety • Number of sources or incoming vectors
• Veracity• Can you trust the data itself, source of the data, or the process?• User entry errors, redundancy, corruption of the values• Data cleaning
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-3
Related research areas• Storage systems
• How can we efficiently resolve queries on massive amounts of input data?• The input dataset may be presented in the form of a distributed data stream
• Machine learning• How can we efficiently solve large-scale machine learning problems?• The input data may be massive; stored in a distributed cluster of machines
• Distributed computing• How can we efficiently solve large-scale optimization problems in distributed computing environments?• For example, how can we efficiently solve large-scale combinatorial problems, e.g. processing of large
scale graphs?
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-4
Who is using Big Data?
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-5
CS535 Big Data 1/23/2019 Week 1-A Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 2
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-6 1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-7
Photo Credit:https://datafloq.com/read/car-manufacturers-are-using-big-data/1204
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-8
Connected cars• Single hybrid plug-in car generates up to 25 gigabytes per hour
• Connected cars• $130 billion by 2019• Traffic problem, re-routing based on the volume of traffic• Alerts driver when a road conditions are hazardous by automatically activating anti-lock break
• This information is shared by the vehicles that are nearby
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-9
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-10
The Artemis project: Saving “preemies” using Big Data• The Artemis project
• Dr. Carolyn McGregor
• Toronto’s Hospital for Sick Children, University of Ontario Institute of Technology and IBM• Captures and process the patients’ data in real time
• 16 different data streams
• Heart rate, respiration rate, temperature, blood pressure and blood oxygen level• Around 1,260 data points per second
• System detects subtle changes that may signal the onset of infection 24 hours before overt symptoms appear
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-11
CS535 Big Data 1/23/2019 Week 1-A Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 3
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-12
Look Who’s Peeking at Your Paycheck• Experian’s Income Insight
• Estimates people’s income level• Based on their credit history • Trains the estimation model using selected credit history and tax information from IRS
KAREN BLUMENTHAL, “Look W ho’s Peeking at Your Paycheck”, The Wall Street Journal, Jan. 13, 2010, http://www.wsj.com/articles SB10001424052748703672104574654211904801106
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-13
Week 1-A-14
CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY2. COURSE INTRODUCTION
Sangmi Lee PallickaraComputer Science, Colorado State Universityhttp://www.cs.colostate.edu/~cs535
My Big Data Lab at Colorado State University• Algorithmic and systems design
• Scalable analytics over voluminous datasets on complex distributed architectures
• Research has been deployed in the following domains• Precision agriculture, atmosphere science, environmental biology, ecology, civil engineering, bioinformatics, and public
health
• Awards• Cochran Family Professorship 2018-2021• IEEE TCSC Award for Excellence in Scalable Computing (Mid-Career Researcher) 2018• National Science Foundation CAREER Award 2016
• Funded by • The National Science Foundation• The Advanced Research Projects Agency-Energy (Department of Energy) • Department of Homeland Security• The Environmental Defense Fund• Google, Amazon, and Hewlett Packard
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-15
People
Saptashwa Mitra
Bibek Shrestha Aaron PereiraKartik Khurana Paahuni Khandelwal
Sangmi Pallickara
Undergraduate researchers, Spring 2018
Walid BudgagaDan Rammer Ryan Becwar
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-16
Communications• Course Website
• http://www.cs.colostate.edu/~cs535• Announcements: Check the course website at least twice a week.• Schedule (course materials, readings, assignments)• Policies
• Canvas• Assignment submission• Discussion board• Grades
• Contact Me• [email protected]• Office hour: Friday 10:00AM ~ 11:00AM and by appointment• Office: CSB456• URL: http://www.cs.colostate.edu/~sangmi
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-17
CS535 Big Data 1/23/2019 Week 1-A Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 4
Goal of this course • Understanding fundamental concepts in Big Data Analytics • Learn about existing technologies and how to apply them
Computing systems
Algorithms and models
Analytics
Predictive models
Graph models
Storage systems and middle ware
Computing frameworks
Specialized modeling tools
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-18
Course Structure
Big Data Technology
Week 1 ~ Week 6
Kernel I: Big Data Analytics Advanced Case Study
Week 7, 9
Kernel II: Scalable Analytics Algorithms
Week 10, 11
Kernel III: Large Scale Graph Analysis
Week 12, 13
Kernel IV: Big Data Storage and Analytics
Week 14, 15
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-19
Course Structure | Part A: Big Data Technology• Big Data Technology
• Purposes• Understand concepts of Big Data computing environment • Hands-on experience
• Topics• Introduction to Big Data• Lambda Model
• Distributed file system• Quick view of MapReduce • Introduction to Apache Spark• Analytics with Apache Storm
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-20
Course Structure | Part B: Kernel I ~ IV• Purpose
• Understanding different aspects of Big Data research with lectures and workshops
• Kernels• Kernel I: Big Data Advanced Analytics Case Study• Kernel II: Scalable Analytics Algorithms• Kernel III: Large scale Graph Analysis• Kernel IV: Big Data Storage with Analytics
• Duration: 2 weeks• 3 (x 75 minutes) Lectures• 1 (x 75 minutes) workshop
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-21
Course Structure | Part B: Kernel I ~ IV• Workshop
• Students present leading edge research papers • 3-4 papers per workshop• We will have 4 workshops
• Team Requirements• Presenter
• 1-2 papers
• Report and presentation
• Reader• 3-4 papers
• Report
• Participation
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-22
Course Component | Programing Assignments• Programming Assignment 1
• Implementing link analysis algorithm over the Wikipedia pages using Apache Spark
• Due on 2/26 5:00PM
• The description of PA1 is available now.
• Programming Assignment 2
• Implementing real-time Twitter stream analysis using Apache Storm
• Due on 3/26 5:00PM
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-23
CS535 Big Data 1/23/2019 Week 1-A Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 5
Course Component | Programing Assignments• All of the programming assignments are group submission• Submission should be via canvas• Late policy for the assignment submissions
• Up to a maximum of 2 day past the deadline. • 10% penalty per day will be applied
• Each group will provide demo of the programming assignment in CSB120• Each assignment will count 10% of total score of this course
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-24
Course Component | Quizzes• 7-8 Quizzes• Two lowest scores will be eliminated• Quizzes will count 20% of total score of this course
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-25
Course Component | Term Project• Objectives
• Students identify their topics for the term project• Students provide methodology to solve their problem• Students implement software solution• Students provide evaluation scheme for their software
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-26
Course Component | Term Project• Term project grading (40% of this course)
• Term project planning: 1% (e.g. team members, a tentative title of your project)• Proposal document: 4%• Proposal presentation (peer review) : 2%• Final paper: 24% • Final demonstration: 2%• Final presentation (peer review): 3%• Participation: 4%
• Late policy for the deliverable submissions• Up to a maximum of 2 day past the deadline. • 10% penalty per day will be applied
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-27
Course Component | Term Project• Highlights of the Previous Term Projects
• Supporting Emergency Response During Natural Disasters with Twitter Data
• Winning Words in the Supreme Court
• Mendel: A Distributed Storage System for Efficient Sequence Alignment and Similarity Searching
(published in IEEE IPDPS 2016)
• Processing Smart Grid Data In Real Time (DEBS grand challenge 2014)
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-28
Course Component | Term Project• Highlights of the Previous Term Projects
• Time to Answer” for Questions on stackoverflow.com using Map Reduce
• Analysis of words for spell checking in search queries using digitized books and articles
• Efficient Boolean Symmetric Searchable Encryption
• Big Data I/O Performance Improvement Using Buffered B-Tree Algorithm
• Node and Metadata Visualization in a Distributed Hash Table
• Who is Building Wikipedia?
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-29
CS535 Big Data 1/23/2019 Week 1-A Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 6
Grading | OverviewCategory Percentage of Final GradeProgramming Assignments 20%
Assignment 1: 10% (1% participation included)Assignment 2: 10% (1% participation included)
Term Project 40%D0: Term project planning: 1%D1: Proposal document: 4%D1: Proposal presentation: 2%D2: Final paper: 24% D2: Final demonstration:2%D3: Final presentation: 3%Participation: 4%
Quizzes 20%
Workshop 20% (2% participation included)
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-30
Grading | Participation Scores
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-31
10% of these scores will be team peer-reviewed participation score
Category Percentage of Final GradeProgramming Assignments 20%
Assignment 1: 10% (1% participation included)Assignment 2: 10% (1% participation included)
Term Project 40%D0: Term project planning: 1%D1: Proposal document: 4%D1: Proposal presentation: 2%D2: Final paper: 24% D2: Final demonstration:2%D3: Final presentation: 3%Participation: 4%
Quizzes 20%
Workshop 20% (2% participation included)
Grading | Final Letter Grade
Letter Grade Total PercentageA 90.00 % and higher
B 80.00 ~ 89.99 %
C 70.00 ~ 79.99 %
D 60.00 ~ 69.99 %
F Below 60.00 %
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-32
Course Component | Course Policy• No make-up for missed quizzes and exams
• Except for the case where student provided an advance written notice to the instructor based on an emergency• Supporting paper works will be requested
• Two lowest quiz scores will be eliminated at the end of semester
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-33
Course Component | Course Policy• No Cell-phones in the class.
• No Laptops in the class.
• If you need to use a laptop during lectures, please sit in the back row.
• I will ask you to turn off your laptop if needed.
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-34
More importantly,• Attend the class, ask questions, and discuss• Check the course web page and canvas regularly • Try new technologies and apply them• Share your experiences with other students in class
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-35
CS535 Big Data 1/23/2019 Week 1-A Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2019 Colorado State University 7
Questions?
1/23/2019 CS535 Big Data - Spring 2019 Computer Science Department, Colorado State University Week 1-A-36