data processing paradigms for big datacs535/slides/week2-a-2.pdf · 2021. 1. 18. · cs535 spring...

17
CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 1 CS535 BIG DATA PART A. BIG DATA TECHNOLOGY 2. DATA PROCESSING PARADIGMS FOR BIG DATA Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535 FAQs Slides are available on the course web Canvas Discussion Board is available: Find your teammates! PA1 Hadoop and Spark installation guides are posted Questions/need helps? Send an email to [email protected] or post your question on Piazza! CS535 Big Data | Computer Science | Colorado State University

Upload: others

Post on 24-Apr-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 1

CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY2. DATA PROCESSING PARADIGMS FOR BIG DATA

Sangmi Lee PallickaraComputer Science, Colorado State Universityhttp://www.cs.colostate.edu/~cs535

FAQs• Slides are available on the course web

• Canvas Discussion Board is available: Find your teammates!

• PA1• Hadoop and Spark installation guides are posted• Questions/need helps? Send an email to [email protected] or post your question on Piazza!

CS535 Big Data | Computer Science | Colorado State University

Page 2: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 2

Overview of Part A• Duration: Week 1 ~ Week 4

1. Introduction to Big Data (W1)2. Data Process Paradigms for Big Data (W2)3. Distributed Computing Models for Scalable Batch Computing

Part 1. MapReduce (W2)Part 2. In-Memory Cluster Computing Model: Apache Spark (W3, W4)

4. Real-time Streaming Computing Models (W4)Apache Storm and Twitter Heron

CS535 Big Data | Computer Science | Colorado State University

2. Data Processing Paradigms For Big dataLambda Architecture

CS535 Big Data | Computer Science | Colorado State University

Page 3: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 3

This material is built based on• Nathan Marz and James Warren, “Big Data, Principles and Best Practices of Scalable

Real-Time Data System”, 2015, Manning Publications, ISBN 9781617290343

CS535 Big Data | Computer Science | Colorado State University

Why do we need Big Data Technologies?• To perform large-scale analytics over voluminous data, we need a high-level architecture that provides,

• Robustness

• Fault-tolerant: Both against hardware failures and human mistakes

• Support for a wide range of workloads and use cases

• Low-latency reads and updates

• Batch analytics jobs

• Scalability

• Scale-out capabilities with minimal maintenance

CS535 Big Data | Computer Science | Colorado State University

Page 4: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 4

Typical problems for scaling traditional databases

• Suppose that the application should track the number of page views for any URL a customer wishes to track• The customer’s web page pings the application’s web server with its URL every time a page view is

received• Application tells you top 100 URLs by number of page views

Customer-ID (varchar(255))

url (varchar(255)) Pageviews(bigint)

Coloradoan https://www.coloradoan.com/life 13,483,401Coloradoan httpss://www.coloradoan.com/story/life/2020/01/2

3/hoffman-roast-chicken382

Coloradoan https://www.coloradoan.com/story/sports/csu/football/2020/01/27/temple-football-quarterback-todd-centeio-coming-colorado-state-graduate-transfer

2547

CS535 Big Data | Computer Science | Colorado State University

Direct access• Direct access from Web server to the backend DB cannot handle the large amount

of frequent write requests • Timeout errors Web

server DB

CS535 Big Data | Computer Science | Colorado State University

Customer-ID (varchar(255))

url (varchar(255)) Pageviews(bigint)

Coloradoan https://www.coloradoan.com/life 13,483,401Coloradoan httpss://www.coloradoan.com/story/life/2020/01/2

3/hoffman-roast-chicken382

Coloradoan https://www.coloradoan.com/story/sports/csu/football/2020/01/27/temple-football-quarterback-todd-centeio-coming-colorado-state-graduate-transfer

2547

Page 5: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 5

Scaling with a queue

• Batch many increments in a single request

• What if your data amount increases even more?• Your worker cannot keep up with the writes

• What if you add more workers?• Again, the Database will be overloaded

Queue Worker

Web server DB

Pageview

100 at a time

CS535 Big Data | Computer Science | Colorado State University

Scaling by sharding the database

• Horizontal partitioning or sharding of database• Uses multiple database servers and spreads the table across all the servers• Chooses the shard for each key by taking the hash of the key modded by the number of shards

• What if your current number of shards cannot handle your data?• Your mapping script should cope with new set of shards • Application and data should be re-organized

CS535 Big Data | Computer Science | Colorado State University

Page 6: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 6

Other issues• Fault-tolerance issues

• What if one of the database machines is down?• A portion of the data is unavailable

• Corruption issues• What if your worker code accidentally generated a bug and stored the wrong number for some of the

data portions

CS535 Big Data | Computer Science | Colorado State University

Reactive Solution

How will Big Data techniques help?• The databases and computation systems used in Big Data applications are aware of

their distributed nature• Sharding and replications will be considered as a fundamental component in the design of Big Data

systems

• E.g. Data is dealt as immutable• Users will mutate data continuously, however

• The raw pageview information is not modified

• Applications will be designed in different ways

CS535 Big Data | Computer Science | Colorado State University

Page 7: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 7

Lambda Architecture• Big Data data processing architecture as a series of layers

1. Batch layer

2. Serving layer

3. Speed layer

Speed layer

Serving layer

Batch layer

CS535 Big Data | Computer Science | Colorado State University

1. Batch layer• Batch layer

• Precomputes results using distributed processing system• The component that performs the batch view processing • batch view= function(e.g. Sum of values)• batch view= function(e.g. Training a Predictive Model)

• After the computationàStores an immutable, constantly growing master dataset• E.g. values, model, or distribution

• Computes arbitrary functions on that dataset• Batch-processing systems

• e.g. Hadoop, Spark, TensorFlow

CS535 Big Data | Computer Science | Colorado State University

Page 8: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 8

Generating batch views

Data Batch layer

Batch view

Batch view

Batch view

Batch layer is often a high-latency operation

CS535 Big Data | Computer Science | Colorado State University

Generating batch views: E.g. PageRank

Data Batch layer

Batch view

Batch view

Batch viewCrawling the Web

every day

CS535 Big Data | Computer Science | Colorado State University

Generate the PageRank values every 24 hours

PageRank values of the web on 01/25/2020

PageRank values of the web on 01/27/2020PageRank values of the

web on 01/26/2020

Page 9: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 9

2. Serving layer• The batch layer emits batch view as the result of its functions

• These views should be loaded somewhere and queried

• Specialized distributed database that loads in a batch view and makes it possible to do random reads on it

• Batch update and random reads should be supported• e.g. BigQuery, ElephantDB, Dynamo, MongoDB, Cassandra

CS535 Big Data | Computer Science | Colorado State University

3. Speed layer

• Q: Is there any data not represented in the batch view?

• Data arrives while the precomputation is running

• With fully real-time data system

• Speed layer looks only at recent data

• Whereas the batch layer looks at all the data (except real-time data) at once

• Real-time view= function(real-time view, real-time computing, new data)

CS535 Big Data | Computer Science | Colorado State University

Page 10: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 10

How long should the real time view be maintained?

• Once the data arrives at the serving layer, the corresponding results in the real-time views are no longer needed• You can discard pieces of the real-time views

CS535 Big Data | Computer Science | Colorado State University

Lambda architecture

ProcessStream Increment view

Realtime increment

Speed layer

Batch layerProcessStream Increment view

Batch recompute

New data stream

Serving layer

Query

View 1 View 2 View 3

Batch views

View 1 View 2 View 3

Realtime views

merge

Composing Algorithm

CS535 Big Data | Computer Science | Colorado State University

Page 11: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 11

Relevance of Data

Data absorbed into batch view Data absorbed into real-time view

Time

Now

CS535 Big Data | Computer Science | Colorado State University

Practical Example with Lambda architecture• Web analytics application tracking the number of page views over a range of days

• The speed layer keeps its own separate view of [url, day]

• Updates its views by incrementing the count in the view whenever it receives new data

• The batch layer recomputes its views by counting the page views

• To resolve the query, you query both the batch and real-time views

• With satisfying ranges

• Sum up the results

CS535 Big Data | Computer Science | Colorado State University

Page 12: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 12

Exercise Suppose you are running an analytics system designed based on the Lambda Architecture. The first batch job has started without any dataset and it took 5 minutes to complete. The second batch job is scheduled 5 minutes after the first job has completed. The second batch job is performed with data that arrived in the first 10 minutes.

Question 1. After 7 minutes of running your system, what is the coverage of data stored in the Speed layer?

a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes

Question 2. After 7 minutes of running your system, what is the coverage of data stored in the Batch layer?

a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes

CS535 Big Data | Computer Science | Colorado State University

Exercise—Answers Suppose you are running an analytics system designed based on the Lambda Architecture. The first batch job has started without any dataset and it took 5 minutes to complete. The second batch job is scheduled 5 minutes after the first job has completed. The second batch job is performed with data that arrived in the first 10 minutes.

Question 1. After 7 minutes of running your system, what is the coverage of data stored in the Speed layer?

a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes

Question 2. After 7 minutes of running your system, what is the coverage of data stored in the Batch layer?

a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes

CS535 Big Data | Computer Science | Colorado State University

Page 13: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 13

Composing algorithms• The batch/speed layer will split your data

• The exact algorithm on the batch layer• An approximate algorithm on the speed layer

• The batch layer repeatedly overrides the speed layer• The approximation gets corrected• Eventual accuracy

CS535 Big Data | Computer Science | Colorado State University

Example of Cardinality Estimation• Cardinality estimation

• Count-distinct problem: finding number of distinct elements

• Counting exact unique counts in the batch layer

• A Hyper-LogLog as an approximation in the speed layer

• Batch layer corrects what’s computed in the speed layer• Eventual accuracy

CS535 Big Data | Computer Science | Colorado State University

Page 14: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 14

Recent trends in technology (1/3)• Physical limits of how fast a single CPU can go

• Parallelize computation to scale to more data• Scale-out solution

• Elastic clouds• Infrastructure as a Service (IaaS)• Rent hardware on demand rather than owning your hardware• Increase and decrease the size of your cluster nearly instantaneously• Simplifies system administration

CS535 Big Data | Computer Science | Colorado State University

Recent trends in technology (2/3)• Open source ecosystem for Big Data

• Batch computation systems• Hadoop• Spark• Flink

• Serialization frameworks• Serializes an object into a byte array from any language

• Deserialize that byte array into an object in any language

• Thrift, Protocol Buffers, and Avro

CS535 Big Data | Computer Science | Colorado State University

Page 15: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 15

Recent trends in technology (3/3)• Open source ecosystem for Big Data- cont.

• Random-access NoSQL databases• Sacrifice the full expressiveness of SQL • Specializes in certain kinds of operations• Cassandra, Hbase, MongoDB, etc.

• Messaging/queuing systems• Sends and consumes messages between processes in a fault-tolerant manner• Apache Kafka

• Real-time computation system• High throughput, low latency, stream-processing systems• Apache Storm

CS535 Big Data | Computer Science | Colorado State University

Mapping recent technologies

ProcessStream Increment view

Realtime increment

Speed layer

Batch layer

Serving layer

Query

View 1 View 2 View 3

Batch views

View 1 View 2 View 3

Realtime views

mergeNew data stream

ProcessStream Increment view

Batch recompute

CS535 Big Data | Computer Science | Colorado State University

Page 16: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 16

Mapping course components

ProcessStream Increment view

Realtime increment

Speed layer

Batch layer

Serving layer

Query

View 1 View 2 View 3

Batch views

View 1 View 2 View 3

Realtime views

mergeNew data stream

ProcessStream Increment view

Batch recomputePart 1,

GEAR2,3,4,5

Part 1GEAR

2,5

GEAR 1

PA1

PA2

CS535 Big Data | Computer Science | Colorado State University

Kappa architecture

Serving layer

Query

Job n

Job n+1

Stream processingsystem

ProcessStream

output n

output n+1

Jay Kreps, LinkedInKappa architecture is a simplified version of lambda architectureAppend only immutable log

CS535 Big Data | Computer Science | Colorado State University

Page 17: DATA PROCESSING PARADIGMS FOR BIG DATAcs535/slides/week2-A-2.pdf · 2021. 1. 18. · cs535 Spring 2020 Colorado State University, Page 13 Composing algorithms •The batch/speed layer

CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara

http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 17

Brainstorming Quiz

• Lambda architecture vs. Kappa architecture

Question 1. What is the difference between Lambda architecture and Kappa architecture? Question 2. Use case of Lambda architecture?

Question 3. Use case of Kappa architecture?

CS535 Big Data | Computer Science | Colorado State University

Questions?

CS535 Big Data | Computer Science | Colorado State University