CS535 Big Data 1/27/2020 Week 2-A Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, Page 1
CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY
2. DATA PROCESSING PARADIGMS FOR BIG DATA
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535
FAQs
• Slides are available on the course website
• Canvas Discussion Board is available: Find your teammates!
• PA1
  • Hadoop and Spark installation guides are posted
  • Questions or need help? Send an email to [email protected] or post your question on Piazza!
CS535 Big Data | Computer Science | Colorado State University
Overview of Part A
• Duration: Week 1 ~ Week 4
1. Introduction to Big Data (W1)
2. Data Processing Paradigms for Big Data (W2)
3. Distributed Computing Models for Scalable Batch Computing
   Part 1. MapReduce (W2)
   Part 2. In-Memory Cluster Computing Model: Apache Spark (W3, W4)
4. Real-time Streaming Computing Models (W4)
   Apache Storm and Twitter Heron
2. Data Processing Paradigms for Big Data
Lambda Architecture
This material is built based on
• Nathan Marz and James Warren, “Big Data: Principles and Best Practices of Scalable Real-Time Data Systems”, 2015, Manning Publications, ISBN 9781617290343
Why do we need Big Data Technologies?
• To perform large-scale analytics over voluminous data, we need a high-level architecture that provides:
  • Robustness
    • Fault tolerance: against both hardware failures and human mistakes
  • Support for a wide range of workloads and use cases
    • Low-latency reads and updates
    • Batch analytics jobs
  • Scalability
    • Scale-out capabilities with minimal maintenance
Typical problems for scaling traditional databases
• Suppose that the application should track the number of page views for any URL a customer wishes to track
  • The customer’s web page pings the application’s web server with its URL every time a page view is received
  • The application tells you the top 100 URLs by number of page views

Customer-ID (varchar(255)) | url (varchar(255)) | Pageviews (bigint)
Coloradoan | https://www.coloradoan.com/life | 13,483,401
Coloradoan | https://www.coloradoan.com/story/life/2020/01/23/hoffman-roast-chicken | 382
Coloradoan | https://www.coloradoan.com/story/sports/csu/football/2020/01/27/temple-football-quarterback-todd-centeio-coming-colorado-state-graduate-transfer | 2547
Direct access
• Direct access from the Web server to the backend DB cannot handle the large amount of frequent write requests
  • Timeout errors
[Diagram: Web server → DB]
Scaling with a queue
• Batch many increments in a single request
• What if your data amount increases even more?
  • Your worker cannot keep up with the writes
• What if you add more workers?
  • Again, the database will be overloaded
[Diagram: Web server → (pageview) → Queue → Worker → (100 at a time) → DB]
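The queue-and-worker idea can be sketched roughly as follows. This is a minimal in-memory illustration, not the system from the slides: the `deque` stands in for the queue, the `db` dictionary stands in for the pageview table, and the names `drain_batch` and `apply_to_db` are hypothetical.

```python
from collections import Counter, deque

BATCH_SIZE = 100  # "100 at a time", as in the diagram


def drain_batch(queue, batch_size=BATCH_SIZE):
    """Pop up to batch_size pageview events and merge them per URL."""
    increments = Counter()
    for _ in range(min(batch_size, len(queue))):
        increments[queue.popleft()] += 1
    return increments


def apply_to_db(db, increments):
    """Stand-in for one batched write request against the database."""
    for url, count in increments.items():
        db[url] = db.get(url, 0) + count


queue = deque(["/life"] * 150 + ["/sports"] * 50)  # 200 queued pageviews
db = {}                                            # hypothetical pageview table
requests_sent = 0
while queue:
    apply_to_db(db, drain_batch(queue))
    requests_sent += 1
```

Here 200 pageviews reach the database in 2 requests instead of 200. The worker is still a single consumer, which is why it falls behind once the write rate grows further.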
Scaling by sharding the database
• Horizontal partitioning or sharding of the database
  • Uses multiple database servers and spreads the table across all the servers
  • Chooses the shard for each key by taking the hash of the key modulo the number of shards
• What if your current number of shards cannot handle your data?
  • Your mapping script should cope with the new set of shards
  • Application and data should be re-organized
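To see why adding a shard forces data to be re-organized, here is a small sketch of hash-mod-N shard selection. The key format and the helper `shard_for` are assumptions made for illustration. A stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for(key, num_shards):
    """Pick a shard as hash(key) mod num_shards, using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical keys; any string key works.
keys = [f"https://example.com/page/{i}" for i in range(10_000)]

before = {k: shard_for(k, 4) for k in keys}   # 4 shards
after = {k: shard_for(k, 5) for k in keys}    # one shard added
moved = sum(1 for k in keys if before[k] != after[k])
moved_fraction = moved / len(keys)
```

With a uniform hash, a key keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, so roughly 80% of the keys map to a different shard after going from 4 to 5 shards.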
Other issues
• Fault-tolerance issues
  • What if one of the database machines is down?
    • A portion of the data is unavailable
• Corruption issues
  • What if a bug in your worker code accidentally stored the wrong number for some portions of the data?
Reactive Solution
How will Big Data techniques help?
• The databases and computation systems used in Big Data applications are aware of their distributed nature
  • Sharding and replication are treated as fundamental components in the design of Big Data systems
• E.g. data is treated as immutable
  • Users will mutate data continuously; however,
  • The raw pageview information is not modified
• Applications will be designed in different ways
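One way to picture the benefit of immutable raw data: if only derived counts are ever written, a buggy worker can be fixed and the counts recomputed. The sketch below is illustrative only; the list `raw_pageviews` and both counting functions are hypothetical.

```python
from collections import Counter

# Append-only log of raw pageview facts: (url, timestamp). Never updated in place.
raw_pageviews = []


def record_pageview(url, timestamp):
    raw_pageviews.append((url, timestamp))


def buggy_count(events):
    """A worker with a bug: it counts every pageview twice."""
    counts = Counter()
    for url, _ in events:
        counts[url] += 2
    return counts


def fixed_count(events):
    """The corrected worker: one increment per pageview."""
    counts = Counter()
    for url, _ in events:
        counts[url] += 1
    return counts


for t in range(3):
    record_pageview("/life", t)

view = buggy_count(raw_pageviews)   # wrong derived counts
view = fixed_count(raw_pageviews)   # recomputed from the untouched raw data
```

With a mutable counter in the database, the doubled values would have overwritten the only copy of the truth.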
Lambda Architecture
• Big Data data processing architecture as a series of layers
  1. Batch layer
  2. Serving layer
  3. Speed layer
[Diagram: layer stack, from top to bottom: Speed layer, Serving layer, Batch layer]
1. Batch layer
• Precomputes results using a distributed processing system
  • The component that performs the batch view processing
  • batch view = function (e.g. sum of values)
  • batch view = function (e.g. training a predictive model)
• After the computation → stores an immutable, constantly growing master dataset
  • E.g. values, model, or distribution
• Computes arbitrary functions on that dataset
• Batch-processing systems
  • e.g. Hadoop, Spark, TensorFlow
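As a toy illustration of "batch view = function", the sketch below recomputes a pageview count view from the whole master dataset on every run. The records and the name `compute_batch_view` are made up for the example. A real batch layer would run this on a system such as Hadoop or Spark rather than in a single process.

```python
from collections import Counter

# Hypothetical master dataset: immutable (url, day) pageview records.
master_dataset = [
    ("/life", "2020-01-25"),
    ("/life", "2020-01-25"),
    ("/sports", "2020-01-25"),
    ("/life", "2020-01-26"),
]


def compute_batch_view(all_data):
    """Batch view as a function of all data: pageviews per (url, day)."""
    return Counter(all_data)


batch_view = compute_batch_view(master_dataset)

# New records are only appended; the next run recomputes from scratch.
master_dataset.append(("/sports", "2020-01-26"))
batch_view = compute_batch_view(master_dataset)
```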
Generating batch views
[Diagram: Data → Batch layer → multiple batch views]
Batch layer is often a high-latency operation
Generating batch views: E.g. PageRank
[Diagram: Data (crawling the Web every day) → Batch layer → batch views: PageRank values of the web on 01/25/2020, 01/26/2020, and 01/27/2020]
Generate the PageRank values every 24 hours
2. Serving layer
• The batch layer emits batch views as the result of its functions
  • These views should be loaded somewhere and queried
• Specialized distributed database that loads in a batch view and makes it possible to do random reads on it
  • Batch updates and random reads should be supported
  • e.g. BigQuery, ElephantDB, Dynamo, MongoDB, Cassandra
3. Speed layer
• Q: Is there any data not represented in the batch view?
  • Data that arrives while the precomputation is running
• Needed for a fully real-time data system
• Speed layer looks only at recent data
  • Whereas the batch layer looks at all the data (except real-time data) at once
• Real-time view = function(real-time view, real-time computing, new data)
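The incremental nature of the speed layer can be sketched as a function that takes the current real-time view plus newly arrived data and returns the next view. This is a simplified two-argument reading of the formula above; the function name and the event format (a list of URLs) are assumptions.

```python
def update_realtime_view(realtime_view, new_data):
    """Return the next real-time view from the current one plus new events."""
    updated = dict(realtime_view)          # leave the previous view untouched
    for url in new_data:
        updated[url] = updated.get(url, 0) + 1
    return updated


realtime_view = {}
realtime_view = update_realtime_view(realtime_view, ["/life", "/sports"])
realtime_view = update_realtime_view(realtime_view, ["/life"])
```

Unlike the batch layer, each update touches only the new events, which is what keeps the latency low.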
How long should the real time view be maintained?
• Once the data arrives at the serving layer, the corresponding results in the real-time views are no longer needed
  • You can discard pieces of the real-time views
Lambda architecture
[Diagram: the new data stream feeds both the batch layer (batch recompute) and the speed layer (process stream, real-time increment of views). The serving layer holds the batch views (View 1, View 2, View 3); the speed layer holds the real-time views (View 1, View 2, View 3). A query merges the batch views and the real-time views. Label: Composing Algorithm]
Relevance of Data
[Diagram: timeline ending at "Now". Older data is absorbed into the batch view; the most recent data is absorbed into the real-time view]
Practical Example with Lambda architecture
• Web analytics application tracking the number of page views over a range of days
• The speed layer keeps its own separate view of [url, day]
  • Updates its views by incrementing the count in the view whenever it receives new data
• The batch layer recomputes its views by counting the page views
• To resolve the query, you query both the batch and real-time views
  • For the days in the requested range
  • Sum up the results
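The query step can be sketched as below. The two dictionaries keyed by (url, day) and the function `pageviews` are hypothetical. The sketch also assumes that real-time entries are discarded once the batch layer has absorbed the same data, so no pageview is counted in both views.

```python
from datetime import date, timedelta

# Hypothetical views keyed by (url, day); values are pageview counts.
batch_view = {
    ("/life", date(2020, 1, 25)): 120,
    ("/life", date(2020, 1, 26)): 95,
}
realtime_view = {
    ("/life", date(2020, 1, 27)): 7,   # not yet absorbed by the batch layer
}


def pageviews(url, start, end):
    """Sum counts for url over the days start..end (inclusive) from both views."""
    total = 0
    day = start
    while day <= end:
        total += batch_view.get((url, day), 0)
        total += realtime_view.get((url, day), 0)
        day += timedelta(days=1)
    return total


total = pageviews("/life", date(2020, 1, 25), date(2020, 1, 27))
```

For this data the query returns 120 + 95 + 7 = 222.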
Exercise
Suppose you are running an analytics system designed based on the Lambda Architecture. The first batch job has started without any dataset and it took 5 minutes to complete. The second batch job is scheduled 5 minutes after the first job has completed. The second batch job is performed with data that arrived in the first 10 minutes.
Question 1. After 7 minutes of running your system, what is the coverage of data stored in the Speed layer?
a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes
Question 2. After 7 minutes of running your system, what is the coverage of data stored in the Batch layer?
a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes
Exercise: Answers
Suppose you are running an analytics system designed based on the Lambda Architecture. The first batch job has started without any dataset and it took 5 minutes to complete. The second batch job is scheduled 5 minutes after the first job has completed. The second batch job is performed with data that arrived in the first 10 minutes.
Question 1. After 7 minutes of running your system, what is the coverage of data stored in the Speed layer?
a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes
Question 2. After 7 minutes of running your system, what is the coverage of data stored in the Batch layer?
a. none b. 0 ~ 7 minutes c. 5 ~ 7 minutes d. 0 ~ 5 minutes
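One way to reason about the exercise is to lay the batch jobs on a timeline. The sketch below is one reading of the setup, under two assumptions: a batch job covers only the data that had arrived when it started, and the speed layer holds everything not yet covered by a completed batch job. The tuple layout and the function `coverage` are hypothetical.

```python
# Each batch job: (start_minute, end_minute, covers_data_up_to_minute).
# Values follow the exercise text; covers_data_up_to_minute = 0 means "no data".
batch_jobs = [
    (0, 5, 0),       # first job: started with no dataset, took 5 minutes
    (10, None, 10),  # second job: starts at minute 10; its end time is not given
]


def coverage(now):
    """Return (batch_range, speed_range) as (start, end) tuples, or None if empty."""
    covered_up_to = 0
    for start, end, covers_up_to in batch_jobs:
        if end is not None and end <= now:   # only completed jobs count
            covered_up_to = max(covered_up_to, covers_up_to)
    batch = (0, covered_up_to) if covered_up_to > 0 else None
    speed = (covered_up_to, now) if now > covered_up_to else None
    return batch, speed


batch, speed = coverage(7)
```

Under these assumptions, at minute 7 the batch layer covers nothing (option a for Question 2) and the speed layer covers 0 ~ 7 minutes (option b for Question 1).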
Composing algorithms
• The batch/speed layer will split your data
  • The exact algorithm on the batch layer
  • An approximate algorithm on the speed layer
• The batch layer repeatedly overrides the speed layer
  • The approximation gets corrected
  • Eventual accuracy
Example of Cardinality Estimation
• Cardinality estimation
  • Count-distinct problem: finding the number of distinct elements
• Counting exact unique counts in the batch layer
• HyperLogLog as an approximation in the speed layer
• Batch layer corrects what’s computed in the speed layer
  • Eventual accuracy
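The contrast between the two layers can be sketched with an exact set-based count next to a minimal HyperLogLog. This is a teaching sketch, not a production implementation: it uses 2^10 registers, a SHA-1 based 64-bit hash, and only the small-range correction. The visitor ids are made up.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with m = 2**p registers (assumes p >= 7)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: the first p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


visitors = [f"user-{i % 10_000}" for i in range(30_000)]  # 10,000 distinct ids

exact = len(set(visitors))        # batch layer: exact count, memory grows with data

hll = HyperLogLog()
for v in visitors:
    hll.add(v)
approx = hll.estimate()           # speed layer: approximate count, fixed memory
```

With 1,024 registers the typical relative error is about 3%. The speed layer can serve `approx` immediately, and the next batch run replaces it with `exact`.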
Recent trends in technology (1/3)
• Physical limits of how fast a single CPU can go
  • Parallelize computation to scale to more data
  • Scale-out solution
• Elastic clouds
  • Infrastructure as a Service (IaaS)
  • Rent hardware on demand rather than owning your hardware
  • Increase and decrease the size of your cluster nearly instantaneously
  • Simplifies system administration
Recent trends in technology (2/3)
• Open source ecosystem for Big Data
  • Batch computation systems
    • Hadoop
    • Spark
    • Flink
  • Serialization frameworks
    • Serialize an object into a byte array from any language
    • Deserialize that byte array into an object in any language
    • Thrift, Protocol Buffers, and Avro
Recent trends in technology (3/3)
• Open source ecosystem for Big Data (cont.)
  • Random-access NoSQL databases
    • Sacrifice the full expressiveness of SQL
    • Specialize in certain kinds of operations
    • Cassandra, HBase, MongoDB, etc.
  • Messaging/queuing systems
    • Send and consume messages between processes in a fault-tolerant manner
    • Apache Kafka
  • Real-time computation systems
    • High-throughput, low-latency stream-processing systems
    • Apache Storm
Mapping recent technologies
[Diagram: the Lambda architecture figure (new data stream, batch layer with batch recompute, speed layer with real-time increment, serving layer with batch views and real-time views merged at query time), used to map the technologies above onto the layers]
Mapping course components
[Diagram: the Lambda architecture figure annotated with course components: Part 1, GEAR 2, 3, 4, 5; Part 1, GEAR 2, 5; GEAR 1; PA1; PA2]
Kappa architecture
• Jay Kreps, LinkedIn
• Kappa architecture is a simplified version of lambda architecture
• Append-only immutable log
[Diagram: the stream is processed by a stream processing system running Job n and Job n+1; the jobs write output n and output n+1 to the serving layer, which answers queries]
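The Job n / Job n+1 idea can be sketched as replaying the same append-only log through a new version of the processing code and then switching the serving layer to the new output. Everything here is a toy stand-in: the log is a Python list, and the change in job n+1 (ignoring trailing slashes) is a hypothetical example.

```python
# Append-only immutable log of events (here: URLs of pageviews).
log = ["/life", "/life/", "/sports"]


def count_by(events, key):
    """Count events grouped by key(event)."""
    out = {}
    for event in events:
        k = key(event)
        out[k] = out.get(k, 0) + 1
    return out


# Job n: counts pageviews per raw URL.
output_n = count_by(log, key=lambda url: url)
serving = output_n

# Job n+1: new logic, replayed over the same log from the beginning.
output_n_plus_1 = count_by(log, key=lambda url: url.rstrip("/") or "/")
serving = output_n_plus_1   # queries now read the new output; output n can be dropped
```

There is no separate batch code path: reprocessing is the same streaming job, rerun from the start of the log.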
Brainstorming Quiz
• Lambda architecture vs. Kappa architecture
Question 1. What is the difference between Lambda architecture and Kappa architecture?
Question 2. Use case of Lambda architecture?
Question 3. Use case of Kappa architecture?
Questions?