![Page 1: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/1.jpg)
Distributed Systems CS6421
Intro to Distributed Systems and the Cloud
Prof. Tim Wood
![Page 2: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/2.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Tim Woodv
!2
I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool things
![Page 3: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/3.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Lots of computers!
!3
0
100,000,000
200,000,000
300,000,000
400,000,000
1993 1996 1999 2002 2005
Computers Web Servers
![Page 4: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/4.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Lots of problems!How to interconnect them? How to write programs for them? How to secure them? How to manage them? How to get good performance out of them? How to efficiently use resources?
!4
![Page 5: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/5.jpg)
Tim Wood - The George Washington University - Department of Computer Science
This CourseCSci 6421 Distributed Systems Prof. Tim Wood - [email protected]
Prerequisites: - CSci 6212 Algorithms (or undergrad algorithms course) - an undergraduate operating systems course - strong programming abilities, comfort with multiple languages
!5
https://gwdistsys18.github.io/
![Page 6: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/6.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Be ready to codeWe will be writing code both in class and for your homework assignments
- This course will be a lot of work! - You only learn by doing!
You should be comfortable in at least one of these programming languages:
- C, Python, or Java - and you should be ready and eager to learn the others! - and you must understand the basics of the Linux command line
!6
![Page 7: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/7.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Be ready to participateThis class will be active and interactive
- I will not be lecturing for very long each period
You must participate in class - Ask questions, answer questions - Do the in-class exercises - Even if you don’t speak english well, that’s fine. I will be patient!
You may not use your laptop except for times when I tell you to
- If a slide has a blue bar at the bottom, you may not use it - If a slide has a green bar, you may use your laptop
!7
![Page 8: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/8.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Semester OutlineTopic 1 Intro to Distributed Systems and the Cloud Topic 2 Network Programming (Sockets) Topic 3 Servers (EC2, VMs, Containers) Topic 4 Storage (S3, EBS) Topic 5 Networks (SDN, NFV) Topic 6 Scalable Web Services (nginx, memcache, RDS) Topic 9 Edge and Serverless Computing (AWS Lambda) Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10Fault Tolerance, Ordering, and Consistency (DynamoDB)
(Not necessarily in this order)
!8
![Page 9: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/9.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Course OutcomesYou will learn to:
Use important technologies common in today’s distributed systems
- Hadoop, Amazon’s cloud, etc.
Design coordination protocols that are efficient and correct despite network delays and failures Implement and evaluate the performance of these systems
!9
![Page 10: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/10.jpg)
What is a cloud? <spoiler alert>
It’s not in the sky it’s not made of water droplets
</spoiler alert>
![Page 11: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/11.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Some big buildings...
!11
Microsoft’s Dublin data center
![Page 12: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/12.jpg)
Tim Wood - The George Washington University - Department of Computer Science 5
...and servers...Giant warehouses
- The size of 10 football fields - 10s of thousands of servers - Petabytes of storage
![Page 13: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/13.jpg)
Tim Wood - The George Washington University - Department of Computer Science
...interconnected...
!13
![Page 14: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/14.jpg)
Tim Wood - The George Washington University - Department of Computer Science
...around the world...Undersea Cables
- Connect all continents except Antarctica
- First deployed in 1850s
!14http://submarine-cable-map-2013.telegeography.com/
http://www.cyprusupdates.com
![Page 15: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/15.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Comcast down after hunter shoots cable (2008)
...that break a lot.
!15
Truck crash shuts down Amazon data center (May 2010)
Anchor hits underwater Internet cable (Feb 2012)
Lightning causes Amazon outages (2009 and 2011)
![Page 16: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/16.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Or if you’re really unlucky...
!16
vs
![Page 17: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/17.jpg)
Why do we need all of this physical
infrastructure?
![Page 18: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/18.jpg)
Tim Wood - The George Washington University - Department of Computer Science
What is this???
!18
![Page 19: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/19.jpg)
Tim Wood - The George Washington University - Department of Computer Science
EncyclopediasEncyclopedia Britannica
- 40,000+ articles - 32 hard bound volumes (32,640 pages)
Microsoft Encarta - 60,000+ articles - 1 CD-ROM (700 MB)
Wikipedia - 5,512,202 articles (in English) - More than 5 TB of text (about 7,500 CDs)
!19
![Page 20: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/20.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Mega whats?
!20
Mega Million 1024 x 1024 = ~1,000,000
Giga Billion 1024 x 1024 X 1024 = ~1,000,000,000
Tera Trillion1024 x 1024 x 1024 x 1024=
~1, 000,000,000,000
700MB vs 5TB
200 songs vs 1.4 million songs
![Page 21: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/21.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Encyclopedias
!21
http://en.wikipedia.org/wiki/Wikipedia:Size_in_volumes
Wikipedia... in print - 1,763 volumes - (no, this does not exist)
Now grown to 2,696 volumes!
![Page 22: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/22.jpg)
Tim Wood - The George Washington University - Department of Computer Science
0.01% of Wikipedia
!22
![Page 23: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/23.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Big Data in Perspective
!23
Facebook - ???
Wikipedia - 5TB of text
![Page 24: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/24.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Big Data in Perspective
!24
Wikipedia - 5TB of text
Facebook - 20TB of photos added each week
![Page 25: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/25.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Big Data in Perspective
!25
Facebook - 1,000TB of photos added per year
![Page 26: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/26.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Big Data in Perspective
!26
Google - 20,000TB of data processed per day
4,000 wikipedias
Facebook - 1,000TB of photos added per year
![Page 27: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/27.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Big Data in Perspective
!27
Google - 20,000TB of data processed per day - in 2008
Facebook - 1,000TB of photos added per year
4,000 wikipedias
![Page 28: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/28.jpg)
How can google process so much information so
quickly?
![Page 29: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/29.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Processing Data Quickly
!29
1. 3 + 6 = ? 2. 10 - 2 = ? 3. 5*4 - 10 = ? 4. -1 + 3 = ? 5. 5*9 = ? 6. 8*8 + 3 = ? 7. 56 + 2 = ? 8. 1 + 4*2 = ? 9. 6*4 + 1 = ? 10. 3 * 2 * 3 = ?
Buy a faster computer
Buy another computer
![Page 30: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/30.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Processing Data in PARALLEL
!30
1. 5 + 2 = ? 2. 11 - 2 = ? 3. 14 - 3 = ? 4. -1 + 3 = ? 5. 5*9 = ? 6. 8*8 + 3 = ? 7. 56 + 2 = ? 8. 1 + 4*2 = ? 9. 6*4 + 1 = ? 10. 3 * 2 * 3 = ? 11. 1 + 2 + 5 = ? 12. 6 - 2 = ?
![Page 31: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/31.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Let's try it at scale
I have lots of questions I need answered... help me out!
18 questions per page, 40 pages... how long should it take?
!31
![Page 32: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/32.jpg)
Tim Wood - The George Washington University - Department of Computer Science
How could we do better?What problems did we hit?
How could we optimize the process?
What things can't we prevent?
!32
![Page 33: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/33.jpg)
How do we solve problems at large scale?
![Page 34: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/34.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Distributed Systems
Definition?
Examples?
!34
• ???
![Page 35: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/35.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Distributed Systems“A distributed system is one in which components located at networked computers communicate and coordinate their actions only by passing messages.”
Does this cover all of our examples? Is it too broad? Too narrow?
!35
![Page 36: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/36.jpg)
Tim Wood - The George Washington University - Department of Computer Science
How to count words?
!36
is 54621
and 34213
are 30931
cat 20021
How would you count the frequency of each word in a large collection of text documents?
- 100M documents are stored in a distributed storage system - You have 50 worker nodes that store and process data
Input: lots of text documents Desired output:
![Page 37: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/37.jpg)
Tim Wood - The George Washington University - Department of Computer Science
How should we do this at scale?
!37
![Page 38: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/38.jpg)
Tim Wood - The George Washington University - Department of Computer Science
A Real Distributed SystemLet’s consider Map Reduce / Hadoop
Data analytic platform originally proposed by Google
Hadoop is an open source version developed by Yahoo and others
Goal: Make it easy to use a cluster for large scale data processing tasks
- Parsing through web documents to determine content - Analyzing genetic data to identify diseases
!38
![Page 39: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/39.jpg)
Tim Wood - The George Washington University - Department of Computer Science
It starts with data
!39
DataNode1
1) HDFS provides an abstract interface to a storage cluster
Data is replicated 3 times, and accesses are sent to nearby replicas
DataNode2
DataNode3
NameNode
I want data block 3,457
Check Node 2!
DataNode4
![Page 40: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/40.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Hadoop programs
!40
2) A user writes a program that takes some input data and processes it in stages:
Map the input data to a set of keys and values
Shuffle the data so all identical output keys are processed by the same server
Reduce all output keys to a final result
(more details later)
![Page 41: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/41.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Job scheduling
!41
3) User submits their program as a job to the JobTracker
Job Tracker
4) The Job Tracker splits job into small tasks and distributes them across Task Trackers
Task Tracker
Task Tracker
Task Tracker
Task Tracker
Prefers to place a task next to the data node with its data
DataNode
DataNode
DataNode
Name Node
![Page 42: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/42.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Failure handling
!42
Task Tracker
Task Tracker
Task Tracker
Task Tracker
Job Tracker5) If a task fails or is too slow, the Job Tracker restarts it on another task tracker.
Task Tracker
Failures and stragglers are masked
![Page 43: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/43.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Hadoop Components
!43
One server may have multiple roles
1 JobTracker - Centralized controller - Assigns tasks to workers, monitors for failures
Lots of TaskTrackers - Worker nodes - Execute code and return results to Job Tracker
1 NameNode - Centralized storage controller - Determines where data is placed
Lots of DataNodes - Storage nodes - Hold data
![Page 44: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/44.jpg)
Tim Wood - The George Washington University - Department of Computer Science
How to count words?
!44
is 54621
and 34213
are 30931
cat 20021
How would you count the frequency of each word in a large collection of text documents?
- 100M documents are stored in a distributed storage system - You have 50 worker nodes that store and process data
Input: lots of text documents Desired output:
![Page 45: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/45.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Map Reduce Flow
!45
input files
Input items(Hello, my name is…)(What is your name..)(What is my favorite…)…
Sortingand
Shuffling
hello, [1]my, [1]name, [1,1]is, [1,1,1]what, [1,1]your, [1]
Key, list of values
func()
ReduceOutputname, 2is, 3what, 2…
Map
func()
Key-Value listhello, 1my, 1name, 1is, 1what, 1is, 1your, 1
![Page 46: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/46.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Map & Reduce
!46
Map Phase - input: data element - Convert input data into an intermediate result - All map functions can be called independently in parallel - output: {list of keys and values}
Shuffle and partition - Sort all outputs by key and combine values into a list - Partition keys to create new Reduce tasks - output: Key, {list of values}
Reduce Phase - input: Key, {list of values} - Combine the list of values for each key to produce an output
![Page 47: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/47.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Map/Reduce HadoopWhat distributed systems concepts did we see?
!47
![Page 48: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/48.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Map/Reduce HadoopWhat distributed systems concepts did we see?
Replication for fault tolerance and scalability Data partitioning Scheduling Automated resource management Abstraction layers
!48
![Page 49: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/49.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Common ProblemsWhat problems arise in the Hadoop design?
- Either problems that it solves or fails to solve
!49
• ???
![Page 50: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/50.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Common ProblemsCommunication protocols Resource management Security Performance Fault tolerance Scalability
!50
Which are Systems challenges and which are Algorithmic challenges?
![Page 51: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/51.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Systems and AlgorithmsThis course is about BOTH
Systems: - How to implement efficient distributed applications and the
mechanisms needed for them to be scalable, secure, and reliable
Algorithms: - How to design intelligent distributed applications and the
policies needed for them to be scalable, secure, and reliable
!51
![Page 52: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/52.jpg)
Distributed System Architectures and
Principles
![Page 53: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/53.jpg)
Tim Wood - The George Washington University - Department of Computer Science
The Cloud
!53
![Page 54: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/54.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Bit Torrent
!54
![Page 55: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/55.jpg)
Tim Wood - The George Washington University - Department of Computer Science
High Performance Computing
!55
![Page 56: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/56.jpg)
Tim Wood - The George Washington University - Department of Computer Science
ChallengesHeterogeneity Openness Security Failure Handling Concurrency Quality of Service Scalability Transparency
!56
![Page 57: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/57.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Scalability
What are some principles that help with scalability?
What are some mechanisms that help with scalability?
!57
![Page 58: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/58.jpg)
Tim Wood - The George Washington University - Department of Computer Science
ScalabilityPrinciples:
- No single machine has all state - Make decisions based on local or partial information - No single point of failure - No global clock
Techniques: - Asynchronous communication - Distribution / Partitioning - Caching - Replication
!58
Are these just for performance?
![Page 59: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/59.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Transparency
What is transparency? Why is it useful?
How do some distributed systems provide transparency?
!59
![Page 60: Distributed Systems CS6421 - gwdistsys18.github.io · Topic 7 Batched Big Data (Hadoop) Topic 8 Streaming Big Data (Storm) Topic 10 Fault Tolerance, Ordering, and Consistency (DynamoDB)](https://reader030.vdocuments.mx/reader030/viewer/2022040711/5e13e3999c647e4792063c12/html5/thumbnails/60.jpg)
Tim Wood - The George Washington University - Department of Computer Science
Transparency ExamplesTransparency is provided by abstractions that hide the complexity of distributed systems
Dropbox: transparently replicates your files across several computers
MapReduce / Hadoop: abstracts away the complexity of assigning processing tasks to heterogeneous servers
Autonomous middleware: automatically allocates resources to applications / VMs in need
!60