https://portal.futuregrid.org
Virtual Clusters Supporting MapReduce in the Cloud
Jonathan Klinginsmith
jklingin@indiana.edu
School of Informatics and Computing
Indiana University Bloomington
Let’s Break this Title Down
Virtual Clusters Supporting MapReduce in the Cloud
Let’s Start with MapReduce
• An example to get us warmed up…
Map

    line = "hello world goodbye world"
    words = line.split()
    # ["hello", "world", "goodbye", "world"]
    map_results = list(map(lambda x: (x, 1), words))
    # [('hello', 1), ('world', 1), ('goodbye', 1), ('world', 1)]
Can’t have “MapReduce” without the “Reduce”
Reduce

    from functools import reduce
    from itertools import groupby
    from operator import itemgetter

    # groupby only groups adjacent items, so sort by key first
    map_results = sorted(map_results)
    # [('goodbye', 1), ('hello', 1), ('world', 1), ('world', 1)]
    for word, group in groupby(map_results, itemgetter(0)):
        counts = [count for (_, count) in group]
        total = reduce(lambda x, y: x + y, counts)
        print("{0} {1}".format(word, total))

Output:
goodbye 1
hello 1
world 2
What Did We Just Do?
“hello world goodbye world”
Split: “hello”, “world”, “goodbye”, “world”
Map: ('hello', 1), ('world', 1), ('goodbye', 1), ('world', 1)
Sort: ('goodbye', 1), ('hello', 1), ('world', 1), ('world', 1)
Reduce: ('goodbye', 1), ('hello', 1), ('world', 2)
The “Value” of Knowing the “Key” Pieces*
Map – creates (key, value) pairs: ('hello', 1), ('world', 1), ('goodbye', 1), ('world', 1)
Sort by the key: ('goodbye', 1), ('hello', 1), ('world', 1), ('world', 1)
Reduce operation performed on the values: ('goodbye', 1), ('hello', 1), ('world', 2)
* = Pun intended
In General then…
Split:
Map:
Sort:
Reduce:
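The general pattern can be sketched as one small function. This is a minimal single-machine illustration, not how a MapReduce framework is implemented: `map_reduce` and `word_mapper` are hypothetical names, and the input is assumed to be already split into records.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    """Run already-split records through map, sort, and reduce."""
    # Map: each record yields zero or more (key, value) pairs.
    pairs = [pair for record in records for pair in mapper(record)]
    # Sort: bring pairs with the same key next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce: fold the values of each key into a single result.
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        results.append((key, reduce(reducer, values)))
    return results


def word_mapper(line):
    """Emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


lines = ["hello world", "goodbye world"]
word_counts = map_reduce(lines, word_mapper, lambda x, y: x + y)
print(word_counts)
# [('goodbye', 1), ('hello', 1), ('world', 2)]
```

Only the mapper and the reducer change from problem to problem. For example, `map_reduce(lines, lambda line: [(w[0], len(w)) for w in line.split()], max)` would find the longest word length per first letter.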
Check “MapReduce” off the List
Virtual Clusters Supporting MapReduce in the Cloud
Compute Cluster
• Set of computers
– Proximity
– Networking
– Storage
– Resource Manager
Breaking Down Large Problems
Many compute patterns have emerged; one of them is… Scatter/Gather:
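A rough sketch of scatter/gather on one machine, with threads standing in for cluster nodes. The names `scatter`, `work`, and `scatter_gather` are hypothetical, and summing squares is just a placeholder task.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(items, num_workers):
    """Split items into at most num_workers contiguous chunks."""
    chunk_size = max(1, -(-len(items) // num_workers))  # ceiling division
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def work(chunk):
    """Placeholder per-worker task: sum the squares in one chunk."""
    return sum(x * x for x in chunk)


def scatter_gather(items, num_workers=4):
    # Scatter: hand each chunk to a worker (threads stand in for nodes).
    chunks = scatter(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(work, chunks))
    # Gather: combine the partial results into the final answer.
    return sum(partials)


total = scatter_gather(list(range(10)))
print(total)  # 285
```

The gather step works here because addition can be applied to partial results in any grouping; a task without that property would need a different combine step.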
What if There Is a Lot of Data?
Network Bottleneck?
What about Local Node Storage?
• Distribute the data across the nodes (scatter/split)
• Replicate the data to prevent data loss
• Have the file system keep track of where the chunks (blocks) are stored
• Scheduling resource will schedule jobs to the nodes storing the data
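The four points above can be illustrated with a toy model. This is only a sketch under simple assumptions (round-robin placement, two replicas); `place_blocks` and `schedule` are hypothetical helpers, not the API of any real distributed file system or scheduler.

```python
def place_blocks(num_blocks, nodes, replicas=2):
    """Assign each block to `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    # The returned map is the file system's record of where blocks live.
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


def schedule(block_map, unavailable=()):
    """For each block, pick an available node that already stores it."""
    assignments = {}
    for block, holders in block_map.items():
        candidates = [node for node in holders if node not in unavailable]
        # None means no node holding the block is available.
        assignments[block] = candidates[0] if candidates else None
    return assignments


nodes = ["node-a", "node-b", "node-c"]
block_map = place_blocks(4, nodes)
print(block_map)
# {0: ['node-a', 'node-b'], 1: ['node-b', 'node-c'],
#  2: ['node-c', 'node-a'], 3: ['node-a', 'node-b']}
print(schedule(block_map, unavailable={"node-a"}))
# {0: 'node-b', 1: 'node-b', 2: 'node-c', 3: 'node-b'}
```

With two replicas, every block is still reachable when `node-a` is unavailable, and each job still runs on a node that holds its data locally.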
MapReduce on the Cluster
Data distributed across the nodes (scatter/split) when loaded into the file system
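One way to picture the job running on distributed data is the simulation below. It assumes each node maps only its local lines and that intermediate pairs are routed to reducers by a hash of the key; `map_on_node`, `partition`, and `run_job` are hypothetical names, and a plain dictionary stands in for the cluster.

```python
import zlib
from collections import defaultdict


def map_on_node(local_lines):
    """Map task: runs where the data lives and emits (word, 1) pairs."""
    return [(word, 1) for line in local_lines for word in line.split()]


def partition(key, num_reducers):
    """Route a key to a reducer; CRC32 keeps the routing stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(node_data, num_reducers=2):
    # Shuffle: group every node's map output by the reducer that owns the key.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for local_lines in node_data.values():
        for key, value in map_on_node(local_lines):
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    # Reduce: each reducer sums the values of the keys routed to it.
    totals = {}
    for grouped in reducer_inputs:
        for key, values in grouped.items():
            totals[key] = sum(values)
    return dict(sorted(totals.items()))


node_data = {
    "node-a": ["hello world"],
    "node-b": ["goodbye world"],
}
print(run_job(node_data))
# {'goodbye': 1, 'hello': 1, 'world': 2}
```

Because all pairs with the same key land on the same reducer, the two occurrences of “world” are counted together even though they were mapped on different nodes.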
Check “Clusters” off the List
Virtual Clusters Supporting MapReduce in the Cloud
Virtual…and…the Cloud
Let’s start with Virtual...
• A Virtual Machine (VM)
– A “guest” virtual computer running on a “host” physical computer
• A machine image (MI) is instantiated into a running VM
– MI = snapshot of operating system (OS) and any software
Virtual…and…the Cloud
The Cloud...
• Virtualization + Internet → Introduction of the Cloud
– Scalability
– Elasticity
– Utility computing – not a capital expenditure
• Three levels of service
– Software (SaaS) – e.g., Salesforce.com, Web-based email
– Platform (PaaS) – e.g., Google App Engine
– Infrastructure (IaaS) – e.g., Amazon EC2
Why is the Cloud Interesting?
In Industry
• Scalability – get scale not present in internal data centers
• Elasticity – change scale as capacity demands change
• Utility computing – no capital investment
Example use cases:
• High Performance/Throughput Computing
• On-line game development
• Scalable web development
Why is the Cloud Interesting?
In Academia
• Reproducibility – reuse MIs between researchers
• Educational Opportunities
– Virtual environment → variety of uses and configurations
– Learn about foundational system components
– Collaborate within the same environment
Covered “Virtual” and “the Cloud”
Virtual Clusters Supporting MapReduce in the Cloud
Let’s put it all together...
MapReduce Virtual Clusters in the Cloud
• Create virtual clusters running MapReduce
– Test algorithms
– Test infrastructure and other system attributes
MapReduce Virtual Clusters in the Cloud
• Research Areas
– Bioinformatics – e.g., Genomic Alignments
– Data/Text Mining and Processing
– Large-scale Graph Algorithms
From Virtual Clusters to a Local Sandbox
• Use a local sandbox to cover MapReduce topics