ipdps - april 20, 2010 jie li 1, deb agarwal 2, marty humphrey 1, keith jackson 2, catharine van...
![Page 1: IPDPS - April 20, 2010 Jie Li 1, Deb Agarwal 2, Marty Humphrey 1, Keith Jackson 2, Catharine van Ingen 3, Youngryel Ryu 4 University of Virginia eScience](https://reader036.vdocuments.mx/reader036/viewer/2022062713/56649cf45503460f949c28e4/html5/thumbnails/1.jpg)
eScience in the Cloud: A MODIS Satellite Data Reprojection and
Reduction Pipeline in the Windows Azure Platform
IPDPS - April 20, 2010
Jie Li (1), Deb Agarwal (2), Marty Humphrey (1), Keith Jackson (2), Catharine van Ingen (3), Youngryel Ryu (4)
(1) University of Virginia eScience Group
(2) Lawrence Berkeley National Lab
(3) Microsoft Research
(4) University of California, Berkeley
Outline
Background
AzureMODIS Framework Overview
Dynamic Scalability & Fault Tolerance
Evaluation
Conclusions & Future Work
Data-intensive eScience: Opportunities
Increasing data availability for scientific discovery
◦ Growing data size from large scientific instruments
◦ Emerging large-scale, inexpensive ground-based sensors
Computational models of increasing complexity and precision
[Diagram labels: Raw Data, "?", Scientific Results; open questions: Resources? Apps & Tools?]
MODIS Basics
Moderate Resolution Imaging Spectroradiometer (MODIS) Satellites:
◦ Viewing the entire Earth's surface every 1 to 2 days
◦ Acquiring data in 36 spectral bands
◦ Multiple data products (Atmosphere, Land, Ocean, etc.)
◦ Important for understanding the global environment and Earth system models
http://aqua.nasa.gov/doc/viz/media/aqua_orbit_sm.mpg
Barriers for Using MODIS Data
Data Collection
◦ Multiple FTP sites for MODIS source data
◦ Metadata maintained separately
Data Heterogeneity
◦ Different time granularities and imaging resolutions
◦ Two different projection types: "Swath" and "Sinusoidal"
Data Management
◦ Current use case: 10 years of data covering the US continent
◦ 5 TB of source data (~600,000 files)
◦ 2 TB of timeframe- and space-aligned harmonized data
◦ ~50,000 CPU hours of parallel computation
AzureMODIS: A Client+Cloud Solution
A MODIS data processing framework on the Microsoft Windows Azure cloud computing platform
◦ Leverage the scalability of cloud infrastructure and services
◦ Dynamic, on-demand resource provisioning
◦ Automate data processing tasks to eliminate barriers
◦ A generic Reduction Service to run arbitrary analysis executables
[Diagram labels: MODIS Source Data, Scientific Results, Windows Azure Cloud Computing Platform, AzureMODIS Service Framework]
Windows Azure Platform Basics
Hosted Services
◦ Web Role: Host web applications via an HTTP and/or an HTTPS endpoint
◦ Worker Role: Host user-customized code/applications
Storage Services
◦ Blob Service: Storage for entities in the form of binary bits
◦ Queue Service: A reliable, persistent queue model for message-based communication between instances
◦ Table Service: Structured storage in the form of tables, with simple query support
AzureMODIS Data Processing Service
1. Scientist submits requests for computation on the web portal
2. The request is received and processed by the service monitor
3. Service Workers query the metadata in Azure tables to download source data
4. The specified source data are uploaded to the Azure blob storage
5. The heterogeneous sources are reprojected into uniform format
6. Scientist uploads arbitrary executables to work on the uniform data
7. A single download link to the results is sent back to the scientist
AzureMODIS Data Service Demo
http://modisazure.cloudapp.net/
Behind the scenes…
[Architecture diagram. Components: User Web Portal (Web Role); Job Queue; Service Monitor (Worker Role); ReductionJobStatus Table; ReductionTaskStatus Table; Task Queue; GenericWorker (Worker Role) instances; Sinusoidal Land Source Storage; Reprojected Data Storage; Reduction Result Storage. Edge labels: Job Request, Persist, Parse & Persist, Dispatch, Points to, Download Link to Results]
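The flow through these components can be sketched in a few lines of Python. This is a minimal in-memory sketch, not the AzureMODIS implementation: `queue.Queue` stands in for the Azure Queue service, plain dictionaries stand in for the ReductionJobStatus and ReductionTaskStatus tables, and the function names, the job format, and the result naming are all assumptions made for illustration.

```python
import queue

# Stand-ins for Azure services (assumptions for illustration only).
job_queue = queue.Queue()    # plays the role of the Job Queue
task_queue = queue.Queue()   # plays the role of the Task Queue
job_status = {}              # plays the role of the ReductionJobStatus table
task_status = {}             # plays the role of the ReductionTaskStatus table


def submit_job(job_id, source_files):
    """Web portal (Web Role): enqueue a job request."""
    job_queue.put({"job_id": job_id, "source_files": list(source_files)})


def service_monitor_step():
    """Service Monitor (Worker Role): persist the job, parse it into tasks, dispatch them."""
    job = job_queue.get()
    job_status[job["job_id"]] = "Dispatched"
    for index, name in enumerate(job["source_files"]):
        task_id = f"{job['job_id']}-{index}"
        task_status[task_id] = {"state": "Queued", "source": name}
        task_queue.put(task_id)


def generic_worker_step():
    """GenericWorker (Worker Role): take one task and record where its result lives."""
    task_id = task_queue.get()
    record = task_status[task_id]
    record["state"] = "Completed"
    record["result"] = f"results/{task_id}.out"  # hypothetical result name
    return task_id


submit_job("job1", ["tile_a.hdf", "tile_b.hdf"])
service_monitor_step()
finished = [generic_worker_step() for _ in range(2)]
print(finished)
```

Keeping job and task state in tables rather than in the workers means any worker instance can pick up any task, which is what allows the number of GenericWorker instances to change while a job is running.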
Data Caching
Blob storage level
◦ Each data file (blob) has a globally unique identifier
◦ (Pre-)download and cache all source files in blob storage
◦ (Pre-)compute reprojection results for reuse across computations
Local machine level
◦ Each small-size instance has ~250 GB of local storage
◦ Cache large data files for reuse
Cost-related trade-offs
◦ Data regeneration cost vs. blob storage cost
◦ For our case, data recomputation is too expensive
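The two caching levels can be illustrated with a small lookup chain: local disk first, then blob storage, then regeneration as the expensive last resort. This is a sketch under stated assumptions, not the AzureMODIS code: dictionaries stand in for local disk and Azure blob storage, and the `TwoLevelCache` class, the blob names, and the `regenerate` callback are hypothetical.

```python
class TwoLevelCache:
    """Look up a data file locally first, then in blob storage, then regenerate it."""

    def __init__(self, blob_store, regenerate):
        self.local = {}               # stand-in for an instance's local disk
        self.blob_store = blob_store  # stand-in for Azure blob storage (a dict here)
        self.regenerate = regenerate  # expensive fallback, e.g. reprojection
        self.stats = {"local": 0, "blob": 0, "regenerated": 0}

    def get(self, blob_name):
        if blob_name in self.local:
            self.stats["local"] += 1
            return self.local[blob_name]
        if blob_name in self.blob_store:
            self.stats["blob"] += 1
            data = self.blob_store[blob_name]
        else:
            self.stats["regenerated"] += 1
            data = self.regenerate(blob_name)
            self.blob_store[blob_name] = data  # keep the result for other instances
        self.local[blob_name] = data
        return data


blobs = {"source/tile_a": b"raw-a"}
cache = TwoLevelCache(blobs, regenerate=lambda name: b"reprojected:" + name.encode())
cache.get("source/tile_a")       # served from blob storage, then cached locally
cache.get("source/tile_a")       # served from the local cache
cache.get("reprojected/tile_a")  # not stored anywhere yet, so it is regenerated
print(cache.stats)
```

Writing regenerated data back to blob storage reflects the trade-off stated on the slide: when recomputation is expensive, paying for storage once is cheaper than regenerating the same file for every computation.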
Reduction Service
Scientists upload their analysis binary tools upon request for the reduction service
Benefits
◦ Scientists can easily debug and refine scientific models in their code
◦ Separate system code debugging from science code debugging
A 2nd reduction stage to support more comprehensive computation flows
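One way a worker could run an uploaded tool while keeping science-code failures separate from system failures is to launch it as a child process and record its exit code and output. The sketch below is an assumption about how such a wrapper might look, not the AzureMODIS implementation; `run_reduction` is a hypothetical name, and the Python interpreter stands in for a scientist's binary.

```python
import subprocess
import sys


def run_reduction(command, input_path, timeout_seconds=60):
    """Run a user-supplied executable on one input and capture its outcome."""
    completed = subprocess.run(
        command + [input_path],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout.strip(),
        "stderr": completed.stderr.strip(),
    }


# The Python interpreter stands in for a scientist's uploaded binary here.
fake_tool = [sys.executable, "-c", "import sys; print('reduced ' + sys.argv[1])"]
outcome = run_reduction(fake_tool, "tile_a.hdf")
print(outcome)
```

A non-zero exit code or text on `stderr` points at the uploaded tool, while an exception raised by the wrapper itself (for example `subprocess.TimeoutExpired`) points at the surrounding system.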
Dynamic Scalability
Use the Azure Management API to dynamically scale instances up/down according to workload
Dynamic instance shutdown could be a problem
◦ Azure decides which instance to shut down
◦ Instances may be shut down during task execution
Currently, compute instance usage is charged by the hour
◦ Use CPU hours wisely when applying dynamic scaling strategies
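The interaction between workload and hourly billing can be made concrete with a small decision helper. This is a hedged sketch of one possible policy, not the strategy used by AzureMODIS, and it does not call the Azure Management API. The function names, the throughput figure of 20 tasks per instance-hour, the 150-instance cap, and the 10-minute release window are all assumptions. It also ignores the point above that Azure, not the application, chooses which instance is shut down.

```python
import math


def target_instances(queued_tasks, tasks_per_instance_hour,
                     min_instances=1, max_instances=150):
    """Pick an instance count that can drain the queue in about one billed hour."""
    needed = math.ceil(queued_tasks / tasks_per_instance_hour)
    return max(min_instances, min(max_instances, needed))


def worth_releasing(minutes_into_billed_hour, release_window_minutes=10):
    """With hourly billing, an idle instance is already paid for until the hour
    ends, so only release it close to the end of its current billed hour."""
    return minutes_into_billed_hour % 60 >= 60 - release_window_minutes


wanted = target_instances(1500, tasks_per_instance_hour=20)
print(wanted)
print(worth_releasing(12), worth_releasing(55))
```

Under these assumed numbers, 1500 queued tasks call for 75 instances, and an instance that is 12 minutes into a billed hour is kept because the rest of that hour has already been paid for.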
Performance of dynamic instance scaling
[Figure: Instance Start Up Time (Test Date: March 31, 2010). X-axis: Instances (0 to 96); Y-axis: StartUp Time (Minutes) (0 to 35); series: 1-to-13, 1-to-25, 1-to-50, 1-to-98]
In contrast, the shutdown time for the instances is small (usually within 3 minutes)
Fault Tolerance
Tasks can fail for many reasons
◦ Broken or missing source data files (unrecoverable)
◦ Reduction tool may crash due to a code bug (unrecoverable)
◦ Failures caused by system instability (recoverable)
Customized task retry policies
◦ Tasks with timeout failures are resent to the task queue
◦ Tasks with caught exceptions are resent immediately
◦ A task is canceled after 2 retries (3 executions in total)
Why not just use queue message visibility settings for failure recovery?
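The retry limit can be sketched as a loop that counts executions per task and resends a failed task until the limit is reached. This is a minimal in-memory sketch, assuming a `queue.Queue` in place of the Azure task queue and a hypothetical `flaky_execute` workload; timeout detection is not modeled, only the caught-exception path.

```python
import queue

MAX_EXECUTIONS = 3  # the original run plus 2 retries


def run_with_retries(task_queue, execute):
    """Drain the queue, resending failed tasks until they succeed or are canceled."""
    outcomes = {}
    while not task_queue.empty():
        task = task_queue.get()
        task["executions"] += 1
        try:
            execute(task)
            outcomes[task["id"]] = "Completed"
        except Exception:
            if task["executions"] < MAX_EXECUTIONS:
                task_queue.put(task)  # resend immediately
            else:
                outcomes[task["id"]] = "Canceled"
    return outcomes


def flaky_execute(task):
    # Hypothetical workload: "t1" fails once (transient), "t2" always fails.
    if task["id"] == "t1" and task["executions"] == 1:
        raise RuntimeError("transient failure")
    if task["id"] == "t2":
        raise ValueError("broken source file")


tasks = queue.Queue()
for task_id in ("t0", "t1", "t2"):
    tasks.put({"id": task_id, "executions": 0})

outcomes = run_with_retries(tasks, flaky_execute)
print(outcomes)
```

Capping executions matters because the worker cannot always tell a recoverable failure from an unrecoverable one: a task with a broken source file would otherwise be retried forever.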
Service Monitoring & Diagnosing (Demo)
http://modisazure.cloudapp.net/
Overall Performance & Scalability

Table 2. Capacity of desktop machine and a single Azure instance

| Capacity | Desktop | Azure Instance |
|---|---|---|
| CPU | Intel Core2Duo E6850 @ 3.0 GHz | 1.6 GHz x64 equivalent processor |
| Memory | 4 GB | 2 GB |
| Storage | 1 TB SATA hard disk | 250 GB local storage |
| Network | 1 Gbps Ethernet | 100 Mbps |
| OS | Windows 7 (32-bit) | Windows 2008 Server x64 (64-bit) |

Table 3. Processing time for 1500 reprojection tasks (Unit: hours)

| Configuration | MOD04_L2 | MOD06_L2 | MYD11_L2.005 |
|---|---|---|---|
| 150 instances | 0.30 | 0.85 | 0.44 |
| 100 instances | 0.40 | 1.20 | 0.61 |
| 50 instances | 0.76 | 2.25 | 1.12 |
| Desktop | 16.29 | 72.62 | 33.45 |

Fig. 1 Performance speedups over a single desktop
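The values plotted in Fig. 1 are not in this transcript, but speedups over the desktop can be derived from the Table 3 times as desktop time divided by cloud time. The sketch below does that arithmetic; the `hours` dictionary layout and the `speedup` helper are illustrative names, and the results are computed from the table rather than read from the figure.

```python
# Processing times in hours for 1500 reprojection tasks, copied from Table 3.
hours = {
    "MOD04_L2": {"desktop": 16.29, 50: 0.76, 100: 0.40, 150: 0.30},
    "MOD06_L2": {"desktop": 72.62, 50: 2.25, 100: 1.20, 150: 0.85},
    "MYD11_L2.005": {"desktop": 33.45, 50: 1.12, 100: 0.61, 150: 0.44},
}


def speedup(product, instances):
    """Speedup of an Azure run over the single desktop for one data product."""
    return hours[product]["desktop"] / hours[product][instances]


for product in hours:
    row = ", ".join(
        f"{n} instances: {speedup(product, n):.1f}x" for n in (50, 100, 150)
    )
    print(f"{product}: {row}")
```

For example, MOD04_L2 on 150 instances gives 16.29 / 0.30, a speedup of about 54 over the desktop. That is well below 150, which is consistent with the slower per-instance hardware in Table 2 and with data-transfer overhead.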
Storage Service Scalability
[Figure: bar chart. X-axis: 50 VMs, 100 VMs, 150 VMs; Y-axis: hours (0 to 120); series: Computation, Data Transfer]
Accumulated time for data transfer from/to Azure blob storage increases as the number of VMs increases
Conclusions
Cloud computing provides new capabilities and opportunities for data-intensive eScience research
Dynamic scalability is powerful, but instance start-up overhead is not trivial
Built-in fault tolerance & diagnostic features are important in the face of common failures in large-scale cloud applications and systems
Future Work
Scale up computations from the US continent to the global scale
Develop and evaluate a generic dynamic scaling mechanism with AzureMODIS
Evaluate the similarities and differences between our framework and other generic parallel computing frameworks such as MapReduce
Thank you! & Questions?