hpcc systems engineering summit presentation - improving thor data loading using parallel...
TRANSCRIPT
![Page 1: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/1.jpg)
Improving Thor Data Loading using
Parallel Format-agnostic Direct Spraying
Mohammad Rashti – RNET Technologies
1
![Page 2: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/2.jpg)
Outline Data Loading in HPCC Systems Potential Optimizations
iNFORMER - The Big Picture of the SBIR project iNFORMER Modules
The HPCC on AWS Case – S3 Parallel Data Access Feasibility study and experiments
Ongoing and Future Work
2
![Page 3: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/3.jpg)
Data Loading in HPCC Systems
Potential Optimizations
3
![Page 4: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/4.jpg)
Data Loading in HPCC Systems
4
In Thor, the data needs to be transferred into a single drop zone (Landing Zone)
Landing zone node sprays the data over the DFS on the Thor nodes
Imposes overhead on data loading Landing zone and spraying can become a
bottleneck Especially for large data sets or partial loads
![Page 5: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/5.jpg)
Optimizations for Spraying Process Eliminate the landing zone bottleneck Read data records directly from their original data format
in their original place
Use distributed data handling Use multiple Thor nodes to load (and potentially
decompress) the data Each Thor node can read its own data Load the data directly into destination files on DFS
Optimize partial data loads Add support for in-memory processing Skip trips to disk for iterative computing
5
![Page 6: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/6.jpg)
iNFORMER
The Big Picture of the SBIR Project
6
![Page 7: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/7.jpg)
SBIR Project This work is part of an SBIR project from the
Department of Energy Carried on in two phases Phase I (9 months)
Feasibility study of the ideas and commercial potential Design of the prototype
Phase II (2 years) Prototype implementation Commercialization of the product Competitive renewal proposal submission
The Phase I is currently in progress (ending in Dec.) Phase II may start in 2Q 2015
7
![Page 8: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/8.jpg)
iNFORMER’s General Architecture iNFORMER is the ultimate product developed in this
project It is designed to provide: Efficient batch and in-situ data analytics Low-overhead native-format data access through a unified API
iNFORMER components can be integrated into HPCC Systems
Next slide shows a big-picture diagram of: iNFORMER components (the dark boxes) Where it can fit in HPCC systems.
8
![Page 9: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/9.jpg)
iNFORMER in HPCC – The Big Picture
DFS Any FS Various data stores
Thor Cluster iNFORMER MATE
data transformation
Access API
iNFORMER Virtual Data Integrator
ECL Program
Big Picture view of iNFORMER components and potential integration with HPCC
new module
existing link new link
Landing Zone
9
![Page 10: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/10.jpg)
iNFORMER MATE MATE is meant to replace MapReduce Using the concept of Reduction Object Replace map and reduce with generalized
reduction Avoid intermediate shuffle/sort of key/value
pairs Main benefits of the Reduction Object Not tied to the limitations of the key/value model More efficient data processing with less data
movement Lower memory requirements Performance improvement: 1.50 to 20 fold
10
![Page 11: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/11.jpg)
Virtual Data Integrator (VDI) VDI offers a unified API for native-format data access Access the data in their original storage format
Improves data access experience over various file-systems/data formats Parallel File Systems: Thor DFS, HDFS, PVFS, GPFS, etc. Data Formats: Plain Text, XML, NetCDF, HDF5, etc.
Eliminates the need to transform the entire data into a platform-specific format (e.g., DFS) Especially for in-memory processing
VDI has two parts: Data Storage Adapter (DSA) Virtual Storage API (VSA)
11
![Page 12: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/12.jpg)
iNFORMER Virtual Data Integrator (VDI) with HPCC
Data Storage Adapter (DSA)
Modules
data records
configure with FS/format specifics
Virtual Storage
API (VSA) FS/format customized
modules
Fixed API
HPCC Systems’ Thor
Any FS (e.g., S3)
HPCC DFS
data copy
iNOFORMER VDI and potential integration with HPCC Systems
new data flow existing data flow
Drop (Landing)
Zone
data transformation
12
![Page 13: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/13.jpg)
Data Storage Adapter (DSA)
A set of adapters that interface with various storage formats Access the respective data records in their original format
Consolidates data integration process Allows for partial data load/access Eliminates the need for excessive data transformation Data records on the distributed FS will be extracted and
forwarded to the analysis engine and back.
13
![Page 14: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/14.jpg)
Some Internals of the DSA Component
Internal Architecture of the DSA File Format Adapter 14
![Page 15: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/15.jpg)
Virtual Storage API (VSA) VSA is a unified FS/format agnostic interface: The data access calls are not bound to a format The storage format specifics will guide the VSA implementation
to choose the right DSA module
VSA will have well-defined hooks for easy implementation of modules for new formats
A set of pre-implemented FS/format adapters will be developed.
15
![Page 16: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/16.jpg)
The HPCC on AWS Case
S3 Parallel Data Access
16
![Page 17: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/17.jpg)
Phase I - HPCC Feasibility Study
17
Efficient loading of AWS S3-resident compressed files into Thor
Data loading / spraying optimizations in this project Distributed data load / decompress / read from S3 into the
memory of Thor nodes Distributed data Spray to DFS
Each node should load its own portion of data to avoid shuffling (communication) Use a zip index file to facilitate this
Study the potential for in-memory processing Avoid multiple disk accesses
![Page 18: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/18.jpg)
S3 to EC2 Distributed Data Load
18
Two potential plans for the proposed distributed S3 loading module
Test data: Elsevier sample data sets hosted on S3 XML-based records Multi-entry zip files
Amazon S3
Thor @EC2
S3 Parallel Load / Decompress
Kafka Producer
Kafka to DFS
Spray
Memory to DFS Spray
1
2
Landing Zone
Decompress & Spray
S3 Load
Current approach
![Page 19: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/19.jpg)
Development Steps
19
Step 1: Distributed Spraying (currently implemented) Parallel loading, decompression and spraying of XML files from an S3
archive Each process extracts an XML file data directly from the archive
One process per node Load balancing applied among the processes Message Passing Interface (MPI) used for inter-node communication
and shuffling Step II: Skipping the Shuffling and the Landing Zone (next step) Each process reads it own relevant portion of the file directly from the
archive and decompresses on the fly. Allows for partial data load from S3 May need full compression index for the archives
A separate module to be developed to do this
![Page 20: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/20.jpg)
Step I – Distributed Spraying
20
Extract XML file Decompress Spray
(shuffle) DFS
Processing on four (4)
EC2-based Thor Nodes
Multi-file Zip
archive
Amazon S3
![Page 21: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/21.jpg)
Step I Results
21 Comparing Sequential and Parallel Load/Spray from S3 into Thor
![Page 22: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/22.jpg)
Ongoing and Future Work
22
![Page 23: HPCC Systems Engineering Summit Presentation - Improving Thor Data Loading using Parallel Format-agnostic Direct Spraying](https://reader034.vdocuments.mx/reader034/viewer/2022052700/55a22d401a28abe8348b458a/html5/thumbnails/23.jpg)
Future Development Steps Next immediate steps (SBIR Phase I) Complete spraying into Thor DFS Evaluation using more data sets Fully eliminate the scattering process by partial file reading Design the zip indexing module Full VDI module prototype design
Future (SBIR Phase II) Plans S3 adapter module as part of the VDI Expansion of the adapter module to support general non-S3
cases Full integration with HPCC to bypass the landing zone Support for in-memory spraying and data processing 23