sdquery dsi: integrating data management support with a wide area data transfer protocol

SC 2013 SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol Yu Su*, Yi Wang*, Gagan Agrawal*, Rajkumar Kettimuthu # *The Ohio State University # The University of Chicago and Argonne National Laboratory

Page 1: SDQuery DSI:  Integrating Data Management Support with a Wide Area Data Transfer Protocol

SC 2013

SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol

Yu Su*, Yi Wang*, Gagan Agrawal*, Rajkumar Kettimuthu#

*The Ohio State University
#The University of Chicago and Argonne National Laboratory

Motivation

• Science is becoming increasingly data driven
• Strong requirements for efficient data analysis
• “Big Data” challenge:
  – Fast data generation speed
  – Slow disk I/O and network speed
  – Some numbers from the Road-runner EC3 simulation:
    • 4000^3 particles, 36 bytes per particle => 2.3 TB
    • Network bandwidth: GB level or even less
    • Huge difference between simulation output and network capacity
    • The gap will become bigger in the future
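As a quick check of the numbers on this slide, the sketch below recomputes the snapshot size and an idealized transfer time. It assumes the particle count is 4000^3 and, purely for illustration, a sustained 1 Gb/s link with no protocol overhead; the function names are hypothetical.

```python
def snapshot_bytes(particles_per_dim: int, bytes_per_particle: int) -> int:
    """Size of one snapshot for a cubic particle grid."""
    return particles_per_dim ** 3 * bytes_per_particle


def ideal_transfer_seconds(num_bytes: int, bits_per_second: float) -> float:
    """Lower bound on transfer time, ignoring latency and protocol overhead."""
    return num_bytes * 8 / bits_per_second


size = snapshot_bytes(4000, 36)
print(f"{size / 1e12:.1f} TB")                                # 2.3 TB
print(f"{ideal_transfer_seconds(size, 1e9) / 3600:.1f} h")    # about 5.1 h at 1 Gb/s
```

Even under these optimistic assumptions a single snapshot occupies the link for hours, which is the gap the slide points at.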

Wide-Area Data Transfer Protocols

• Efficient data transfers over wide-area networks
• Globus GridFTP:
  – Striped, Streaming, Parallel Data Transfer
  – Reliable and Restartable Data Transfer
• Limitation: volume?
  – The basic data transfer unit is a file (GB or TB level)
  – Strong requirements for transferring data subsets
    • Climate Simulation, Tomography, XPCS
    • An Example
• Goal: Integrate core data management functionality with wide-area data transfer protocols

Challenges

• How should the method be designed to allow easy use and integration with an existing GridFTP installation?

• How can users view a remote file and conveniently specify the subsets of data that are of interest to them?

• How can efficient data retrieval be supported under different subsetting scenarios (index-based retrieval or data block loading + in-memory filter)?

• How can data retrieval be parallelized and benefit from multi-streaming?

Introduction

• GridFTP SDQuery DSI (Scientific Data Query Data Storage Interface)
  – Efficient Data Transfer over Flexible File Subset
  – Dynamic Loading / Unloading
  – HDF5 and NetCDF Data Formats
  – Standard SQL Embedded in Data Download Request
  – Multiple Query Types (Dims, Coordinates, Values)
    • Bitmap Indexing
    • Metadata View of Data File
• Features:
  – Performance Model based Hybrid Data Reading
  – Parallel Streaming Data Reading and Transferring

Background: Globus GridFTP

• Supports efficient data transfer in the Grid community
  – 3500+ servers, 1 PB+ transferred per day
• DSI (Data Storage Interface):
  – Compatible with different file systems or platforms
  – An adapter between GridFTP and the underlying system
• SDQuery DSI:
  – Dynamic loading with small overhead
  – Seamless integration with GridFTP data transfer features (Fault Tolerance, Security, Automatic TCP optimization)

Background: Bitmap Indexing

• Widely used in scientific data management

• Suitable for floating-point values by binning small ranges
• Run-length compression (WAH, BBC)
  – Compresses bitvectors based on runs of consecutive 0s or 1s
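The two ideas on this slide, binning and run-length compression, can be illustrated with a minimal sketch. It builds one bitvector per value bin and then compresses a bitvector with plain run-length encoding. This is not WAH or BBC, which use word-aligned and byte-aligned encodings; the function names, bin edges, and sample values are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def build_binned_bitmaps(values: List[float], edges: List[float]) -> Dict[int, List[int]]:
    """One bitvector per bin; bin i covers the range [edges[i], edges[i + 1])."""
    bitmaps = {i: [0] * len(values) for i in range(len(edges) - 1)}
    for pos, v in enumerate(values):
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                bitmaps[i][pos] = 1
                break
    return bitmaps


def run_length_encode(bits: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal bits into (bit, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs


values = [1.2, 1.7, 6.3, 6.8, 6.1, 2.4]          # made-up sample data
bitmaps = build_binned_bitmaps(values, [0.0, 5.0, 10.0])
print(bitmaps[1])                                 # [0, 0, 1, 1, 1, 0]
print(run_length_encode(bitmaps[1]))              # [(0, 2), (1, 3), (0, 1)]
```

A range predicate such as "value >= 5" then reduces to picking (and OR-ing) the bitvectors of the matching bins, without touching the raw data.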

System Architecture

[Figure: system architecture. Several GridFTP clients connect to one GridFTP server. The server hosts the File DSI (File Receiver, File Reader) and the SDQuery DSI (Request Parser, File Receiver, Index Generation, Schema Management, Query Analysis, Index Operations, Data Reader, File Sender), which work on HDF5/NetCDF datasets together with their indices and schema.]

• Data store request: Receive Data File, Build Multi-level Bitmap Indexing, Generate Metadata View
• Schema request: Query Metadata View
• Data retrieve request: Parse SQL query, Indexing and find all data pos, Read Data based on data pos, Send File

Metadata View

Physical Storage Descriptor
TEMP = /tmp/server/POP.nc
VVEL = /tmp/server/POP.nc
……

Logical Layout Descriptor
Varname = “TEMP”
Data Type: NC_FLOAT
Dims (time, depth, t_lat, t_lon)
Coordinate Values: t_lon………

Value Distribute Descriptor
Min/Max Value: (-21.1, 33.1)

Logical Layout Descriptor
Varname = “VVEL”
Data Type: NC_FLOAT
Dims (time, depth, u_lat, u_lon)
Coordinate Values: u_lon………

Value Distribute Descriptor
Min/Max Value: (-246, 225)

A Use Case

• Translate an analysis requirement into a query:
  – Find the data elements deeper than 50 meters in the ocean where the temperature is higher than 5 degrees Celsius.

Client-side Request Examples

globus-url-copy "ftp://127.0.0.1:5000/tmp/server/POP.nc" file:///tmp/client/netcdfsubset/

globus-url-copy "ftp://127.0.0.1:5000/tmp/server/POP.nc(SELECT TEMP FROM POP.nc WHERE TEMP >=5 AND depth>50)" file:///tmp/client/netcdfsubset/

[Figure: the first request transfers the entire POP.nc; the second transfers only TEMP(Query).nc.]

Less than 5% of the data is transferred!
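The second request appends a SQL query in parentheses to the remote file path. The slides do not show how the DSI's request parser separates the two, so the following is only a minimal sketch of one way to do it; `split_request` is a hypothetical helper, not part of GridFTP or SDQuery DSI.

```python
from typing import Optional, Tuple


def split_request(path: str) -> Tuple[str, Optional[str]]:
    """Split 'file(SQL)' into (file, SQL); a plain path yields (path, None)."""
    path = path.strip()
    open_idx = path.find("(")
    if open_idx == -1 or not path.endswith(")"):
        return path, None                      # ordinary whole-file request
    return path[:open_idx], path[open_idx + 1:-1].strip()


print(split_request("/tmp/server/POP.nc"))
print(split_request(
    "/tmp/server/POP.nc(SELECT TEMP FROM POP.nc WHERE TEMP >=5 AND depth>50)"
))
```

Treating a path without a trailing query as a whole-file request keeps ordinary transfers working unchanged, which matches the goal of easy integration with existing GridFTP installations.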

Performance Model-based Data Subset Retrieval

• Data retrieval process:
  – Query analysis and index operations: fast
  – The amount of data to fetch is known after the index operations
  – Data reader: slow
• Data reading choices:
  – Direct Access: smaller data subsets
    • Directly read data by points or segments from disk
  – Memory Filter: bigger data subsets
    • Load the data blocks into memory and filter
  – Choosing the more efficient method is tricky
    • It depends on the execution environment, data format, and dataset

Performance Model

• Profile and formulate the data reading cost
  – Memory Filter:
  – Direct Access (Points):
  – Direct Access (Segments):
• Offline training based on a random query set
  – Parameters are trained and classified based on subset percentage
  – Apply the formulas to each real query
  – Select the more efficient method for data reading
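The cost formulas themselves are not preserved in this transcript. To illustrate only the selection step, the sketch below assumes a simple linear cost model with three made-up parameters (per-read seek cost, per-element read cost, per-element filter cost). The real model and its trained parameters may look quite different, and every name here is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostParams:
    seek_cost: float     # seconds per separate read request (assumed)
    read_cost: float     # seconds per element read from disk (assumed)
    filter_cost: float   # seconds per element filtered in memory (assumed)


def direct_access_time(p: CostParams, num_segments: int, num_selected: int) -> float:
    """One seek per contiguous segment, plus reading only the selected elements."""
    return num_segments * p.seek_cost + num_selected * p.read_cost


def memory_filter_time(p: CostParams, num_total: int) -> float:
    """One sequential read of the whole block, plus an in-memory filter pass."""
    return p.seek_cost + num_total * (p.read_cost + p.filter_cost)


def choose_method(p: CostParams, num_segments: int, num_selected: int, num_total: int) -> str:
    """Pick the reading method with the lower predicted time."""
    direct = direct_access_time(p, num_segments, num_selected)
    block = memory_filter_time(p, num_total)
    return "direct_access" if direct <= block else "memory_filter"


params = CostParams(seek_cost=1e-4, read_cost=1e-8, filter_cost=1e-9)
print(choose_method(params, num_segments=10, num_selected=1_000, num_total=1_000_000))
print(choose_method(params, num_segments=100_000, num_selected=900_000, num_total=1_000_000))
```

Because the number of selected elements and segments is known once the index operations finish, the prediction can be made before any data is read.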

Parallel Streaming

• Multi-threaded data retrieval and transfer:
  – Data retrievals are performed in parallel
  – Data transfers are performed in parallel to better utilize the bandwidth
  – Data retrievals and data transfers are performed in a pipelined mode
• Bit-1 distribution based data partition:
  – Partition the result bitset based on the number of threads
  – Good load balance for both data retrieval and transfer
  – Small partition cost
    • One pass for both bit segmenting and partitioning
    • Use multiple threads to speed up
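The pipelined mode can be sketched with a reading thread and a sending thread per stream, connected by a bounded queue. This is a toy model under stated assumptions: "reading" and "sending" are stand-ins that just move byte strings around, whereas the real DSI reads from HDF5/NetCDF files and writes to TCP streams. All names are hypothetical.

```python
import queue
import threading
from typing import List

_DONE = None  # sentinel that marks the end of one stream


def read_segments(segments: List[bytes], out_q: "queue.Queue") -> None:
    """Reading thread: stand-in for fetching data segments from disk."""
    for seg in segments:
        out_q.put(seg)        # blocks when the sending queue is full
    out_q.put(_DONE)


def send_segments(in_q: "queue.Queue", sent: List[bytes]) -> None:
    """Sending thread: stand-in for writing segments to a TCP stream."""
    while True:
        seg = in_q.get()      # waits until the reader has produced data
        if seg is _DONE:
            break
        sent.append(seg)


def run_stream(segments: List[bytes]) -> List[bytes]:
    """One stream: reading and sending overlap through a bounded queue."""
    q: "queue.Queue" = queue.Queue(maxsize=4)
    sent: List[bytes] = []
    reader = threading.Thread(target=read_segments, args=(segments, q))
    sender = threading.Thread(target=send_segments, args=(q, sent))
    reader.start()
    sender.start()
    reader.join()
    sender.join()
    return sent


def run_parallel_streams(partitions: List[List[bytes]]) -> List[List[bytes]]:
    """Run one reader/sender pipeline per partition, all in parallel."""
    results: List[List[bytes]] = [[] for _ in partitions]

    def worker(i: int) -> None:
        results[i] = run_stream(partitions[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(partitions))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_parallel_streams([[b"s0", b"s1"], [b"s2"]]))
```

The bounded queue is the design choice that matters: it lets sending start as soon as the first segment is ready while keeping memory use independent of the subset size.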

Parallel Streams Example (2 streams)

[Figure: a 6 x 8 result bitmap divided between two streams.]

0 1 1 1 0 1 0 0
1 1 0 0 0 1 1 0
0 0 1 1 0 1 1 0
0 0 0 1 1 1 0 0
0 1 0 0 0 0 0 0
0 0 0 1 0 0 0 0

• Dim-based partition: subset sizes 12 and 5 (load imbalance)
• Bit1-based partition: subset sizes 8 and 9; one pass generates the segments and the counts
• Each stream has a reading thread (T10, T20) that reads Chunk0, Chunk1, …, Chunkn and places segments S0, S1, S2, …, Sn into its sending queue, and a sending thread (T11, T21) that waits for data and then sends it over its own TCP stream
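The two partitioning strategies in this example can be reproduced with a short sketch. It flattens the 6 x 8 bitmap row by row and compares an equal-length (dim-based) split with a split that balances the number of 1 bits. The function names are hypothetical and the code is an illustration of the idea, not the DSI's implementation.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end) range of bit positions


def dim_partition(bits: List[int], n_parts: int) -> List[Range]:
    """Split into ranges of equal length, ignoring where the 1 bits are."""
    size = len(bits) // n_parts
    return [(i * size, (i + 1) * size if i < n_parts - 1 else len(bits))
            for i in range(n_parts)]


def bit1_partition(bits: List[int], n_parts: int) -> List[Range]:
    """Split so that every range holds a near-equal number of 1 bits."""
    total = sum(bits)
    # The k-th cut goes right after the (k * total // n_parts)-th set bit.
    targets = [k * total // n_parts for k in range(1, n_parts)]
    bounds: List[Range] = []
    start = seen = t = 0
    for pos, bit in enumerate(bits):
        seen += bit
        while t < len(targets) and seen >= targets[t]:
            bounds.append((start, pos + 1))
            start = pos + 1
            t += 1
    bounds.append((start, len(bits)))
    while len(bounds) < n_parts:          # pad with empty ranges if needed
        bounds.append((len(bits), len(bits)))
    return bounds


def subset_sizes(bits: List[int], bounds: List[Range]) -> List[int]:
    """Number of selected elements (1 bits) in each range."""
    return [sum(bits[a:b]) for a, b in bounds]


bitmap = [
    0, 1, 1, 1, 0, 1, 0, 0,
    1, 1, 0, 0, 0, 1, 1, 0,
    0, 0, 1, 1, 0, 1, 1, 0,
    0, 0, 0, 1, 1, 1, 0, 0,
    0, 1, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 1, 0, 0, 0, 0,
]
print(subset_sizes(bitmap, dim_partition(bitmap, 2)))    # [12, 5]
print(subset_sizes(bitmap, bit1_partition(bitmap, 2)))   # [8, 9]
```

The sketch counts the set bits first and then scans once to place the cuts; the slide describes a single pass that generates segments and counts together, which this simplified version does not attempt.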

Experiment Results

• Goals:
  – Compare SDQuery DSI with the GridFTP default File DSI
  – Show the effectiveness of performance-model-based selection between direct access and memory filter
  – Measure the speedup from parallel streaming data transfer
• Datasets:
  – NetCDF: Parallel Ocean Program (POP)
  – HDF5: Mediterranean Ocean Data Base (MODB)
• Environment:
  – RI Cluster: 100 nodes, 8 cores 2.53 GHz Intel(R) Xeon Processors, 12 GB memory

SDQuery DSI vs. File DSI

• Compare the total execution time between two DSIs in different network environments

• File DSI (GridFTP default DSI):
  – Reads the entire data file and transfers it over the network
• Dataset:
  – 140 GB POP data file
  – TEMP.nc(time(10), depth(42), lat(2400), lon(3600))
• Three network environments:
  – LAN: 1 Gb/s bandwidth, 0.17 msec RTT
  – WAN: Avg. 200 Mb/s bandwidth, 24 msec RTT
  – WAN: Avg. 20 Mb/s bandwidth, 60 msec RTT

SDQuery vs. File DSI (1Gb)

• SDQuery Query Processing Time: query parsing and bitmap indexing time
• SDQuery Subset and Transfer Time: data subset fetching and transfer time
• File Read and Transfer Time: entire data file reading and transfer time

• Data file: 140 GB
• Input of SDQuery DSI:
  • 2000 queries covering different data subset percentages
• When the data subset percentage is below 50%, SDQuery DSI is faster, with a speedup of 1.26 to 9.41
• Otherwise, File DSI achieves better efficiency

SDQuery vs. File DSI (200 Mb)

• SDQuery Query Processing Time: query parsing and bitmap indexing time
• SDQuery Subset and Transfer Time: data subset fetching and transfer time
• File Read and Transfer Time: entire data file reading and transfer time

• Same data and same input
• Network transfer time becomes the main bottleneck
• SDQuery DSI: Query Process Time: 9% - 40% of Total Execution Time
• Compared to File DSI, SDQuery DSI achieves better efficiency in all cases, with a speedup of 1.15 to 29.07

SDQuery vs. File DSI (20 Mb)

• SDQuery Query Processing Time: query parsing and bitmap indexing time
• SDQuery Subset and Transfer Time: data subset fetching and transfer time
• File Read and Transfer Time: entire data file reading and transfer time

• A common wide-area network environment, where bandwidth is very limited
• Network transfer time becomes the dominant factor
• SDQuery DSI: Query Process Time: 1% - 9% of Total Execution Time
• SDQuery DSI achieves better efficiency in all cases, with a speedup of 1.21 to 81.32

Accuracy of Performance Model

• X axis: data subset percentage
• Y axis: data subset reading time only
• Direct Access, Memory Filter
• Direct Access (points): frequent data seeking, inefficient
• Direct Access (segments): average segment length: 300.36, speedup: 1.64 – 3.93
• Memory Filter: similar in all cases
• Direct Access (segments) and Memory Filter achieve the same performance when the subset percentage is around 62%
• Hybrid Access: makes the right choice in most cases (except 60% - 70%)

Speedup Using Parallel Streaming

• X axis: data subset percentage
• Y axis: data retrieval and transfer time
• Non-overlapping: data is sent back only after the whole subset is loaded into memory
• Benefits:
  • Parallel TCP streams
  • Parallel data retrieval
  • Overlap of data retrieval and transfer
• Dataset: 10.5 GB MODB
• Network speed: 200 Mb/s
• 1 stream allows data retrieval and data transfer to overlap; the speedup is 1.19 – 1.52 compared with non-overlapping
• Maximum speedup using 4 streams: 1.57 – 1.75
• Bandwidth is fully utilized

Conclusion

• The “Big Data” issue brings challenges for scientific data management
• SDQuery DSI: a GridFTP plug-in to support flexible data subsetting over HDF5 and NetCDF
• Seamless integration with the GridFTP server
• Performance model based data retrieval method
• Parallel streaming data retrieval and transfer

Contact Us If You’re Interested!

Yu Su
Email: [email protected]

Thanks

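An Example of Ocean Simulation

[Figure: POP.nc on the GridFTP server holds the variables TEMP, SALT, UVEL, and VVEL. A user says "I want to analyze TEMP within Indian Ocean!" Transferring the entire data file over the network is contrasted with transferring only the data subset, which is more efficient.]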