servicing seismic and oil reservoir simulation data...
TRANSCRIPT
Servicing Seismic and Oil Reservoir Simulation Data
through Grid Data ServicesSivaramakrishnan Narayanan, Tahsin Kurc,
Umit Catalyurek and Joel SaltzMultiscale Computing Lab
Biomedical Informatics DepartmentThe Ohio State Universityhttp://www.bmi.osu.edu
http://www.multiscalecomputing.org
VLDB-DMG'05 2
Joel SaltzGagan AgrawalUmit Catalyurek
Shannon Hastings Vijay S Kumar
Tahsin KurcSteve Langella
Scott OsterTony Pan
Benjamin RuttNarayanan Sivaramakrishnan,
Li WengMichael Zhang
Multiscale Computing Labhttp://www.multiscalecomputing.org
VLDB-DMG'05 3
AnalysisProduction rates, bypass
oil, net present value
WorkflowRun new reservoir
simulations
DataSeismic, well
pressures, reservoir simulations
Generate requests for new simulations, new
seismic studies
Obtain initial, boundary conditions, input parameters for
simulations
Store and index simulation results
Summary data from datasets
Spatio-temporal queries
• Simulate multiple realizations of multiple geostatistical models and production strategies
• Evaluate geologic uncertainty and production strategies simultaneously
• Enable on-demand exploration and comparison of multiple scenarios– Integration of a robust,
Grid-based computational and data handling infrastructure
– Distributed databases of reservoir and geophysical data
– Storage and computing resources at multiple institutions
Implementing effective oil and gas production
VLDB-DMG'05 4
Characteristics and Issues• Spatio-temporal datasets
– Simulations carried out/data captured on 3D meshes over many time steps
– Multiple data attributes per data point (gas pressure, oil saturation, seismic traces, etc).
• Very large datasets– Tens of gigabytes to 100+ TB data
• Lots of simulation runs– Up to thousands of runs for a study are possible
• Data can be stored in distributed collection of files • Distributed datasets
– Data may be captured at multiple locations by multiple groups– Simulations are carried out at multiple sites
• Common operations: subsetting, filtering, interpolations, projections, comparisons, frequency counts
VLDB-DMG'05 5
Data Management, Access and Integration
• Tracking of metadata associated with data– Metadata defining simulation parameters, mesh description,
files associated with simulations, etc.– Metadata defining seismic measurements (location, year,
files storing data, etc.)
• Support for data subsetting and filtering on file-based, distributed datasets
• Support for on-demand data product generation– Track metadata associated with data analysis workflows
• Grid data services and distributed querying– Make data and data products available through Grid service
interfaces
VLDB-DMG'05 6
Applications developers generally prefer storing data in files
Support high level queries on multi-dimensional distributed datasets
Many possible data abstractions, query interfacesGrid virtualized object-relational database or XML databaseGrid virtualized objects with user defined methods invoked to
access and process data
Data Virtualization
Our Approach• Support a basic SQL Select query with a virtual relational table
view or a virtual XML database view• A lightweight layer on top of datasets
– Runtime middleware carries out query execution, query planning
VLDB-DMG'05 7
Middleware Support• Data Virtualization: STORM
– Large data querying capabilities, layered on DataCutter– Distributed data virtualization – Indexing, Subsetting, Data Cluster/Decluster, Parallel Data Transfer
• Data Analysis/Processing Workflows: DataCutter– Component Framework for Combined Task/Data Parallelism– Filtering/Program coupling Service: Distributed C++ component
framework– On demand data product generation
• Distributed Metadata and Data Management: Mobius– Create, manage, version data definitions – Management of metadata and data instances– Data integration
• Grid Data Services (OGSA-DAI)– Defines services and interfaces that can be used by clients to specify
operations on data resources and data
VLDB-DMG'05 8
Data Management, Access, Integration
• Grid-level data services via OGSA-DAI• Management of data definitions and
metadata, XML virtualization via Mobius
• Object-relational virtualization and subsetting of file based datasets via STORM
• On-demand data product generation via DataCutter
• STORM, Mobius, DataCutter support data operations on heterogeneous collections of storage and compute clusters
Schema Management
Mobius
Data ProductGeneration
DataCutter
SQL Virtualizationof FilesSTORM
XML VirtualizationMetadata Management
Mobius
OGSA-DAI
OGSA-DAIOGSA-DAI
OGSA-DAI
Grid Protocols
VLDB-DMG'05 9
Data Management, Access, and Integration
Schema Management
Mobius
Data ProductGeneration
DataCutter
SQL Virtualizationof FilesSTORM
XML VirtualizationMetadata Management
Mobius
Grid-data Service (OGSA-DAI)Grid-data Service (OGSA-DAI)
Data ProductGeneration
DataCutter
SQL Virtualizationof FilesSTORM
XML VirtualizationMetadata Management
Mobius
Data ProductGeneration
DataCutter
SQL Virtualizationof FilesSTORM
XML VirtualizationMetadata Management
Mobius
Grid-data Service (OGSA-DAI)Grid Service
ProtocolsGrid-data Service (OGSA-DAI)
Seismic Data
Simulation Data
Seismic/Simulation Data
VLDB-DMG'05 10
Array #
Component #Component #
Component #
Sp (or CDP) #& position
Receiver group # & positionReceiver group #
& positionReceiver group # & position
50.00
50.00
50.00
Component #Component #
Component #
Array #
Receiver group # & positionReceiver group #
& positionReceiver group # & position
50.00
50.00
50.00
Component #Component #
Component #
Array #
Receiver group # & positionReceiver group #
& positionReceiver group # & position
50.00
50.00
50.00
Data Querying and ProcessingSeismic Data
Geostatistics
Model 1
Model 2
Model n
…
…m realizations
Well Pattern p
Production Strategies
Well Pattern 1
…
Well Pattern 2
Reservoir Simulations
VLDB-DMG'05 11
STORMSupport efficient selection of the data of interest from
distributed scientific datasets and transfer of data from storage clusters to compute clusters
• Data Subsetting Model– Virtual Tables– Select Queries– Distributed Arrays
SELECT <DataElements>FROM Dataset-1, Dataset-2,…, Dataset-nWHERE <Expression> AND <Filter(<DataElement>)>GROUP-BY-PROCESSOR ComputeAttribute(<DataElement>)
VLDB-DMG'05 12
STORM Services• Query• Meta-data• Indexing• Data Source• Filtering• Partition
Generation• Data Mover
VLDB-DMG'05 13
Grid Data Resource• Grid has emerged as an integrated infrastructure for distributed
computation• OGSA-DAI initiative is to deliver high level data management
functionality for the Grid. – Defines services and interfaces that can be used by clients to specify
operations on data resources and data
• OGSA-DAI services can be configured to expose a specific database management system.
• To be a GDS, a service must accept perform documents and return results– Interpretation of perform documents is open to interpretation– Traditionally wrap SQL queries
VLDB-DMG'05 14
STORM Data Resource
Extractor
Filter
Data Mover
Storm Daemon
JDBC Driver
GDS
STORM instance
Data Resource
VLDB-DMG'05 15
Experimental Setup
• All nodes running linux• Gigabit switch
7.3 TB FAStT600 disk array
4 GB memory2 Xeon 2.4 GHz16Xio
1.5 TB local disk8 GB memoryDual 1.4 GHz AMD Optron
8 nodesmob
Mob,0124 * X / 1MX24 bytes6TXm
Xio,161,0562474240 bytes16Seismic
Mob,033153,84084 bytes21Oil Reservoir
Cluster, Num nodes
Dataset (GB)Records (millions)
Record SizeAttributesDataset
VLDB-DMG'05 16
STORM Results
STORM I/O Performance
0
500
1000
1500
2000
2500
3000
3500
4000
4500
1 2 4 8 16
# XIO nodes
Ban
dwid
th (M
B/s
)
2 Threads4 ThreadsMax
Seismic Datasets10-25GB per file. About 30-35TB of Data.
VLDB-DMG'05 17
Comparison with MySQL - 1• Varying table size. • Per tuple cost is lesser
0
20
40
60
80
100
120
0 50 100 150 200
Table Size (million rows)
Exec
utio
n Ti
me
(sec
s)
MySQL-cold
MySQL-hot
STORM-cold
STORM-hot
VLDB-DMG'05 18
Comparison with MySQL - 2• Varying query size• Also compare them as data resources
0
10
20
30
40
0 250000 500000 750000 1000000
Query Size (num of records)
Exec
utio
n Ti
me
(sec
s)
MySQL
STORM
MySQL-DAI
STORM-DAI
VLDB-DMG'05 19
Oil Reservoir Data Results• Improvements due to: treating records as array of bytes, combining results
at client
0
40
80
120
160
0 100000 200000 300000 400000
Query Size (number of records)
Exec
utio
n Ti
me
(sec
s)
STORM
STORM-DAI-o
STORM-DAI-1
STORM-DAI-50
3 DAIs
VLDB-DMG'05 20
Seismic Data Results
0
10
20
30
40
50
0 2000 4000 6000 8000 10000 12000 14000
Query Size (number of records)
Exec
utio
n Ti
me
(sec
s)
STORM
STORM-DAI-o
STORM-DAI-1
2-DAIs
• 96 x 11GB files on 16 nodes
VLDB-DMG'05 21
Conclusions• Overview of work related to Large Scale Scientific Data
Management at Multi-Scale Computing Lab• Exposed STORM as a Grid Data Service
– Results on use case: Oil reservoir management
• For more info / to download STORM, DataCutter, Mobiushttp://www.multiscalecomputing.org
or
http://www.bmi.osu.edu