Starting “small” to go Big: Building a Living Database
![Page 1: Starting “small” to go Big: Building a Living Database](https://reader036.vdocuments.mx/reader036/viewer/2022081404/5f056e087e708231d412ec57/html5/thumbnails/1.jpg)
Solutions for Today | Options for Tomorrow
Jennifer Bauer, Jenny DiGiulio, Devin Justman, Lucy Romeo, Kelly Rose, Patrick Wingo
Starting “small” to go Big: Building a Living Database
Michael Sabbatino1,2, Baker, D.V. “Vic”3,4, Rose, K.1, Romeo, L.1,2, Bauer, J.1, and Barkhurst, A.3,4
POC: [email protected]
1 US Department of Energy, National Energy Technology Laboratory, Albany, OR
2 AECOM, Albany, OR
3 Mid-Atlantic Technology Research & Innovation Center (MATRIC), Morgantown, WV
4 US Department of Energy, National Energy Technology Laboratory, Morgantown, WV
![Page 2: Starting “small” to go Big: Building a Living Database](https://reader036.vdocuments.mx/reader036/viewer/2022081404/5f056e087e708231d412ec57/html5/thumbnails/2.jpg)
Challenges & Needs of Scientific Data
Data Access
• ~80% loss of published data after 20 years
Data Discovery
• 20% public data versus 80% private
Data Interoperability
• Variety of data makes it difficult to create, exchange, & use data across different applications and systems
Data Analytics & Visualization
• Requires advanced computational capabilities, algorithms, & large data stores to analyze these data
80% Dark Data
Working up the Data Science Hierarchy of Needs (bottom to top):
• Collect: Instrumentation, logging, sensors, external data, user generated content
• Move & Store: Reliable data flow, infrastructure, pipelines, structured and unstructured data storage
• Explore & Transform: Cleaning, anomaly detection, prep
• Aggregate & Label: Analytics, metrics, segments, aggregates, features, training data
• Learn & Optimize: A/B testing, experimentation, simple ML algorithms
• AI & Deep Learning
![Page 3: Starting “small” to go Big: Building a Living Database](https://reader036.vdocuments.mx/reader036/viewer/2022081404/5f056e087e708231d412ec57/html5/thumbnails/3.jpg)
Discovered & integrated open data sources of information related to oil & gas infrastructure across the globe
Collect
https://edx.netl.doe.gov/dataset/global-oil-gas-features-database
“small” Beginnings: Developing a Global Oil & Gas Database
Machine Learning Automated Approach: A tool that scans “seed” resources and identifies relevant keywords, then crawls the web and parses the data for integration
• >700 datasets
• >4 million features
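The seed-and-crawl idea described on this slide can be illustrated with a minimal sketch. This is not the NETL tool: it substitutes simple word-frequency counting for the machine learning step, works on in-memory strings rather than crawled web pages, and every name in it (`STOP_WORDS`, `seed_keywords`, `relevance`, the sample texts) is hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real tool would use a much fuller one.
STOP_WORDS = {"the", "and", "of", "a", "in", "for", "to", "is", "on", "with"}

def tokenize(text):
    """Lowercase the text and return its alphabetic words minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def seed_keywords(seed_texts, top_n=5):
    """Return the top_n most frequent words across the seed resources."""
    counts = Counter()
    for text in seed_texts:
        counts.update(tokenize(text))
    return {word for word, _ in counts.most_common(top_n)}

def relevance(page_text, keywords):
    """Fraction of the seed keywords that appear in a candidate page."""
    if not keywords:
        return 0.0
    return len(keywords & set(tokenize(page_text))) / len(keywords)

# Assumed sample data standing in for seed resources and crawled pages.
seeds = [
    "Pipeline and well locations for oil and gas infrastructure",
    "Oil and gas wells, pipelines, and refineries by country",
]
keywords = seed_keywords(seeds)
pages = {
    "page_a": "Map of gas pipeline routes and oil wells",
    "page_b": "Recipes for a quick weeknight dinner",
}
scores = {name: relevance(text, keywords) for name, text in pages.items()}
```

Pages scoring above some chosen cutoff would be passed on for parsing and integration; the cutoff itself is a tuning decision this sketch leaves open.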
![Page 4: Starting “small” to go Big: Building a Living Database](https://reader036.vdocuments.mx/reader036/viewer/2022081404/5f056e087e708231d412ec57/html5/thumbnails/4.jpg)
EDX - A Virtual Library & Laboratory for Energy Science
• Virtualizing team analytics
• Continuing innovations to connect researchers to online Earth-Energy system resources
• Increasing number of tools & apps for use in team workspaces
Move & Store
https://edx.netl.doe.gov
Data Workflows & Structure
• Custom “smart search” tool in development
• Digital spatial team “notebook”
• Auto-indexing algorithm that analyzes your search and helps recommend other items
EDX vs Dark Data
80% Dark Data
EDX Smart Search - A machine learning, big data tool for rapid, online, .Zip, & FTP spatial & non-spatial data mining with Hadoop + Bing + ESRI
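One simple way to picture how an auto-indexed search could recommend other items is set overlap between the search terms and each item's index terms. The sketch below uses Jaccard similarity as a stand-in; it is an assumption for illustration, not the EDX algorithm, and the dataset names and index terms are made up.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of index terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(query_terms, index, top_n=2):
    """Rank indexed items by term overlap with the query; drop zero-score items."""
    scored = [(jaccard(query_terms, terms), name) for name, terms in index.items()]
    # Sort by descending score, then by name so ties are deterministic.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:top_n]]

# Hypothetical index: dataset name -> terms an auto-indexer might have extracted.
index = {
    "offshore_wells": {"oil", "gas", "wells", "offshore"},
    "pipeline_routes": {"gas", "pipelines", "routes"},
    "carbon_storage_sites": {"carbon", "storage", "reservoir"},
}
suggestions = recommend({"gas", "wells"}, index)
```

Here the search for "gas" and "wells" ranks the wells dataset first and the pipeline dataset second, and leaves out the carbon storage dataset because it shares no terms with the query.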
![Page 5: Starting “small” to go Big: Building a Living Database](https://reader036.vdocuments.mx/reader036/viewer/2022081404/5f056e087e708231d412ec57/html5/thumbnails/5.jpg)
Explore & Transform
The Living Database
• Store & Share Data in a Structured Secure Database Environment
• Reduce Redundant Acquisition
• Direct Data Access (not file based storage)
• Consistent Data with Staff Turnover
• Enhance Collaboration
• Curation of data and knowledge
• Allows Direct Analysis from Database
Storing Databases with different data types, formats, & resolutions
Includes Data workflow, infrastructure, pipelines, structured & unstructured data
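To make "direct data access" and "direct analysis from database" concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module with an in-memory database. The slides do not say which database engine the Living Database uses, so SQLite, the `features` table, and its columns are all assumptions chosen only to keep the example self-contained.

```python
import sqlite3

# In-memory database standing in for a shared, structured data store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features ("
    "id INTEGER PRIMARY KEY, kind TEXT, country TEXT, source TEXT)"
)

# Hypothetical records integrated from several source datasets.
rows = [
    ("well", "US", "dataset_a"),
    ("well", "US", "dataset_b"),
    ("pipeline", "US", "dataset_a"),
    ("well", "CA", "dataset_c"),
]
conn.executemany(
    "INSERT INTO features (kind, country, source) VALUES (?, ?, ?)", rows
)
conn.commit()

# Analysis runs as a query against the database, with no file export step.
counts = conn.execute(
    "SELECT country, kind, COUNT(*) FROM features "
    "GROUP BY country, kind ORDER BY country, kind"
).fetchall()
```

Because every analyst queries the same tables, the counts stay consistent across users and do not depend on who holds which copy of a file.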
Figure labels: People, Data, Lifecycle, Apps, Research, External, Apps
![Page 6: Starting “small” to go Big: Building a Living Database](https://reader036.vdocuments.mx/reader036/viewer/2022081404/5f056e087e708231d412ec57/html5/thumbnails/6.jpg)
• Developing tools & approaches to manage multiple heterogeneous datasets
• Develop a probabilistic approach to assess scientific data using big data analyses
• Develop stochastic approaches to reduce uncertainty
Improve joint analysis of multiple datasets
Focus on advancing “Big Data” mining, machine learning, and advanced geoprocessing computing
Aggregate & Label
Tools, Analytics, & Metrics
Select relevant datasets
Combine data and tools to…
Highlight resultant data and analysis and reuse in further research
Evaluate correlations and spatio-temporal trends
![Page 7: Starting “small” to go Big: Building a Living Database](https://reader036.vdocuments.mx/reader036/viewer/2022081404/5f056e087e708231d412ec57/html5/thumbnails/7.jpg)
EDX continues to evolve in response to the needs of its users and NETL’s knowledge transfer goals
Future Big Data Development & Analysis
Learn & Optimize
• EDX Cloud Services
• Living Database
• Common Operating Platform for Data Analytics
• GeoCube Spatial Data Viewer
• Fuzzy Logic Analytics (SIMPA)
• AWS Development
• Integration with decades of DOE R&D
• Federating Open Source data
• EDX & GeoCube (search and location)
• ID Data gaps in subsurface puzzle
GOGI: Millions of Records
Oil & Gas, Geothermal Data & Resources: Millions of Records
Carbon Storage Data & Resources: Millions of Records
Employing “smart” search tools to include open resources: Billions of Records
*These attributes and data sources are evolving quickly with implementation of new tools and engagement of key stakeholders
![Page 8: Starting “small” to go Big: Building a Living Database](https://reader036.vdocuments.mx/reader036/viewer/2022081404/5f056e087e708231d412ec57/html5/thumbnails/8.jpg)
Advanced computer science & research
Developing Schema Matching AI
• Variety of data sources with diverse data schemas
• Manual schema matching is time consuming & inefficient
• Plan to develop and use existing machine learning algorithms to match disparate data schemas:
• Schema Level
• Element Level
• Structure Level
• Linguistic Matching
• Syntactic Techniques
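As a rough illustration of element-level matching with a syntactic technique, the sketch below pairs column names from two schemas by string similarity using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in, not the planned machine learning approach; the column names, the `match_schemas` helper, and the 0.6 threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase a column name and strip separators: 'Well_Name' -> 'wellname'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def similarity(a, b):
    """String similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_schemas(source_cols, target_cols, threshold=0.6):
    """Map each source column to its most similar target column, if above threshold."""
    matches = {}
    for src in source_cols:
        best = max(target_cols, key=lambda tgt: similarity(src, tgt))
        if similarity(src, best) >= threshold:
            matches[src] = best
    return matches

# Hypothetical schemas from two data sources describing the same kind of feature.
schema_a = ["Well_Name", "operator", "Depth_ft"]
schema_b = ["wellname", "operator_name", "depth_feet", "spud_date"]
matches = match_schemas(schema_a, schema_b)
```

Name similarity alone breaks down for short abbreviations and for synonyms with no shared characters, which is one reason linguistic matching and structure-level evidence are listed alongside syntactic techniques.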
~ Thank you! ~
Michael [email protected]
![Page 9: Starting “small” to go Big: Building a Living Database](https://reader036.vdocuments.mx/reader036/viewer/2022081404/5f056e087e708231d412ec57/html5/thumbnails/9.jpg)
Questions?
Come check out this awesome poster!