MITIGATING DISASTERS WITH DEEP LEARNING FROM TWITTER?



MOTIVATION

Uncertainty is inevitable in any attempt to predict earthquakes, as the ultimate cause (mantle convection) is inherently non-deterministic - in situ measurements, however, may reduce this uncertainty

The availability of actionable observations is time critical to effective and efficient communications of advisories and warnings

The establishment of a cause (earthquake) - effect (tsunami) relationship remains outstanding, and is complicated by multiple factors (e.g., tectonic setting)

Far-field estimates of tsunami propagation (pre-computed) and coastal inundation (computed in real time), however, have proven to be extremely accurate - results successfully combine data from a distributed array of deep-ocean tsunami detection buoys with a forecasting model

THE 6Vs OF SCIENTIFIC VS. SOCIAL NETWORKING DATA

CONCLUSIONS

Credible tweets could be transformative - a Big Data source that can complement traditional sources (e.g., scientific instruments)

Working with 6V Twitter data can be challenging, though it also presents interesting opportunities

Curation of training data is extremely important, but also extremely time consuming (as this is a manual process)

Current research emphasizes Deep Learning, BUT RDF/OWL semantics will need to play a role ultimately

Approach can be generalized for application to natural and anthropogenic disasters of all kinds

ACCOUNTING FOR OIL SPILLS AND OTHER DISASTERS …

Energy exploration via reflection seismology provides the fundamental source of data that is subsequently processed and interpreted for the identification of potential petroleum reservoirs

Reservoir simulation is used to engineer the extraction of petroleum reserves from reservoirs

Drilling is used to ‘truth’ the results provided by interpretations and simulations prior to production extraction

SOPs ensure extraction of oil from a production reservoir is routinely monitored and reported upon - e.g., to quantify rig safety and output (barrels/day)

From exploration to extraction, this is a data-rich workflow

Additional data sources become relevant when disasters occur (e.g., oil spills) - from re-purposed scientific instruments (e.g., weather satellites) to social media (e.g., Twitter, Instagram, Snapchat, ...)

Data-rich workflows can generate problems in Big Data Analytics

Deep Learning Pipeline

THE OPPORTUNITY FOR SEMANTICS

A feature vector is a feature vector - it is devoid of semantics

Ignores inherent, overall credibility of a Tweet - e.g., as quantified by TweetCred

Twitter metadata (handles, hashtags and URLs) contributes equally to Twitter data (unstructured text that comprises the body of a Tweet) in constructing feature vectors - i.e., the semantic value of Twitter metadata is also ignored by Deep Learning
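To make the distinction between Twitter metadata and Tweet body concrete, here is a minimal Python sketch that separates handles, hashtags and URLs from the remaining text. The function name `split_tweet` and the simplified regular expressions are assumptions made for illustration; they are not part of the pipeline described here, and real Tweets would need more robust parsing.

```python
import re

# Simplified entity patterns (assumptions, not Twitter's official grammar).
URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def split_tweet(text):
    """Return (metadata, body): metadata holds handles, hashtags and URLs."""
    urls = URL_RE.findall(text)
    # Strip URLs first so '#' or '@' inside a URL is not mistaken for an entity.
    remainder = URL_RE.sub(" ", text)
    handles = HANDLE_RE.findall(remainder)
    hashtags = HASHTAG_RE.findall(remainder)
    body = HASHTAG_RE.sub(" ", HANDLE_RE.sub(" ", remainder))
    body = " ".join(body.split())  # collapse the gaps left by removed entities
    return {"handles": handles, "hashtags": hashtags, "urls": urls}, body


metadata, body = split_tweet(
    "5.0 earthquake! near 84km SW of Iquique, Chile "
    "http://t.co/mmFokGQWT7 #earthquake"
)
print(metadata)
print(body)
```

Once separated, metadata and body could be weighted or modelled differently rather than hashed into one undifferentiated feature vector.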

The W3C’s Resource Description Framework (RDF) facilitates the representation of metadata and thus exposes semantics

The W3C’s Web Ontology Language (OWL) accounts for domain specifics - disambiguates use of overloaded terms (e.g., “earthquake”) in different contexts (e.g., geophysics vs. movies vs. …)

Deep Learning in combination with RDF/OWL semantics has the potential to produce learned models with knowledge represented
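As a rough illustration of how RDF could expose Tweet metadata, the Python sketch below serializes a few metadata fields as N-Triples (subject, predicate, object) statements. The `http://example.org/` namespace, the predicate names (`createdAt`, `hasHashtag`, `linksTo`) and the identifier `example-1` are all hypothetical; a real deployment would use an agreed vocabulary and, for domain specifics, an OWL ontology.

```python
# Hypothetical namespaces and predicate names, used only for illustration.
TWEET_NS = "http://example.org/tweet/"
VOCAB_NS = "http://example.org/vocab#"


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def tweet_to_ntriples(tweet_id, created_at, hashtags, urls):
    """Serialize selected Tweet metadata as a list of N-Triples lines."""
    subject = f"<{TWEET_NS}{tweet_id}>"
    lines = [f'{subject} <{VOCAB_NS}createdAt> "{escape_literal(created_at)}" .']
    for tag in hashtags:
        lines.append(f'{subject} <{VOCAB_NS}hasHashtag> "{escape_literal(tag)}" .')
    for url in urls:
        # URLs become IRI objects rather than plain string literals.
        lines.append(f"{subject} <{VOCAB_NS}linksTo> <{url}> .")
    return lines


for line in tweet_to_ntriples(
    "example-1",  # hypothetical identifier, not a real Tweet ID
    "Wed Jun 04 20:29:33 +0000 2014",
    ["earthquake"],
    ["http://t.co/mmFokGQWT7"],
):
    print(line)
```

Unlike a hashed feature vector, each statement keeps its meaning explicit: the predicate says what role the hashtag or URL plays for the Tweet.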

MITIGATING DISASTERS WITH DEEP LEARNING FROM TWITTER?

WWW.UNIVA.COM

Deep Learning pipeline implemented using the Machine Learning Library (MLlib) from Apache Spark scaled onto a converged Big Data/HPC cluster via Univa Universal Resource Broker for featurization, training, evaluation and operational use.

Copyright © 2017. Univa® and Grid Engine® are registered trademarks of Univa Corporation

Data extracted from Twitter via a Perl script that targeted the hashtag #earthquake

Spark MLlib HashingTF establishes frequency-based usage
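The idea behind hashed term frequencies can be sketched in plain Python, without Spark. Each token is hashed to an index in a fixed-length vector and the count at that index is incremented. The function `hashing_tf`, the vector length of 16 and the use of `zlib.crc32` as the hash are assumptions chosen for a reproducible illustration; Spark's HashingTF uses its own hash function and a much larger default dimension.

```python
import zlib


def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector via the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        # crc32 is a stand-in hash picked for determinism (assumption);
        # distinct tokens may collide and share an index.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


tokens = "5.0 earthquake! near 84km sw of iquique, chile #earthquake".split()
vector = hashing_tf(tokens)
print(vector)
print(sum(vector) == len(tokens))  # every token is counted exactly once
```

Note that the vector records only how often each index was hit; which token produced a count, and whether it was a hashtag or body text, is no longer recoverable.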

Spark MLlib Logistic Regression with SGD classifies spam vs. ham
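For readers unfamiliar with the technique, the following is a minimal pure-Python sketch of logistic regression trained by stochastic gradient descent. It is not Spark's implementation: the function names, the learning rate, the epoch count and the toy two-feature vectors are all assumptions made up for illustration, with label 1 standing for spam and 0 for ham.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_sgd(samples, labels, epochs=200, learning_rate=0.1, seed=0):
    """Fit weights and bias with per-sample SGD on the log loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss with respect to the logit
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (spam) if the predicted probability is at least 0.5, else 0 (ham)."""
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0


# Toy term-frequency vectors (made up): feature 0 dominates in spam, feature 1 in ham.
samples = [[3, 0], [2, 1], [0, 3], [1, 2]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic_sgd(samples, labels)
print([predict(weights, bias, x) for x in samples])
```

In the pipeline described here, the inputs would be the hashed term-frequency vectors of the curated Tweets rather than hand-written toy vectors.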

Recent ‘earthquake’ data from Twitter used to evaluate model
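Which metrics were used for evaluation is not stated here. As one plausible sketch, the Python below compares predicted labels against manually assigned ones and reports accuracy, precision and recall for the spam class. The function name `evaluate` and the five example labels are assumptions invented for illustration, not results from the poster.

```python
def evaluate(predicted, actual, positive="spam"):
    """Compute accuracy, precision and recall for the given positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    tn = sum(1 for p, a in pairs if p != positive and a != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels for five held-out Tweets.
predicted = ["spam", "ham", "ham", "spam", "ham"]
actual = ["spam", "ham", "spam", "spam", "ham"]
print(evaluate(predicted, actual))
```

For disaster use, recall on ham (credible Tweets) may matter as much as spam precision, since a discarded credible Tweet is a lost observation.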

Twitter data manually curated into ‘ham’ and ‘spam’, then represented in-memory via Spark RDDs

Pipeline figure labels: Training Data (SPAM, HAM), Featurization, Feature Vectors, Training, Model, Evaluation, Best Model

Software stack figure labels: 1000+ Apps and Data Sources; Univa Universal Resource Broker; Univa Grid Engine Scheduler; API, Command Line and Spark UIs; Data Frames and ML Pipelines; MLlib, GraphX, Spark Streaming and Spark SQL; Spark Core

The 6Vs: Traditional Scientific Data vs. Twitter Data

Volume: traditional scientific data is small'ish, finite; Twitter data is BIG, ‘infinite’

Variety: traditional scientific data is semi-structured, restricted; Twitter data is unstructured, unrestricted - except for handles, hashtags & URLs (pages, images)

Velocity: traditional scientific data is slow, sampled; Twitter data is fast, streamed

Veracity: biases, noise & abnormalities

Validity: accuracy & correctness

Volatility: traditional scientific data is low (stationary, irreplaceable); Twitter data is high? (mobile? disposable?)

Sample Tweets:

Created at: Wed Jun 04 20:29:33 +0000 2014
5.0 earthquake! Thu Jun 05 02:04:27 GMT+09:00 2014 near 84km SW of Iquique, Chile http://t.co/mmFokGQWT7 #earthquake

Created at: Wed Jun 04 20:30:13 +0000 2014
The #earthquake continues: Latest via @Spectator_CH /@YouGov -#Labour 36 #Tories 32%, LD 8%, #Ukip 14%. Implied Labour majority- 42 .

Created at: Wed Jun 04 20:31:35 +0000 2014
#terremoto ML 2.7 CENTRAL ITALY: Magnitude ML 2.7 Region CENTRAL ITALY Date time 2014-06-04 20:01:33.9 UTC... http://t.co/Y141Ovu6kP

Created at: Tue Jun 10 12:22:34 +0000 2014
RT @TheRock: Just wrapped a massive post earthquake scene for SAN ANDREAS. To the hundreds of background actors/extras.. THANK U for all yo...

April 16, 2016

01:25 JST: Magnitude 7.1 earthquake, Kyushu area, 10 km depth

01:27 JST: Tsunami advisories issued (imminent arrival, ~1 m maximum height)

01:29 JST: High-tide amplification advisory

04:43 - 04:54 JST: High-tide times

01:40 JST: Estimated tsunami first arrives

02:14 JST: Tsunami advisories lifted

Japan Meteorological Agency
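The time-critical nature of advisories can be read off the April 16, 2016 timeline with simple arithmetic. The Python sketch below computes the minutes elapsed between the earthquake and each subsequent event, using only the times listed above; the names `minutes_between`, `EARTHQUAKE` and `EVENTS` are illustrative assumptions.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"

# Times (JST) as listed in the timeline for April 16, 2016.
EARTHQUAKE = "2016-04-16 01:25"
EVENTS = {
    "tsunami advisories issued": "2016-04-16 01:27",
    "estimated first tsunami arrival": "2016-04-16 01:40",
    "tsunami advisories lifted": "2016-04-16 02:14",
}


def minutes_between(start, end):
    """Return whole minutes elapsed between two timestamps in TIME_FORMAT."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds() // 60)


for label, when in EVENTS.items():
    print(f"{label}: {minutes_between(EARTHQUAKE, when)} min after the earthquake")
```

With these times, advisories followed the earthquake by 2 minutes and preceded the estimated first arrival by 13 minutes, which is the window in which any additional data source would have to deliver actionable observations.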