spatiotemporal stream mining using emm

4/24/09 - KSU Spatiotemporal Stream Mining Using EMM Margaret H. Dunham Southern Methodist University Dallas, Texas 75275 [email protected] This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841 1


Page 1: Spatiotemporal Stream Mining Using EMM

4/24/09 - KSU

Spatiotemporal Stream Mining Using EMM

Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
[email protected]

This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841

Page 2: Spatiotemporal Stream Mining Using EMM

Completely Data Driven Model

WARNING

No assumptions about data
We only know the general format of the data

THE DATA WILL TELL US WHAT THE MODEL SHOULD LOOK LIKE!

Page 3: Spatiotemporal Stream Mining Using EMM

Motivation

A growing number of applications generate streams of data.
  Computer network monitoring data
  Call detail records in telecommunications (Cisco VoIP 2003)
  Highway transportation traffic data (MnDot 2005)
  Online web purchase log records (JCPenney 2003, Travelocity 2005)
  Sensor network data (Ouse, Derwent 2002)
  Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions.

Page 4: Spatiotemporal Stream Mining Using EMM

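EMM Build

[Figure: successive snapshots of an EMM being built from a stream of input vectors. The snapshots show graphs with one node (N1), two nodes (N1, N2) and three nodes (N1, N2, N3); arcs carry transition probabilities such as 1/3, 2/3, 1/2 and 1/1.]

Input vectors shown in the figure:
  <18,10,3,3,1,0,0>
  <17,10,2,3,1,0,0>
  <16,9,2,3,1,0,0>
  <14,8,2,3,1,0,0>
  <14,8,2,3,0,0,0>
  <18,10,3,3,1,1,0>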

Page 5: Spatiotemporal Stream Mining Using EMM

Spatiotemporal Stream Mining Using EMM

Spatiotemporal Stream Data
EMM vs MM vs other dynamic MM techniques
EMM Overview
EMM Applications

Page 6: Spatiotemporal Stream Mining Using EMM

Spatiotemporal Environment

Observations arriving in a stream
At any time, t, we can view the state of the problem as represented by a vector of n numeric values:

Vt = <S1t, S2t, ..., Snt>

      V1    V2    …    Vq
S1    S11   S12   …    S1q
S2    S21   S22   …    S2q
…     …     …     …    …
Sn    Sn1   Sn2   …    Snq

Column axis: Time

Page 7: Spatiotemporal Stream Mining Using EMM

Data Stream Modeling Requirements

Single pass: Each record is examined at most once
Bounded storage: Limited memory for storing synopsis
Real-time: Per record processing time must be low
Summarization (synopsis) of data
Use data NOT SAMPLE
Temporal and spatial
Dynamic
Continuous (infinite stream)
Learn
Forget
Sublinear growth rate - Clustering

Page 8: Spatiotemporal Stream Mining Using EMM

MM

A first order Markov Chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process is based solely on the current state.

A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
  S = {N1, N2, …, Nm}, and
  A = {Lij | i ∈ 1, 2, …, m, j ∈ 1, 2, …, m}, and
  each arc, Lij = <Ni, Nj>, is labeled with a transition probability Pij = P(Nj | Ni).
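As a small illustration of the definition above, the transition probabilities Pij = P(Nj | Ni) of a first order Markov model can be estimated from one observed state sequence by counting transitions. This is only a sketch: the function name `transition_probabilities` and the example sequence are made up for illustration and do not come from the slides.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P(next | current) from one observed state sequence."""
    arc_counts = defaultdict(Counter)
    # Count every observed transition current -> next.
    for cur, nxt in zip(states, states[1:]):
        arc_counts[cur][nxt] += 1
    probs = {}
    for cur, outgoing in arc_counts.items():
        total = sum(outgoing.values())
        # Normalize the outgoing counts of each state so they sum to 1.
        probs[cur] = {nxt: c / total for nxt, c in outgoing.items()}
    return probs


# Hypothetical sequence of visited states (not taken from the slides).
example = ["N1", "N2", "N1", "N3", "N1", "N2"]
P = transition_probabilities(example)
```

Because the model is first order, only the current state is needed to look up the distribution over the next state.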

Page 9: Spatiotemporal Stream Mining Using EMM

Problem with Markov Chains

The required structure of the MC may not be certain at the model construction time.
As the real world being modeled by the MC changes, so should the structure of the MC.
Not scalable – grows linearly with the number of events.

Our solution:
  Extensible Markov Model (EMM)
  Cluster real world events
  Allow Markov chain to grow and shrink dynamically

Page 10: Spatiotemporal Stream Mining Using EMM

Extensible Markov Model (EMM)

Time Varying Discrete First Order Markov Model
Nodes (vertices) are clusters of real world observations.
Learning continues during application phase.
Learning:
  Transition probabilities between nodes
  Node labels (centroid/medoid of cluster)
  Nodes are added and removed as data arrives
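The idea of nodes as clusters with learned transition counts can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the class name `SimpleEMM`, the use of Euclidean distance with a fixed distance threshold, and the running-mean centroid update are all assumptions made here. The six vectors are the ones shown on the EMM Build slide, but their arrival order and the threshold value of 2.0 are also assumed.

```python
import math


class SimpleEMM:
    """Toy EMM: nodes are cluster centroids, arcs carry transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold   # assumed: max distance to join a node
        self.centroids = []          # one centroid per node
        self.node_counts = []        # observations assigned to each node
        self.arc_counts = {}         # (i, j) -> number of i -> j transitions
        self.current = None          # index of the current node

    def _nearest(self, v):
        best, best_d = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.dist(v, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def build(self, v):
        """Consume one observation vector; update nodes and arcs."""
        node, d = self._nearest(v)
        if node is None or d > self.threshold:
            # No close enough cluster: create a new node.
            self.centroids.append(list(v))
            self.node_counts.append(0)
            node = len(self.centroids) - 1
        else:
            # Update the centroid as a running mean of its members.
            n = self.node_counts[node]
            self.centroids[node] = [
                (c * n + x) / (n + 1) for c, x in zip(self.centroids[node], v)
            ]
        self.node_counts[node] += 1
        if self.current is not None:
            key = (self.current, node)
            self.arc_counts[key] = self.arc_counts.get(key, 0) + 1
        self.current = node
        return node

    def transition_probability(self, i, j):
        out = sum(c for (a, _), c in self.arc_counts.items() if a == i)
        return self.arc_counts.get((i, j), 0) / out if out else 0.0


# Vectors from the EMM Build slide; order and threshold are assumptions.
stream = [
    (18, 10, 3, 3, 1, 0, 0), (17, 10, 2, 3, 1, 0, 0), (16, 9, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 1, 0, 0), (14, 8, 2, 3, 0, 0, 0), (18, 10, 3, 3, 1, 1, 0),
]
emm = SimpleEMM(threshold=2.0)
for vec in stream:
    emm.build(vec)
```

With a different threshold or similarity measure the same stream yields a different number of nodes, which is the trade-off the later threshold tables explore.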

Page 11: Spatiotemporal Stream Mining Using EMM

Related Work

Splitting Nodes in HMMs
  Create new states by splitting an existing state
  M.J. Black and Y. Yacoob, “Recognizing facial expressions in image sequences using local parameterized models of image motion,” Int. Journal of Computer Vision, 25(1), 1997, 23-48.
Dynamic Markov Modeling
  States and transitions are cloned
  G. V. Cormack, R. N. S. Horspool. “Data compression using dynamic Markov Modeling,” The Computer Journal, Vol. 30, No. 6, 1987.
Augmented Markov Model (AMM)
  Creates new states if the input data has never been seen in the model, and transition probabilities are adjusted
  Dani Goldberg, Maja J Mataric. “Coordinating mobile robot group behavior using a model of interaction dynamics,” Proceedings, the Third International Conference on Autonomous Agents (Agents ’99), Seattle, Washington

Page 12: Spatiotemporal Stream Mining Using EMM

EMM vs AMM

Our proposed EMM model is similar to AMM, but is more flexible:
  EMM continues to learn during the application phase.
  The EMM is a generic incremental model whose nodes can have any kind of representatives.
  State matching is determined using a clustering technique.
  EMM not only allows the creation of new nodes, but also the deletion (or merging) of existing nodes. This allows the EMM model to “forget” old information which may not be relevant in the future. It also allows the EMM to adapt to any main memory constraints for large scale datasets.
  EMM performs one scan of data and therefore is suitable for online data processing.

Page 13: Spatiotemporal Stream Mining Using EMM

EMM Operations

Input: EMM
Output: EMM’

EMM Build – Modify/add nodes/arcs based on input observations
EMM Prune – Remove nodes/arcs
EMM Merge – Combine multiple EMM nodes
EMM Split – Split a node into multiple nodes
EMM Age – Modify relative weights of old versus new observations
EMM Combine – Merge multiple EMMs by merging specific states and transitions.

Page 14: Spatiotemporal Stream Mining Using EMM

Example from rEMM (R Package Available)

      Loc_1  Loc_2  Loc_3  Loc_4  Loc_5  Loc_6  Loc_7
 1       20     50    100     30     25      4     10
 2       20     80     50     20     10     10     10
 3       40     30     75     20     30     20     25
 4       15     60     30     30     10     10     15
 5       40     15     25     10     35     40      9
 6        5      5     40     35     10      5      4
 7        0     35     55      2      1      3      5
 8       20     60     30     11     20     15     10
 9       45     40     15     18     20     20     15
10       15     20     40     40     10     10     14
11        5     45     55     10     10     15      0
12       10     30     10      4     15     15     10

Courtesy Mike Hahsler

Page 15: Spatiotemporal Stream Mining Using EMM

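EMM Prune

[Figure: an EMM with nodes N1, N2, N3, N5 and N6 shown before and after the step "Delete N2"; after pruning, the graph contains N1, N3, N5 and N6. The arc labels (fractions such as 1/3, 1/2 and 1/6) cannot be matched to specific arcs in this transcript.]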
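A pruning step of this kind can be sketched as removing a node together with every arc that enters or leaves it. The code below is an assumption-laden illustration: the function names, the dictionary layout and all counts are hypothetical, and the slide does not say how the remaining transition probabilities are recomputed. Here they are simply re-derived from the surviving arc counts.

```python
def prune_node(node_counts, arc_counts, node):
    """Return copies of the count tables with `node` and its arcs removed."""
    kept_nodes = {n: c for n, c in node_counts.items() if n != node}
    kept_arcs = {
        (a, b): c
        for (a, b), c in arc_counts.items()
        if a != node and b != node
    }
    return kept_nodes, kept_arcs


def transition_probability(arc_counts, a, b):
    """P(b | a) computed from whatever arcs are currently present."""
    out = sum(c for (src, _), c in arc_counts.items() if src == a)
    return arc_counts.get((a, b), 0) / out if out else 0.0


# Hypothetical counts; only the node names follow the slide.
nodes = {"N1": 3, "N2": 2, "N3": 3, "N5": 2, "N6": 1}
arcs = {
    ("N1", "N2"): 2, ("N1", "N3"): 1, ("N2", "N3"): 2,
    ("N3", "N5"): 2, ("N3", "N6"): 1, ("N5", "N1"): 2,
}
nodes_after, arcs_after = prune_node(nodes, arcs, "N2")
```

Returning new dictionaries rather than mutating the inputs keeps the pre-prune model available for comparison.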

Page 16: Spatiotemporal Stream Mining Using EMM

Artificial Data

[Figure: plot of artificial data; x axis from −0.2 to 1.0.]

Page 17: Spatiotemporal Stream Mining Using EMM

EMM Advantages

Dynamic
Adaptable
Use of clustering
Learns rare event
Sublinear Growth Rate
Creation/evaluation quasi-real time
Distributed / Hierarchical extensions
Overlap Learning and Testing

Page 18: Spatiotemporal Stream Mining Using EMM

EMM Applications

Predict – Forecast future state values.
Evaluate (Score) – Assess degree of model compliance. Find the probability that a new observation belongs to the same class of data modeled by the given EMM.
Analyze – Report model characteristics concerning EMM.
Visualize – Draw graph
Probe – Report specific detailed information about a state (if available)

Page 19: Spatiotemporal Stream Mining Using EMM

EMM Results

Predicting Flooding
  Ouse and Derwent – River flow data from England
  http://www.nercwallingford.ac.uk/ih/nrfa/index.html
Rare Event Detection
  VoIP Traffic Data obtained at Cisco Systems
  Minnesota Traffic Data
Classification
  DNA/RNA Sequence Analysis

Page 20: Spatiotemporal Stream Mining Using EMM

Derwent River (UK)

[Figure: Derwent River sites labelled 28043, 28011, 28048, 28010, 28023 and 28117.]

[Figure: number of states in the model (y axis, 0 to 800) versus number of input data (x axis, total 1574), with one curve per threshold: 0.994, 0.995, 0.996, 0.997 and 0.998.]

Page 21: Spatiotemporal Stream Mining Using EMM

Sublinear Growth Rate

Number of states by similarity measure and threshold:

Data      Sim       Threshold
                    0.99   0.992  0.994  0.996  0.998
Derwent   Jaccard    156    190    268    389    667
          Dice        72     92    123    191    389
          Cosine      11     14     19     31     61
          Overlap      2      2      3      3      4
Ouse      Jaccard     56     66     81    105    162
          Dice        40     43     52     66    105
          Cosine       6      8     10     13     24
          Overlap      1      1      1      1      1

Page 22: Spatiotemporal Stream Mining Using EMM

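Prediction Error Rates

Normalized Absolute Ratio Error (NARE)

$$\mathrm{NARE} = \frac{\sum_{t=1}^{N} |O(t) - P(t)|}{\sum_{t=1}^{N} O(t)}$$

Root Mean Square (RMS)

$$\mathrm{RMS} = \sqrt{\frac{\sum_{t=1}^{N} \left(O(t) - P(t)\right)^2}{N}}$$

Here O(t) is the observed value and P(t) the predicted value at time t.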
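The two error measures can be computed as in the sketch below. It assumes the usual definitions: NARE is the sum of absolute errors divided by the sum of observed values, and RMS is the square root of the mean squared error, with O(t) the observed and P(t) the predicted series. The function names and the sample values are illustrative only.

```python
import math


def nare(observed, predicted):
    """Normalized Absolute Ratio Error: sum|O - P| / sum(O)."""
    num = sum(abs(o - p) for o, p in zip(observed, predicted))
    return num / sum(observed)


def rms(observed, predicted):
    """Root mean square error: sqrt(sum((O - P)^2) / N)."""
    n = len(observed)
    sq = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sq / n)


# Made-up water levels, only to exercise the functions.
observed = [2.0, 4.0, 4.0]
predicted = [1.0, 4.0, 6.0]
nare_value = nare(observed, predicted)
rms_value = rms(observed, predicted)
```

NARE is scale free because it divides by the total observed value, while RMS stays in the units of the data and penalizes large individual errors more heavily.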

Page 23: Spatiotemporal Stream Mining Using EMM

EMM Performance – Prediction (Ouse)

Model           NARE       RMS       No of States
RLF             0.321423   1.5389    -
EMM Th=0.95     0.068443   0.43774   20
EMM Th=0.99     0.046379   0.4496    56
EMM Th=0.995    0.055184   0.57785   92

Page 24: Spatiotemporal Stream Mining Using EMM

EMM Water Level Prediction – Ouse Data

[Figure: water level (m), 0 to 8, plotted against the input time series (1 to 667), comparing RLF Prediction, EMM Prediction and Observed values.]

Page 25: Spatiotemporal Stream Mining Using EMM

Rare Event

Rare - Anomalous – Surprising
Out of the ordinary
Not outlier detection
  No knowledge of data distribution
  Data is not static
  Must take temporal and spatial values into account
  May be interested in sequence of events
Ex: Snow in upstate New York is not rare
    Snow in upstate New York in June is rare
Rare events may change over time

Page 26: Spatiotemporal Stream Mining Using EMM

Rare Event Examples

The amount of traffic through a site in a particular time interval is extremely high or low.
The type of traffic (i.e. source IP addresses or destination addresses) is unusual.
Current traffic behavior is unusual based on recent previous traffic behavior.
Unusual behavior at several sites.

Page 27: Spatiotemporal Stream Mining Using EMM

Rare Event Detection Applications

Intrusion Detection
Fraud
Flooding
Unusual automobile/network traffic

Page 28: Spatiotemporal Stream Mining Using EMM

Our Approach

By learning what is normal, the model can predict what is not
Normal is based on likelihood of occurrence
Use EMM to build model of behavior
We view a rare event as:
  Unusual event
  Transition between event states which does not frequently occur.
Base rare event detection on determining events or transitions between events that do not frequently occur.
Continue learning

Page 29: Spatiotemporal Stream Mining Using EMM

EMMRare

The EMMRare algorithm indicates if the current input event is rare. Using a threshold occurrence percentage, the input event is determined to be rare if either of the following occurs:
  The frequency of the node at time t+1 is below this threshold
  The updated transition probability of the MC transition from the node at time t to the node at t+1 is below the threshold

Page 30: Spatiotemporal Stream Mining Using EMM

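Determining Rare

Occurrence Frequency (OFc) of a node Nc:

$$OF_c = \frac{CN_c}{\sum_i CN_i}$$

Normalized Transition Probability (NTPmn), from one state, Nm, to another, Nn:

$$NTP_{mn} = \frac{CL_{mn}}{\sum_i CN_i}$$

Here CNi is the count of observations assigned to node Ni and CLmn is the count of transitions from Nm to Nn.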

Page 31: Spatiotemporal Stream Mining Using EMM

EMMRare

Given:
• Rule#1: CNi <= thCN
• Rule#2: CLij <= thCL
• Rule#3: OFc <= thOF
• Rule#4: NTPmn <= thNTP

Input:
  Gt: EMM at time t
  i: Current state at time t
  R = {R1, R2, …, RN}: A set of rules
Output:
  At: Boolean alarm at time t
Algorithm:
  At = 1 if some rule Ri = True
  At = 0 if every rule Ri = False
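A rough sketch of the rule check follows. It assumes that the occurrence frequency of a node is its count divided by the total node count, that the normalized transition probability of an arc is its count divided by the same total, and that the alarm is raised when any selected rule fires (matching the "either of the following" wording of EMMRare). Only Rule#3 and Rule#4 are shown; the function names, thresholds and counts are hypothetical.

```python
def occurrence_frequency(node_counts, c):
    """OF_c: count of node c divided by the total of all node counts."""
    return node_counts[c] / sum(node_counts.values())


def normalized_transition_probability(node_counts, arc_counts, m, n):
    """NTP_mn: count of arc m -> n divided by the total of all node counts."""
    return arc_counts.get((m, n), 0) / sum(node_counts.values())


def emm_rare_alarm(node_counts, arc_counts, prev, cur, th_of, th_ntp):
    """Return 1 if Rule#3 or Rule#4 fires for the move prev -> cur, else 0."""
    rule3 = occurrence_frequency(node_counts, cur) <= th_of
    rule4 = normalized_transition_probability(
        node_counts, arc_counts, prev, cur
    ) <= th_ntp
    return int(rule3 or rule4)


# Hypothetical model state after 100 observations.
node_counts = {"N1": 90, "N2": 9, "N3": 1}
arc_counts = {
    ("N1", "N1"): 85, ("N1", "N2"): 5, ("N2", "N1"): 5,
    ("N2", "N2"): 3, ("N2", "N3"): 1,
}
common = emm_rare_alarm(node_counts, arc_counts, "N1", "N1", 0.05, 0.02)
rare = emm_rare_alarm(node_counts, arc_counts, "N2", "N3", 0.05, 0.02)
```

Since the model keeps learning, the counts (and therefore which events are flagged) change as the stream evolves.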

Page 32: Spatiotemporal Stream Mining Using EMM


VoIP Traffic Data


Page 33: Spatiotemporal Stream Mining Using EMM


Rare Event in Cisco Data


Page 34: Spatiotemporal Stream Mining Using EMM

Temporal Heat Map

Also called Temporal Chaos Game Representation (TCGR)
Temporal Heat Map (THM) is a visualization technique for streaming data derived from multiple sensors.
It is a two dimensional structure similar to an infinite table.
Each row of the table is associated with one sensor value.
Each column of the table is associated with a point in time.
Each cell within the THM is a color representation of the sensor value
Colors normalized (in our examples)
  0 – White
  0.5 – Blue
  1.0 – Red
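One way to turn a normalized sensor value into a cell color is sketched below. The three anchor colors (white at 0, blue at 0.5, red at 1.0) come from the slide; the linear interpolation between them, the RGB encoding and the function name `thm_color` are assumptions made for illustration.

```python
def thm_color(value):
    """Map a normalized sensor value in [0, 1] to an (R, G, B) tuple."""
    # Anchor colors from the slide: 0 -> white, 0.5 -> blue, 1.0 -> red.
    anchors = [
        (0.0, (255, 255, 255)),
        (0.5, (0, 0, 255)),
        (1.0, (255, 0, 0)),
    ]
    v = min(max(value, 0.0), 1.0)  # clamp out-of-range input
    for (x0, c0), (x1, c1) in zip(anchors, anchors[1:]):
        if v <= x1:
            # Assumed: linear blend between the two neighboring anchors.
            w = (v - x0) / (x1 - x0)
            return tuple(round(a + w * (b - a)) for a, b in zip(c0, c1))
    return anchors[-1][1]


# One THM column: one color per sensor value at a single time point.
column = [thm_color(x) for x in (0.0, 0.5, 1.0)]
```

Stacking one such column per time point gives the row-per-sensor, column-per-time layout described above.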

Page 35: Spatiotemporal Stream Mining Using EMM

Cisco – Internal VoIP Traffic Data

[Figure: VoIP traffic stream; axis labels: Time → and Values.]

• Complete Stream: CiscoEMM.png
• VoIP traffic data was provided by Cisco Systems and represents logged VoIP traffic in their Richardson, Texas facility from Mon Sep 22 12:17:32 2003 to Mon Nov 17 11:29:11 2003.

Page 36: Spatiotemporal Stream Mining Using EMM

Rare Event Detection

Minnesota DOT Traffic Data

[Figure: panels labelled Weekdays and Weekend.]

Detected unusual weekend traffic pattern

Page 37: Spatiotemporal Stream Mining Using EMM

TCGR Example

acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving Window

Counts per window:

             A     C     G     T
Pos 0-8      2     3     3     1
Pos 1-9      1     3     3     2
…
Pos 34-42    2     4     2     1

Normalized counts per window:

             A     C     G     T
Pos 0-8      0.4   0.6   0.6   0.2
Pos 1-9      0.2   0.6   0.6   0.4
…
Pos 34-42    0.4   0.8   0.4   0.2
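The count table for the moving window can be reproduced with a few lines of code. The sketch assumes a window of 9 positions that slides one position at a time, which is what the ranges Pos 0-8, Pos 1-9, …, Pos 34-42 suggest. The function name `window_counts` is illustrative, and the normalization behind the second table is not stated on the slide, so it is not reproduced here.

```python
from collections import Counter

# Sequence from the TCGR Example slide.
SEQ = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"


def window_counts(seq, width=9, alphabet="acgt"):
    """Count each symbol in every window of `width` positions, stride 1."""
    rows = []
    for start in range(len(seq) - width + 1):
        counts = Counter(seq[start:start + width])
        # One count vector per window, ordered as in `alphabet`.
        rows.append([counts[ch] for ch in alphabet])
    return rows


rows = window_counts(SEQ)
first, second, last = rows[0], rows[1], rows[-1]
```

Each row is one count vector, so the sequence of rows forms a stream of vectors over the window positions.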

Page 38: Spatiotemporal Stream Mining Using EMM


TCGR Example (cont’d)

TCGRs for Sub-patterns of length 1, 2, and 3


Page 39: Spatiotemporal Stream Mining Using EMM

TCGR Example (cont’d)

Window 0: Pos 0-8       acgtgcacg
Window 1: Pos 1-9       cgtgcacgt
Window 17: Pos 17-25    tccggaacc
Window 18: Pos 18-26    ccggaacca
Window 34: Pos 34-42    ccacgtcga

[Figure: TCGR with columns labelled A, C, G and T.]

Page 40: Spatiotemporal Stream Mining Using EMM

TCGR – Mature miRNA (Window=5; Pattern=3)

[Figure: TCGR panels for All Mature, Mus musculus, Homo sapiens and C. elegans; labelled patterns: ACG, CGC, GCG and UCG.]

Page 41: Spatiotemporal Stream Mining Using EMM


Research Approach

1. Represent potential miRNA sequence with TCGR sequence of count vectors

2. Create EMM using count vectors for known miRNA (miRNA stem loops, miRNA targets)

3. Predict unknown sequence to be miRNA (miRNA stem loop, miRNA target) based on normalized product of transition probabilities along clustering path in EMM

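Step 3 can be sketched as scoring the path that a sequence traces through the model. The slides do not say how the product of transition probabilities is normalized; the code below assumes a geometric mean (the product raised to the power 1/number of transitions) so that sequences of different lengths are comparable. The function name, the node labels and the probabilities are all hypothetical.

```python
import math


def path_score(path, trans_prob):
    """Geometric mean of the transition probabilities along `path`.

    `path` is the list of node labels visited by the sequence and
    `trans_prob` maps (from_node, to_node) to a probability.
    """
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    log_sum = 0.0
    for a, b in steps:
        p = trans_prob.get((a, b), 0.0)
        if p == 0.0:
            return 0.0  # an unseen transition zeroes the product
        log_sum += math.log(p)  # sum logs to avoid underflow on long paths
    return math.exp(log_sum / len(steps))


# Hypothetical model learned from known miRNA count vectors.
trans_prob = {("A", "B"): 0.5, ("B", "C"): 0.5, ("C", "A"): 1.0}
seen_path_score = path_score(["A", "B", "C"], trans_prob)
unseen_path_score = path_score(["A", "C"], trans_prob)
```

A sequence would then be predicted as miRNA when its score exceeds some cutoff, which would have to be chosen from training data.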

Page 42: Spatiotemporal Stream Mining Using EMM

Related Work 1

Predicted occurrence of pre-miRNA segments from a set of hairpin sequences
No assumptions about biological function or conservation across species.
Used SVMs to differentiate the structure of hairpin segments that contained pre-miRNAs from those that did not.
Sensitivity of 93.3%
Specificity of 88.1%

1 C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

Page 43: Spatiotemporal Stream Mining Using EMM

Preliminary Test Data 1

Positive Training: This dataset consists of 163 human pre-miRNAs with lengths of 62-119.
Negative Training: This dataset was obtained from protein coding regions of human RefSeq genes. As these are from coding regions it is likely that there are no true pre-miRNAs in this data. This dataset contains 168 sequences with lengths between 63 and 110 characters.
Positive Test: This dataset contains 30 pre-miRNAs.
Negative Test: This dataset contains 1000 randomly chosen sequences from coding regions.

1 C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

Page 44: Spatiotemporal Stream Mining Using EMM

TCGRs for Xue Training Data

[Figure: TCGR panels labelled POSITIVE and NEGATIVE.]

Page 45: Spatiotemporal Stream Mining Using EMM

TCGRs for Xue Test Data

[Figure: TCGR panels labelled POSITIVE and NEGATIVE.]

Page 46: Spatiotemporal Stream Mining Using EMM


Page 47: Spatiotemporal Stream Mining Using EMM

References

1) Margaret H. Dunham, Nathaniel Ayewah, Zhigang Li, Kathryn Bean, and Jie Huang, “Spatiotemporal Prediction Using Data Mining Tools,” Chapter XI in Spatial Databases: Technologies, Techniques and Trends, Yannis Manolopoulos, Apostolos N. Papadopoulos and Michael Gr. Vassilakopoulos, Editors, 2005, Idea Group Publishing, pp 251-271.

2) Margaret H. Dunham, Yu Meng, and Jie Huang, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.

3) Yu Meng, Margaret Dunham, Marco Marchetti, and Jie Huang, “Rare Event Detection in a Spatiotemporal Environment,” Proceedings of the IEEE Conference on Granular Computing, May 2006, pp 629-634.

4) Yu Meng and Margaret H. Dunham, “Online Mining of Risk Level of Traffic Anomalies with User's Feedbacks,” Proceedings of the IEEE Conference on Granular Computing, May 2006, pp 176-181.

5) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.

6) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.

7) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.

8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” 2009, accepted to appear LDM conference, 2009.