
Data Mining MTAT.03.183 (4AP = 6EAP)

Streams, time series

Jaak Vilo

2009 Fall

Summary so far

• Data preparation
• Machine learning
• Statistics/significance
• Large data – algorithmics
• Visualisation
• Queries/reporting, OLAP
• Different types of data
• Business value

Jaak Vilo and other authors, UT: Data Mining 2009

Streams, time series

• Time
• Sequence order and position
• Continuously arriving data

Wikipedia

• Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.


• In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.

Software

• RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning, also featuring data stream mining, learning time‐varying concepts, and tracking drifting concepts (if used in combination with its data stream mining plugin (formerly: concept drift plugin))

• MOA (Massive Online Analysis): free open‐source software specific for mining data streams with concept drift. It contains a prequential evaluation method, the EDDM concept drift methods, a reader of ARFF real datasets, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radius based functions. MOA supports bi‐directional interaction with Weka (machine learning).

Literature on Stream Mining

• http://www.csse.monash.edu.au/~mgaber/WResources.htm

Mining Data Streams

• What is stream data? Why Stream Data Systems?
• Stream data management systems: Issues and solutions
• Stream data cube and multidimensional OLAP analysis
• Stream frequent pattern analysis
• Stream classification
• Stream cluster analysis
• Research issues

November 25, 2009 Data Mining: Concepts and Techniques

Characteristics of Data Streams

• Data streams
  – Data streams: continuous, ordered, changing, fast, huge amount
  – Traditional DBMS: data stored in finite, persistent data sets
• Characteristics
  – Huge volumes of continuous data, possibly infinite
  – Fast changing and requires fast, real-time response
  – Data stream captures nicely our data processing needs of today
  – Random access is expensive: single scan algorithm (can only have one look)
  – Store only the summary of the data seen thus far
  – Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing

Stream Data Applications

• Telecommunication calling records
• Business: credit card transaction flows
• Network monitoring and traffic engineering
• Financial market: stock exchange
• Engineering & industrial processes: power supply & manufacturing
• Sensor, monitoring & surveillance: video streams, RFIDs
• Security monitoring
• Web logs and Web page click streams
• Massive data sets (even saved, but random access is too expensive)

DBMS versus DSMS

DBMS                                                             DSMS
Persistent relations                                             Transient streams
One-time queries                                                 Continuous queries
Random access                                                    Sequential access
“Unbounded” disk store                                           Bounded main memory
Only current state matters                                       Historical data is important
No real-time services                                            Real-time requirements
Relatively low update rate                                       Possibly multi-GB arrival rate
Data at any granularity                                          Data at fine granularity
Assume precise data                                              Data stale/imprecise
Access plan determined by query processor, physical DB design    Unpredictable/variable data arrival and characteristics

Ack. From Motwani’s PODS tutorial slides


Architecture: Stream Query Processing

[Figure: an SDMS (Stream Data Management System). Multiple streams feed a Stream Query Processor that uses Scratch Space (main memory and/or disk); the User/Application submits a Continuous Query and receives Results.]

Stream Data Mining Tasks

• On-line analysis of streams
• Clustering data streams
• Classification of data streams
• Mining frequent patterns in data streams
• Mining sequential patterns in data streams
• Mining partial periodicity in data streams
• Mining outliers and unusual patterns in data streams
• ……

Clustering on Streams

• K-means: not suitable for stream mining
• CluStream: assumes the shape of a cluster is always a circle
• DenStream: detects arbitrary-shape clusters in stream data

Frequent Pattern Mining (FPM) in data streams

• Frequent (/hot/top) patterns: items, item sets, or sequences occurring frequently in a database.

Issues:
– Limited memory
– Reading past data is impossible

Question: how far is it justified to mine only frequent patterns in a data stream?
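The two issues above (limited memory, no second look at past data) can be illustrated with a one-pass frequent-item counter. The sketch below uses the Misra–Gries algorithm, which is not named on the slide and is swapped in here purely as an example; the function name and the toy stream are assumptions.

```python
def misra_gries(stream, k):
    """One pass, at most k - 1 counters.

    Every item occurring more than n/k times in a stream of length n is
    guaranteed to be among the returned candidates; the stored counts are
    lower bounds on the true frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical toy stream: "a" makes up half of the 16 items.
candidates = misra_gries("abacabadabacabae", k=3)
```

The result is a candidate set, not an exact answer: confirming the true counts would need a second pass, which a stream does not allow.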

Infrequent pattern mining

Objective:
1. To find abnormal, surprising or “interesting” patterns in the data stream.
2. Mutual pattern mining.
3. Stream-specific item set mining.
4. Association rule mining among events of interest.

Application:
1. Text mining.
2. Distributed sensor networks.
3. Works well for evolving data streams.

Challenges in Stream Data Analysis

• Data volume is huge
• Need to remember recent and historical data
• Approaches to data reduction
• Need single linear scan algorithms
• Most existing algorithms and prototype systems are memory and CPU bound, and can only perform a single data mining function
• Desire to perform multiple analyses at the same time
• Occurrence of concept drifts where previous model is no longer valid
• Reduce the cost of learning where models need to be updated and replaced
• Require instant response

Loretta Auvil

Stream Data Reduction

• Challenges of “OLAP-ing” stream data
  – Raw data cannot be stored
  – Simple aggregates are not powerful enough
  – History shape and patterns at different levels are desirable
• MAIDS unique approach
  – A tilted time window to aggregate data at different points in time
  – A scalable multi-dimensional stream data cube that can aggregate a model of stream data efficiently without accessing the raw data

MAIDS Approach: Tilted Time Window

• Recent data is registered and weighted at a finer granularity than longer term data
• As the edge of a time window is reached, the finer granularity data is summarized and propagated to a coarser granularity
• Window is maintained automatically

[Figure: tilted time window on a time axis running from Past to Present, with segments labeled 7 days, 24 hrs, 4 qtrs, 15 minutes and 30 sec]
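The roll-up behaviour described above can be sketched as follows. This is a minimal illustration, not the MAIDS implementation: the class name, the use of plain sums as the summary, and the level sizes (4 quarters, 24 hours, 7 days, loosely taken from the figure labels) are all assumptions.

```python
class TiltedTimeWindow:
    """Keeps sums at several granularities; a full level rolls up into the next."""

    def __init__(self, capacities):
        # capacities[i] = number of slots level i holds before it is summarized.
        self.capacities = capacities
        self.levels = [[] for _ in capacities]

    def add(self, value, level=0):
        slots = self.levels[level]
        slots.append(value)
        if len(slots) < self.capacities[level]:
            return
        if level + 1 < len(self.levels):
            # Edge of this window reached: propagate one aggregate to the
            # next, coarser level and clear the finer slots.
            self.add(sum(slots), level + 1)
            slots.clear()
        elif len(slots) > self.capacities[level]:
            # Coarsest level: forget the oldest slot so memory stays bounded.
            slots.pop(0)


# Assumed levels: 4 quarter-hours -> 1 hour, 24 hours -> 1 day, keep 7 days.
window = TiltedTimeWindow([4, 24, 7])
for _ in range(197):          # 197 quarter-hour readings, each with value 1
    window.add(1)
```

After 197 readings the structure holds two daily sums, one hourly sum and one quarter-hour value, so at most 4 + 24 + 7 slots are ever stored regardless of stream length.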

MAIDS: Stream Mining Architecture

MAIDS is aimed to:
• Discover changes, trends and evolution characteristics in data streams
• Construct clusters and classification models from data streams
• Explore frequent patterns and similarities among data streams

Features of MAIDS

• General purpose tool for data stream analysis
• Processes high-rate and multi-dimensional data
• Adopts a flexible tilted time window framework
• Facilitates multi-dimensional analysis using a stream cube architecture
• Integrates multiple data mining functions
• Provides user-friendly interface: automatic analysis and on-demand analysis
• Facilitates setting alarms for monitoring
• Built in D2K as D2K modules and leveraged in the D2K Streamline tool

Statistics Query Engine

• Answers user queries on data statistics, such as count, max, min, average, regression, etc.
• Uses tilted time window
• Uses an efficient data structure, H-tree, for partial computation of data cubes
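Statistics such as count, max, min and average can be maintained in a single pass without storing the stream. The sketch below shows only that one-pass idea; it does not model the H-tree or the tilted time window, and the class and attribute names are assumptions.

```python
class RunningStats:
    """Count, min, max and average of a stream, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def update(self, x):
        self.count += 1
        self.total += x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def average(self):
        # Undefined for an empty stream; None signals "no data yet".
        return self.total / self.count if self.count else None


stats = RunningStats()
for reading in [3, 1, 4, 1, 5]:   # hypothetical sensor readings
    stats.update(reading)
```

Each update costs constant time and the state is four numbers, which is what makes such aggregates suitable as cell contents in a stream cube.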

Stream Data Classifier

• Builds models to make predictions
• Uses Naïve Bayesian classifier with boosting
• Uses tilted time window to track time-related info
• Sets alarm to monitor events

Stream Pattern Finder

• Finds frequent patterns with multiple time granularities
• Keeps precise/compressed history in tilted time window
• Mines only the item sets of interest using the FP-tree algorithm
• Mines evolution and dramatic changes of frequent patterns

Stream Data Clustering

• Two stages: micro-clustering and macro-clustering
• Uses micro-clustering to do incremental, online processing and maintenance
• Uses tilted time frame
• Detects outliers when new clusters are formed
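The online micro-clustering stage can be sketched with additive cluster summaries (count, linear sum, squared sum). This is an illustrative simplification, not the MAIDS code: the fixed `radius` threshold, the names, and the omission of the tilted time frame and of the offline macro-clustering stage are all assumptions.

```python
import math


class MicroCluster:
    """Additive summary: point count, per-dimension linear sum and squared sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.square_sum = [x * x for x in point]

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x


def update(clusters, point, radius):
    """Add point to the nearest micro-cluster within radius, else start a new one.

    Returns True when a new micro-cluster was created, i.e. the point is a
    possible outlier (or the first member of an emerging cluster).
    """
    best, best_dist = None, float("inf")
    for cluster in clusters:
        d = math.dist(cluster.centroid(), point)
        if d < best_dist:
            best, best_dist = cluster, d
    if best is not None and best_dist <= radius:
        best.absorb(point)
        return False
    clusters.append(MicroCluster(point))
    return True


clusters = []
points = [(0, 0), (0.1, 0), (5, 5), (0, 0.1), (5.1, 5)]   # hypothetical data
flags = [update(clusters, p, radius=1.0) for p in points]
```

Because the summaries are plain sums, each point is processed once and then discarded, and a later macro-clustering step could work on the centroids alone.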

Demonstration

Significant Advances in the Areas of Data Management and Mining

• Tilted-time window for multi-resolution modeling
• Multi-dimensional analysis using a stream cube architecture
• Efficient “one-look” stream data mining algorithms:
  – classification, frequent pattern analysis, clustering, and information visualization
• Integration of “one-look” approaches into one stream data mining platform so they can cooperate to discover patterns and surprising events in real-time
• Internationally recognized research leadership in the areas of data management, mining, and knowledge sharing
• Experience in development of robust software framework supporting advanced data mining and information visualization
• Experience in development of software environments supporting problem solving and evidence-based decision making

Knowledge Extraction from Streaming Text

Information extraction
• process of using advanced automated machine learning approaches
• to identify entities in text documents
• extract this information along with the relationships these entities may have in the text documents

This project demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed.

Challenges of Stream Data Processing

• Multiple, continuous, rapid, time-varying, ordered streams
• Main memory computations
• Queries are often continuous
  – Evaluated continuously as stream data arrives
  – Answer updated over time
• Queries are often complex
  – Beyond element-at-a-time processing
  – Beyond stream-at-a-time processing
  – Beyond relational queries (scientific, data mining, OLAP)
• Multi-level/multi-dimensional processing and data mining
  – Most stream data are at low-level or multi-dimensional in nature

Processing Stream Queries

• Query types
  – One-time query vs. continuous query (being evaluated continuously as stream continues to arrive)
  – Predefined query vs. ad-hoc query (issued on-line)
• Unbounded memory requirements
  – For real-time response, main memory algorithm should be used
  – Memory requirement is unbounded if one will join future tuples
• Approximate query answering
  – With bounded memory, it is not always possible to produce exact answers
  – High-quality approximate answers are desired
  – Data reduction and synopsis construction methods: sketches, random sampling, histograms, wavelets, etc.

Methodologies for Stream Data Processing

• Major challenges
  – Keep track of a large universe, e.g., pairs of IP addresses, not ages
• Methodology
  – Synopses (trade-off between accuracy and storage)
  – Use synopsis data structure, much smaller (O(log^k N) space) than their base data set (O(N) space)
  – Compute an approximate answer within a small error range (factor ε of the actual answer)
• Major methods
  – Random sampling
  – Histograms
  – Sliding windows
  – Multi-resolution model
  – Sketches
  – Randomized algorithms

Stream Data Processing Methods (1)

• Random sampling (but without knowing the total length in advance)
  – Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the elements seen so far in the stream. As the data stream flows, every new element has a certain probability (s/N) of replacing an old element in the reservoir.
• Sliding windows
  – Make decisions based only on recent data of sliding window size w
  – An element arriving at time t expires at time t + w
• Histograms
  – Approximate the frequency distribution of element values in a stream
  – Partition data into a set of contiguous buckets
  – Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket)
• Multi-resolution models
  – Popular models: balanced binary trees, micro-clusters, and wavelets
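Reservoir sampling as described on this slide can be sketched in a few lines. The function name and the optional `rng` parameter are assumptions added for illustration; N below is the number of elements seen so far, since the total stream length is not known in advance.

```python
import random


def reservoir_sample(stream, s, rng=random):
    """Return a uniform random sample of s elements from a stream of unknown length."""
    reservoir = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            # The first s elements fill the reservoir.
            reservoir.append(element)
        else:
            # j is uniform on 0..n-1, so j < s happens with probability s/n;
            # in that case j is also a uniformly chosen slot to overwrite.
            j = rng.randrange(n)
            if j < s:
                reservoir[j] = element
    return reservoir


# Seeded generator only so that repeated runs give the same sample.
sample = reservoir_sample(range(1000), s=10, rng=random.Random(0))
```

Memory use is fixed at s elements, and at any moment the reservoir is a valid sample of everything seen so far, so the stream can be cut off at any point.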

Stream Data Processing Methods (2)

• Sketches
  – Histograms and wavelets require multi-passes over the data but sketches can operate in a single pass
  – Frequency moments of a stream A = {a_1, …, a_N}, F_k:

    $$F_k = \sum_{i=1}^{v} m_i^k$$

    where v: the universe or domain size, m_i: the frequency of i in the sequence
  – Given N elements and v values, sketches can approximate F_0, F_1, F_2 in O(log v + log N) space

• Randomized algorithms
  – Monte Carlo algorithm: bound on running time but may not return correct result
  – Chebyshev’s inequality: let X be a random variable with mean μ and standard deviation σ

    $$P(|X - \mu| > k) \le \frac{\sigma^2}{k^2}$$

  – Chernoff bound: let X be the sum of independent Poisson trials X_1, …, X_n, with mean μ, and δ in (0, 1]

    $$P[X > (1 + \delta)\mu] < e^{-\mu\delta^2/4}$$

    The probability decreases exponentially as we move from the mean
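To make the frequency moments F_k from the sketches part of this slide concrete, the snippet below computes them exactly on a small in-memory sequence. This is the quantity a sketch approximates, not a sketch itself: it stores one counter per distinct value, which is exactly what a real stream cannot afford. The function name and the toy sequence are assumptions.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact F_k = sum over occurring values i of m_i ** k."""
    counts = Counter(stream)          # m_i for every value i that occurs
    # Values that never occur are not in the counter, so F_0 counts
    # distinct values (the usual convention 0 ** 0 = 0 for absent values).
    return sum(m ** k for m in counts.values())


sequence = "abracadabra"              # hypothetical toy stream
f0 = frequency_moment(sequence, 0)    # number of distinct values
f1 = frequency_moment(sequence, 1)    # length N of the sequence
f2 = frequency_moment(sequence, 2)    # repeat rate, grows with skew
```

Here the frequencies are a:5, b:2, r:2, c:1, d:1, so F_0 = 5, F_1 = 11 and F_2 = 25 + 4 + 4 + 1 + 1 = 35.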

Approximate Query Answering in Streams

• Sliding windows
  – Only over sliding windows of recent stream data
  – Approximation but often more desirable in applications
• Batched processing, sampling and synopses
  – Batched if update is fast but computing is slow
    – Compute periodically, not very timely
  – Sampling if update is slow but computing is fast
    – Compute using sample data, but not good for joins, etc.
  – Synopsis data structures
    – Maintain a small synopsis or sketch of data
    – Good for querying historical data
• Blocking operators, e.g., sorting, avg, min, etc.
  – Blocking if unable to produce the first output until seeing the entire input
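A sliding window turns a blocking operator such as avg into one that can answer at any time. The sketch below assumes a time-based window of size w in which an element arriving at time t expires at time t + w; the class name and the use of integer timestamps are assumptions.

```python
from collections import deque


class SlidingWindowAverage:
    """Average over the elements that arrived during the last w time units."""

    def __init__(self, w):
        self.w = w
        self.items = deque()   # (arrival_time, value), oldest first
        self.total = 0.0

    def add(self, t, value):
        self.items.append((t, value))
        self.total += value
        self._expire(t)

    def _expire(self, now):
        # An element that arrived at time t is dropped once now >= t + w.
        while self.items and self.items[0][0] + self.w <= now:
            _, old = self.items.popleft()
            self.total -= old

    def average(self):
        return self.total / len(self.items) if self.items else None


win = SlidingWindowAverage(w=3)
for t, v in [(0, 10), (1, 20), (2, 30), (3, 40)]:   # hypothetical readings
    win.add(t, v)
```

After the reading at time 3 arrives, the one from time 0 has expired, so the window holds 20, 30 and 40. The answer is approximate with respect to the whole stream but is available continuously and uses memory proportional to the window only.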

Projects on DSMS (Data Stream Management System)

• Research projects and system prototypes
  – STREAM (Stanford): A general-purpose DSMS
  – Cougar (Cornell): sensors
  – Aurora (Brown/MIT): sensor monitoring, dataflow
  – Hancock (AT&T): telecom streams
  – Niagara (OGI/Wisconsin): Internet XML databases
  – OpenCQ (Georgia Tech): triggers, incr. view maintenance
  – Tapestry (Xerox): pub/sub content-based filtering
  – Telegraph (Berkeley): adaptive engine for sensors
  – Tradebot (www.tradebot.com): stock tickers & streams
  – Tribeca (Bellcore): network monitoring
  – MAIDS (UIUC/NCSA): Mining Alarming Incidents in Data Streams

Stream Data Mining vs. Stream Querying

• Stream mining: a more challenging task in many cases
  – It shares most of the difficulties with stream querying
    – But often requires less “precision”, e.g., no join, grouping, sorting
  – Patterns are hidden and more general than querying
  – It may require exploratory analysis
    – Not necessarily continuous queries
• Stream data mining tasks
  – Multi-dimensional on-line analysis of streams
  – Mining outliers and unusual patterns in stream data
  – Clustering data streams
  – Classification of stream data

Concept drift

• In many applications, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift.
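One simple way to notice concept drift is to watch whether a model's recent error rate rises well above its earlier error rate. The sketch below is a deliberately naive two-window comparison, not the EDDM method or any specific published detector; the class name, window size and threshold are assumptions.

```python
from collections import deque


class WindowDriftDetector:
    """Flags drift when the recent error rate exceeds the older one by a margin."""

    def __init__(self, window=20, threshold=0.3):
        self.reference = deque(maxlen=window)   # older prediction outcomes
        self.recent = deque(maxlen=window)      # newest prediction outcomes
        self.threshold = threshold

    def add(self, error):
        """error is 1 if the model mispredicted the instance, else 0."""
        if len(self.recent) == self.recent.maxlen:
            # The oldest recent outcome ages into the reference window
            # before the bounded deque discards it.
            self.reference.append(self.recent[0])
        self.recent.append(error)
        if len(self.reference) < self.reference.maxlen:
            return False                         # not enough history yet
        ref_rate = sum(self.reference) / len(self.reference)
        new_rate = sum(self.recent) / len(self.recent)
        return new_rate - ref_rate > self.threshold


detector = WindowDriftDetector(window=20, threshold=0.3)
# Hypothetical outcomes: 10% errors while the concept is stable ...
stable = [detector.add(1 if i % 10 == 9 else 0) for i in range(60)]
# ... then the concept changes and every prediction is wrong.
drifted = [detector.add(1) for _ in range(20)]
```

A drift signal would typically trigger retraining or replacement of the model, which is the cost the earlier challenges slide asks to keep low.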

Episode Rules

• Association rules applied to sequences of events.
• Episode – set of event predicates and partial ordering on them

© Prentice Hall

Mika Klemettinen and Pirjo Moen, University of Helsinki/Dept of CS, Autumn 2001

Basics

• Association rules describe how things occur together in the data
  – E.g., "IF an alarm has certain properties, THEN it will have other given properties"
• Episode rules describe temporal relationships between things
  – E.g., "IF a certain combination of alarms occurs within a time period, THEN another combination of alarms will occur within a time period"

Course on Data Mining

Basics

[Figure: a Network Management System above a Switched Network (MSC nodes) and an Access Network (BSC and BTS nodes), with the label Alarms. Legend: MSC = mobile station controller, BSC = base station controller, BTS = base station transceiver.]

Basics

• As defined earlier, telecom data contains alarms:

  1234 EL1 PCM 940926082623 A1 ALARMTEXT..

  Fields, in order: alarm number, alarming network element, alarm type, date and time, alarm severity class
• Now we forget about relationships between attributes within alarms as with the association rules
• We just take the alarm number attribute, handle it here as event/alarm type and inspect the relationships between events/alarms

Episodes

• Partially ordered set of pages
• Serial episode – totally ordered with time constraint
• Parallel episode – partially ordered with time constraint
• General episode – partially ordered with no time constraint

DAG for Episode

[Figure: directed acyclic graph of an episode]

Page 46: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Mika Klemettinen and Pirjo Moen, University of Helsinki/Dept of CS, Autumn 2001

Basics

• Data:
  – The data is a set of events over a set R of event types
  – Every event is a pair (A, t), where
    • A ∈ R is the event type (e.g., alarm type)
    • t is an integer, the occurrence time of the event
  – An event sequence S on R is a triple (s, Ts, Te)
    • Ts is the starting time and Te is the ending time
    • Ts < Te are integers
    • s = (A1, t1), (A2, t2), …, (An, tn)
    • Ai ∈ R and Ts ≤ ti < Te for all i = 1, …, n
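The definition above can be written down as a small data structure. This is only a sketch: the class name `EventSequence` and its field names are made up for illustration, and the checks simply restate the conditions Ts < Te and Ts ≤ ti < Te.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSequence:
    """Hypothetical container for an event sequence S = (s, Ts, Te)."""
    events: tuple   # ((event_type, time), ...) ordered by time
    t_start: int    # Ts
    t_end: int      # Te

    def __post_init__(self):
        if not self.t_start < self.t_end:
            raise ValueError("Ts must be smaller than Te")
        times = [t for _, t in self.events]
        if times != sorted(times):
            raise ValueError("events must be ordered by time")
        if any(not (self.t_start <= t < self.t_end) for t in times):
            raise ValueError("every ti must satisfy Ts <= ti < Te")


# Two alarms of types D and C; Te is chosen just past the last event.
seq = EventSequence(events=(("D", 10), ("C", 20)), t_start=10, t_end=21)
```

Keeping Te strictly larger than the last occurrence time matches the half-open condition Ts ≤ ti < Te.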

Page 47: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Basics

• Example alarm data sequence:

  0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150

  D C A B D A B C A D C A B D A

• Here:
  – A, B, C and D are event (or here alarm) types
  – 10…150 are occurrence times
  – s = (D, 10), (C, 20), …, (A, 150)
  – Ts (starting time) = 10 and Te (ending time) = 150

• Note: There need not be an event in every time slot!

Page 48: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Basics

• Episodes:
  – An episode is a pair (V, ≤)
    • V is a collection of event types, e.g., alarm types
    • ≤ is a partial order on V
  – Given a sequence S of alarms, an episode α = (V, ≤) occurs within S if there is a way of satisfying the event types (e.g., alarm types) in V using the alarms of S so that the partial order ≤ is respected
  – Intuitively: episodes consist of alarms that have certain properties and occur in a certain partial order

Page 49: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Basics

• The most useful partial orders are:
  – Total orders
    • The predicates of each episode have a fixed order
    • Such episodes are called serial (or "ordered")
  – Trivial partial orders
    • The order of predicates is not considered
    • Such episodes are called parallel (or "unordered")

• Complicated?
  – Not really, let's take some clarifying examples

Page 50: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Basics

• Examples:

[Figure: three episode graphs over event types A, B and C: a serial episode, a parallel episode, and a more complex episode with serial and parallel parts]

Page 51: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• The name of the WINEPI method comes from the technique it uses: a sliding window

• Intuitively:
  – A window is slid through the event-based data sequence
  – Each window "snapshot" is like a row in a database
  – The collection of these "snapshots" forms the rows in the database

• Complicated?
  – Not really, let's take a clarifying example

Page 52: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• Example alarm data sequence:

  0 10 20 30 40 50 60 70 80 90

  D C A B D A B C

• The window width is 40 seconds, last point excluded
• The first/last window contains only the first/last event

Page 53: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• Formally, given a set E of event types, an event sequence S = (s, Ts, Te) is an ordered sequence of events eventi such that eventi ≤ eventi+1 for all i = 1, …, n-1, and Ts ≤ eventi < Te for all i = 1, …, n

[Figure: time line from Ts to Te with events event1, event2, event3, …, eventn at times t1, t2, t3, …, tn]

Page 54: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• Formally, a window on event sequence S is an event sequence Sw = (w, ts, te), where ts < Te, te > Ts, and w consists of those pairs (event, t) from s where ts ≤ t < te

• The value te - ts is called the window width, W

[Figure: time line from Ts to Te with events event1, …, eventn at times t1, …, tn and a window of width W from ts to te]

Page 55: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• By definition, the first and the last windows on a sequence extend outside the sequence, so that the first window contains only the first time point of the sequence, and the last window only the last time point

[Figure: time line from Ts to Te with events event1, …, eventn at times t1, …, tn; the first window of width W ends just after Ts and the last window starts just before Te]

Page 56: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• The frequency (cf. support with association rules) of an episode α is the fraction of windows in which the episode occurs, i.e.,

$$\mathrm{fr}(\alpha, S, W) = \frac{|\{S_w \in W(S, W) \mid \alpha \text{ occurs in } S_w\}|}{|W(S, W)|}$$

where W(S, W) is the set of all windows Sw of sequence S such that the window width is W

Page 57: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• When searching for the episodes, a frequency threshold (cf. support threshold with association rules) min_fr is used

• Episode α is frequent if fr(α, s, win) ≥ min_fr, i.e., "if the frequency of α is at least the minimum frequency threshold within the data sequence s and with window width win"

• F(s, win, min_fr): a collection of frequent episodes in s with respect to win and min_fr

• Apriori trick holds: if an episode α is frequent in an event sequence s, then all its subepisodes β ⪯ α are frequent

Page 58: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• Formally, an episode rule is an expression β ⇒ γ, where β and γ are episodes such that β is a subepisode of γ

• An episode β is a subepisode of γ (β ⪯ γ) if the graph representation of β is a subgraph of the representation of γ

[Figure: graph of β with nodes A and B; graph of γ with nodes A, B and C]

Page 59: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• The fraction

$$\frac{\mathrm{fr}(\gamma, S, W)}{\mathrm{fr}(\beta, S, W)} = \frac{\text{frequency of the whole episode}}{\text{frequency of the LHS episode}}$$

is the confidence of the WINEPI episode rule β ⇒ γ

• The confidence can be interpreted as the conditional probability of the whole of γ occurring in a window, given that β occurs in it

Page 60: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• Intuitively:
  – WINEPI rules are like association rules, but with an additional time aspect: if events (alarms) satisfying the rule antecedent (left-hand side) occur in the right order within W time units, then also the rule consequent (right-hand side) occurs in the location described by ≤, also within W time units

  antecedent ⇒ consequent [window width] (f, c)

Page 61: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Algorithm

• Input: A set R of event/alarm types, an event sequence s over R, a set E of episodes, a window width win, and a frequency threshold min_fr
• Output: The collection F(s, win, min_fr)
• Method:
  1. compute C1 := {α ∈ E | |α| = 1};
  2. i := 1;
  3. while Ci ≠ ∅ do
  4. (*) compute Fi(s, win, min_fr) := {α ∈ Ci | fr(α, s, win) ≥ min_fr};
  5. i := i + 1;
  6. (**) compute Ci := {α ∈ E | |α| = i, and β ∈ F|β|(s, win, min_fr) for all β ∈ E, β ≺ α};

  (*) = database pass, (**) = candidate generation
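The level-wise loop above can be sketched for parallel episodes only. The function name `winepi_parallel` is hypothetical, episodes are represented as sets of event types, and the frequency is recomputed from scratch for every candidate (no incremental counting). The sample data assumes the eight alarms of the running example occur at times 10, 20, …, 80.

```python
from itertools import combinations

# Assumed sample data: one alarm every 10 seconds, from 10 to 80.
EVENTS = [("D", 10), ("C", 20), ("A", 30), ("B", 40),
          ("D", 50), ("A", 60), ("B", 70), ("C", 80)]


def winepi_parallel(events, width, step, min_fr):
    """Return {episode: frequency} for all frequent parallel episodes."""
    first, last = events[0][1], events[-1][1]
    # The first window holds only the first event, the last only the last one.
    starts = range(first - width + step, last + 1, step)
    windows = [frozenset(a for a, t in events if s <= t < s + width)
               for s in starts]

    def fr(episode):
        return sum(episode <= w for w in windows) / len(windows)

    types = sorted({a for a, _ in events})
    candidates = [frozenset([a]) for a in types]          # C1
    frequent = {}
    while candidates:
        # Database pass: keep the candidates that reach the threshold.
        level = {c: fr(c) for c in candidates}
        level = {c: f for c, f in level.items() if f >= min_fr}
        frequent.update(level)
        # Candidate generation: every smaller sub-episode must be frequent.
        size = len(candidates[0]) + 1
        candidates = [frozenset(c) for c in combinations(types, size)
                      if all(frozenset(sub) in level
                             for sub in combinations(c, size - 1))]
    return frequent


result = winepi_parallel(EVENTS, width=40, step=10, min_fr=0.4)
```

Checking only the sub-episodes one size smaller is enough here, because those were themselves generated from frequent smaller episodes.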

Page 62: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Algorithm

• First problem: given a sequence and an episode, find out whether the episode occurs in the sequence

• Finding the number of windows containing an occurrence of the episode can be reduced to this

• Successive windows have a lot in common
• How to use this?
  – An incremental algorithm
  – Same idea as for association rules
  – A candidate episode has to be a combination of two episodes of smaller size
  – Parallel episodes, serial episodes

Page 63: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Algorithm

• Parallel episodes:
  – For each candidate α maintain a counter α.event_count: how many events of α are present in the window
  – When α.event_count becomes equal to |α|, indicating that α is entirely included in the window, save the starting time of the window in α.inwindow
  – When α.event_count decreases again, increase the field α.freq_count by the number of windows where α remained entirely in the window

• Serial episodes: use state automata
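As a rough illustration of the automaton idea for serial episodes, the sketch below checks a single window from scratch. It is not the incremental WINEPI procedure, which keeps automata alive while the window slides; the function name `occurs_serial` is hypothetical and events are assumed to have strictly increasing times.

```python
def occurs_serial(window_events, episode):
    """Return True if the event types in `episode` occur in the given order.

    `window_events` is a list of (event_type, time) pairs ordered by time;
    `episode` is a sequence of event types, e.g. ["D", "A"].
    """
    state = 0  # number of episode positions matched so far
    for event_type, _time in window_events:
        if state < len(episode) and event_type == episode[state]:
            state += 1  # the automaton advances to its next state
    return state == len(episode)  # accepting state reached?


window = [("D", 10), ("C", 20), ("A", 30)]
d_then_a = occurs_serial(window, ["D", "A"])
a_then_d = occurs_serial(window, ["A", "D"])
```

Advancing greedily on the first matching event is sufficient for deciding whether a serial episode occurs at all in the window.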

Page 64: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• Example alarm data sequence:

  0 10 20 30 40 50 60 70 80 90

  D C A B D A B C

• The window width is 40 secs, movement step 10 secs
• The length of the sequence is 70 secs (10–80)

Page 65: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• By sliding the window, we'll get 11 windows (U1–U11):

[Figure: windows U1, U2, …, U11 sliding over the time line]

  0 10 20 30 40 50 60 70 80 90

  D C A B D A B C

• Frequency threshold is set to 40%, i.e., an episode has to occur in at least 5 of the 11 windows

Page 66: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

Page 67: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• Suppose that the task is to find all parallel episodes:
  – First, create singletons, i.e., parallel episodes of size 1 (A, B, C, D)
  – Then, recognize the frequent singletons (here all are)
  – From those frequent episodes, build candidate episodes of size 2: AB, AC, AD, BC, BD, CD
  – Then, recognize the frequent parallel episodes (here all are)
  – From those frequent episodes, build candidate episodes of size 3: ABC, ABD, ACD, BCD
  – When recognizing the frequent episodes, only ABD occurs in more than four windows
  – There are no candidate episodes of size four

Page 68: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• Episode frequencies and example rules with WINEPI:

  D : 73%
  C : 73%
  A : 64%
  B : 64%
  D C : 45%
  D A : 55%
  D B : 45%
  C A : 45%
  C B : 45%
  A B : 45%
  D A B : 45%

  D ⇒ A [40] (55%, 75%)
  D A ⇒ B [40] (45%, 82%)
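A few of these numbers can be recomputed with a short sketch. It assumes the eight alarms occur at times 10, 20, …, 80, uses a window width of 40 and a step of 10, and treats an episode as a set of event types (parallel episodes only); the helper names are hypothetical.

```python
# Assumed sample data: one alarm every 10 seconds, from 10 to 80.
EVENTS = [("D", 10), ("C", 20), ("A", 30), ("B", 40),
          ("D", 50), ("A", 60), ("B", 70), ("C", 80)]


def window_contents(events, width, step):
    """Event types seen in each window [start, start + width)."""
    first, last = events[0][1], events[-1][1]
    # First window ends just after the first event; last starts at the last.
    starts = range(first - width + step, last + 1, step)
    return [{a for a, t in events if s <= t < s + width} for s in starts]


def frequency(episode, windows):
    """Fraction of windows containing every event type of the episode."""
    return sum(set(episode) <= w for w in windows) / len(windows)


windows = window_contents(EVENTS, width=40, step=10)
fr_d = frequency("D", windows)        # 8 of 11 windows
fr_da = frequency("DA", windows)      # 6 of 11 windows
fr_dab = frequency("DAB", windows)    # 5 of 11 windows
conf_d_a = fr_da / fr_d               # confidence of D => A
```

Under these assumptions the sketch yields 11 windows, frequencies of about 73%, 55% and 45% for D, D A and D A B, and a confidence of 75% for D ⇒ A.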

Page 69: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI: Experimental Results

• Data:
  – Alarms from a telecommunication network
  – 73 000 events (7 weeks), 287 event types
  – Parallel and serial episodes
  – Window widths (W) 10–120 seconds
  – Window movement = W/10
  – min_fr = 0.003 (0.3%), frequent: about 100 occurrences
  – 90 MHz Pentium, 32 MB memory, Linux operating system. The data resided in a 3.0 MB flat text file

Page 70: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI: Experimental Results

  Window       Serial episodes         Parallel episodes
  width (s)    #frequent   time (s)    #frequent   time (s)
   10             16          31          10           8
   20             31          63          17           9
   40             57         117          33          14
   60             87         186          56          15
   80            145         271          95          21
  100            245         372         139          21
  120            359         478         189          22

Page 71: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

WINEPI Approach

• One shortcoming in the WINEPI approach:
  – Consider that two alarms of type A and one alarm of type B occur in a window
  – Does the parallel episode consisting of A and B appear once or twice?
  – If once, then with which alarm of type A?

  0 10 20 30 40 50 60 70 80 90

  D C A B D A B C

Page 72: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

MINEPI Approach

• Alternative approach to discovery of episodes
  – No sliding windows
  – For each potentially interesting episode, find out the exact occurrences of the episode

• Advantages: easy to modify time limits, several time limits for one rule:
  "If A and B occur within 15 seconds, then C follows within 30 seconds"

• Disadvantages: uses a lot of space

Page 73: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

MINEPI Approach

• Formally, given an episode α and an event sequence S, the interval [ts, te] is a minimal occurrence of α in S,
  – If α occurs in the window corresponding to the interval
  – If α does not occur in any proper subinterval

• The set of minimal occurrences of an episode α in a given event sequence is denoted by mo(α):

  mo(α) = { [ts, te] | [ts, te] is a minimal occurrence of α }

Page 74: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

MINEPI Approach

• Example: Parallel episode β consisting of event types A and B has three minimal occurrences in s: {[30,40], [40,60], [60,70]}; γ has one occurrence in s: {[60,80]}

[Figure: graph of β with nodes A and B; graph of γ with nodes A, B and C]

  D C A B D A B C

  0 10 20 30 40 50 60 70 80 90
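The minimal occurrences of a parallel episode can be found with a simple scan. The sketch below is a brute-force illustration, not the MINEPI algorithm itself: the function name `minimal_occurrences` is hypothetical, the episode is a set of event types, and the eight alarms are assumed to occur at times 10, 20, …, 80.

```python
# Assumed sample data: one alarm every 10 seconds, from 10 to 80.
EVENTS = [("D", 10), ("C", 20), ("A", 30), ("B", 40),
          ("D", 50), ("A", 60), ("B", 70), ("C", 80)]


def minimal_occurrences(events, episode_types):
    """Minimal occurrences [ts, te] of a parallel episode (a set of types)."""
    wanted = set(episode_types)
    candidates = []
    for i, (start_type, ts) in enumerate(events):
        if start_type not in wanted:
            continue
        seen = set()
        # Extend to the right until every wanted type has been seen once.
        for event_type, te in events[i:]:
            if event_type in wanted:
                seen.add(event_type)
            if seen == wanted:
                candidates.append((ts, te))
                break
    # Drop intervals that properly contain another candidate interval.
    return [c for c in candidates
            if not any(o != c and c[0] <= o[0] and o[1] <= c[1]
                       for o in candidates)]


mo_ab = minimal_occurrences(EVENTS, {"A", "B"})
```

With the assumed data this gives the intervals [30,40], [40,60] and [60,70] listed in the example.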

Page 75: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

MINEPI Approach

• Informally, a MINEPI episode rule gives the conditional probability that a certain combination of events (alarms) occurs within some time bound, given that another combination of events (alarms) has occurred within a time bound

• Formally, an episode rule is β [win1] ⇒ γ [win2]
  • β and γ are episodes such that β ⪯ γ (β is a subepisode of γ)
  • If episode β has a minimal occurrence at interval [ts, te] with te - ts ≤ win1, then episode γ occurs at interval [ts, t'e] for some t'e such that t'e - ts ≤ win2

Page 76: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Pattern Discovery

1. Choose the language (formalism) to represent the patterns (search space)

2. Choose the rating for patterns, to tell which is "better" than others

3. Design an algorithm that finds the best patterns from the pattern class, fast.

Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 1998;5(2):279-305.

Page 77: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Level 0: ATCGCTGAATTCCAATGTG

[Figure: Levels 1 to 6 of DNA structure]

The eukaryotic genome can be thought of as six levels of DNA structure.

The loops at Level 4 range from 0.5 kb to 100 kb in length.

If these loops were stabilized, then the genes inside the loop would not be expressed.

Page 78: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

DNA determines function (?)

• DNA: GenBank / EMBL Bank; 4 nucleotides
• Protein: SwissProt/TrEMBL; 20+ amino acids (3 nt → 1 AA)
• Structure: PDB/Molecular Structure Database
• Function?

Page 79: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

A Simple Gene

[Figure: a simple gene with upstream/promoter and downstream regions, panels A, B and C; DNA strands ATCGAAAT / TAGCTTTA, + modifications]

Page 80: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Species and individuals

• Animals, plants, fungi, bacteria, …

• Species

• Individuals

www.tolweb.org

Page 81: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying
Page 82: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Gene Regulatory Signal Finding

Transcription Factor

Transcription Factor Binding Site

Goal: Detect Transcription Factor Binding Sites.

Eleazar Eskin: Columbia Univ.

Page 83: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

TGTTCTTTCTTCTTTCATACATCCTTTTCCTTTTTTTCCTTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCATTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTACATTGCTTCTTCTTCGATTGCTTCAAAGTAGTTCGTGAATCATCCTTCAATGCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAAGCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAAGTGCTGCACCTGCGCTGTCTTGCTAATGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGACATGGGCGGCGTCTTCTTCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTTTCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTTCTCTGATGATCGTCATCTTTCACTGATCTGATGTTCCTGTGCCCTATCTATATCATCTCAAAGTTCACCTTTGCCACTTTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTACTTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTACTTTTTTCACTCGATGAGCTATAAGAGTTTTCCACTTTTAGATCGTGGCTGGGCTTATATTACGGTGTGATGAGGGCGCTTGAAAAGATTTTTTCATCTCACAAGCGACGAGGGCCCGAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTCTTAGAAGATAAAGTAGTGAATTACAATAGATTCGATAC

Page 84: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Patterns: AT

Page 85: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Patterns: [AT][ACT]AT
IUPAC:    W    H    AT
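The correspondence between the IUPAC notation and the character-group notation can be sketched with regular expressions. The helper names `iupac_to_regex` and `match_positions` are hypothetical, and the short test sequence is made up for illustration; a lookahead is used so that overlapping matches are also reported.

```python
import re

# Standard IUPAC nucleotide codes written as regular-expression groups.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as 'WHAT' into '[AT][ACT]AT'."""
    return "".join(IUPAC[symbol] for symbol in pattern)


def match_positions(pattern, sequence):
    """Start positions (0-based) of all matches, overlapping ones included."""
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]


regex_text = iupac_to_regex("WHAT")
positions = match_positions("WHAT", "TTATAAAT")  # made-up sequence
```

Counting the positions per upstream sequence gives the occurrence numbers that a pattern rating can then be based on.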

Page 86: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Cluster of co-expressed genes, pattern discovery in regulatory regions

600 basepairs

Expression profiles

Retrieve

Upstream regions

Find patterns over-represented within cluster

Genome Research 1998; ISMB (Intelligent Systems in Mol. Biol.) 2000

Page 87: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Binomial or hypergeometric distribution

• Background: ALL upstream sequences; the pattern occurs in 5 out of 25, p = 0.2

• Cluster: the pattern occurs 3 times in 6 sequences

• P(3, 6, 0.2) is the probability of having at least 3 matches in 6 sequences

• P(3, 6, 0.2) = 0.0989
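The value 0.0989 can be checked with the binomial distribution, reading P(3, 6, 0.2) as the probability of at least 3 matches among 6 sequences when each sequence matches independently with probability 0.2. The function name `binomial_tail` is hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


p_background = 5 / 25          # pattern matches 5 of 25 background sequences
p_value = binomial_tail(3, 6, p_background)
print(round(p_value, 4))       # 0.0989
```

The hypergeometric alternative would instead treat the cluster as a sample drawn without replacement from the background set.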

Page 88: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying
Page 89: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

Pattern vs cluster "strength"

The pattern probability vs. the average silhouette for the cluster

The same for randomised clusters

Vilo et.al. ISMB 2000

Page 90: (4AP 6EAP) - ut · • RapidMiner: free open‐source software for knowledge discovery, data mining, and machine learning also featuring data stream mining, learning time‐varying

The most improbable patterns from best clusters

  Pattern              Probability  Cluster  Occurrences  Total nr of   K in
                                    size     in cluster   occurrences   K-means
  AAAATTTT             2.59E-43     96       72           830           60
  ACGCG                6.41E-39     96       75           1088          50
  ACGCGT               5.23E-38     94       52           387           40
  CCTCGACTAA           5.43E-38     27       18           23            220
  GACGCG               7.89E-31     86       40           284           38
  TTTCGAAACTTACAAAAAT  2.08E-29     26       14           18            450
  TTCTTGTCAAAAAGC      2.08E-29     26       14           18            325
  ACATACTATTGTTAAT     3.81E-28     22       13           18            280
  GATGAGATG            5.60E-28     68       24           83            84
  TGTTTATATTGATGGA     1.90E-27     24       13           18            220
  GATGGATTTCTTGTCAAAA  5.04E-27     18       12           18            500
  TATAAATAGAGC         1.51E-26     27       13           18            300
  GATTTCTTGTCAAA       3.40E-26     20       12           18            700
  GATGGATTTCTTG        3.40E-26     20       12           18            875
  GGTGGCAA             4.18E-26     40       20           96            180
  TTCTTGTCAAAAAGCA     5.10E-26     29       13           18            250
  CGAAACTTACAAA        5.10E-26     29       13           18            290
  GAAACTTACAAAAATAAA   7.92E-26     21       12           18            650
  TTTGTTTATATTG        1.74E-25     22       12           18            600
  ATCAACATACTATTGT     3.62E-25     23       12           18            375
  ATCAACATACTATTGTTA   3.62E-25     23       12           18            625
  GAACGCGCG            4.47E-25     20       11           13            260
  GTTAATTTCGAAAC       7.23E-25     24       12           18            400
  GGTGGCAAAA           3.37E-24     33       14           31            475
  ATCTTTTGTTTATATTGA   7.19E-24     19       11           18            675
  TTTGTTTATATTGATGGA   7.19E-24     19       11           18            475
  GTGGCAAA             1.14E-23     28       18           137           725

Vilo et al., ISMB 2000

Page 91:

GGTGGCAA - proteasome associated control element

YOR261C  YOR261C  RPN8  protein degradation  26S proteasome regulatory subunit  S0005787  1
YDL020C  YDL020C  RPN4  protein degradation, ubiquitin  26S proteasome subunit  S0002178  1
YDL007W  YDL007W  RPT2  protein degradation  26S proteasome subunit  S0002165  1
YDL147W  YDL147W  RPN5  protein degradation  26S proteasome subunit  S0002306  1
YOL038W  YOL038W  PRE6  protein degradation  20S proteasome subunit (alpha4)  S0005398  1
YKL145W  YKL145W  RPT1  protein degradation, ubiquitin  26S proteasome subunit  S0001628  1
YDL097C  YDL097C  RPN6  protein degradation  26S proteasome regulatory subunit  S0002255  1
YDR394W  YDR394W  RPT3  protein degradation  26S proteasome subunit  S0002802  1
YBR173C  YBR173C  UMP1  protein degradation, ubiquitin  20S proteasome maturation factor  S0000377  1
YER012W  YER012W  PRE1  protein degradation  20S proteasome subunit C11 (beta4)  S0000814  1
YPR108W  YPR108W  RPN7  protein degradation  26S proteasome regulatory subunit  S0006312  1
YOR117W  YOR117W  RPT5  protein degradation  26S proteasome regulatory subunit  S0005643  1
YJL001W  YJL001W  PRE3  protein degradation  20S proteasome subunit (beta1)  S0003538  1
YPR103W  YPR103W  PRE2  protein degradation  20S proteasome subunit (beta5)  S0006307  1
YOR157C  YOR157C  PUP1  protein degradation  20S proteasome subunit (beta2)  S0005683  1
YGL048C  YGL048C  RPT6  protein degradation  26S proteasome regulatory subunit  S0003016  1
YHR200W  YHR200W  RPN10  protein degradation  26S proteasome subunit  S0001243  1
YML092C  YML092C  PRE8  protein degradation  20S proteasome subunit Y7 (alpha2)  S0004557  1
YIL075C  YIL075C  RPN2  tRNA processing  26S proteasome subunit  S0001337  1
YMR314W  YMR314W  PRE5  protein degradation  20S proteasome subunit (alpha6)  S0004931  1
YGR253C  YGR253C  PUP2  protein degradation  20S proteasome subunit (alpha5)  S0003485  1
YGR135W  YGR135W  PRE9  protein degradation  20S proteasome subunit Y13 (alpha3)  S0003367  1
YFR004W  YFR004W  RPN11  transcription  putative global regulator  S0001900  1
YOR259C  YOR259C  RPT4  protein degradation  26S proteasome regulatory subunit  S0005785  1
YFR052W  YFR052W  RPN12  protein degradation  26S proteasome regulatory subunit  S0001948  1
YFR050C  YFR050C  PRE4  protein degradation  proteasome subunit, B type  S0001946  1
YGL011C  YGL011C  SCL1  protein degradation  20S proteasome subunit YC7ALPHA/Y8  S0002979  1
YDR427W  YDR427W  RPN9  protein degradation  26S proteasome regulatory subunit  S0002835  1
YOR362C  YOR362C  PRE10  protein degradation  20S proteasome subunit C1 (alpha7)  S0005889  1
YBL041W  YBL041W  PRE7  protein degradation  20S proteasome subunit  S0000137  1
YER021W  YER021W  RPN3  protein degradation  26S proteasome regulatory subunit  S0000823  1
YER094C  YER094C  PUP3  protein degradation  20S proteasome subunit (beta3)  S0000896  1
YGR270W  YGR270W  YTA7  protein degradation  26S proteasome subunit; ATPase  S0003502  1
YHR027C  YHR027C  RPN1  protein degradation  26S proteasome regulatory subunit  S0001069  1
YER047C  YER047C  SAP1  mating type switching  AAA family protein  S0000849  1
YGR232W  YGR232W  unknown  unknown  S0003464  1

Page 92:

>YAL036C chromo=1 coord=(76154-75048(C)) start=-600 end=+2 seq=(76152-76754)

TGTTCTTTCTTCTTCTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTAGTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTGCTTCTTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGCACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGCTGCTTTCTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCGGCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGC

101 sequences relative to ORF start
YGR128C + 100
GTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTTCACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTTCAATGGGCTTAAAGCTTGAAAAATTTTTTCACATCACAAGCGACGAGGGCCCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGATATTACGGTGTGATGAGGGCGCAATGATAGGAAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_
>YAL025C chromo=1 coord=(101147-100230(C)) start=-600 end=+2 seq=(101145-101747)
CTTAGAAGATAAAGTAGTGAATTACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGGGTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACCACGAATTGCTGAGTAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTATCCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTTGTAAAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCATACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAGAATTTATAATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTTTTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATGCAGTAGGGTAATAAACCTTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTTTCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGATATTGCATTGCTTAGTTCTTTCTTTTGACAGTGTTCTCTTCAGTACATAACTACAACGGTTAGAATACAACGAGGAT_ATG_

...
>YBR084W chromo=2 coord=(411012-413936) start=-600 end=+2 seq=(410412-411014)
CCATGTATCCAAGACCTGCTGAAGATGCTTACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTTTCGCAGCTGTTATTATCATCACCCCAGCATTACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACGTATCTTCGTAGTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTCAAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTTCTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGACGCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTCACTTCAACGGACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCAGCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACATCAAAAAACAACTTTCATTACTGTGATTCTCTCAGTCTGTTCATTTGTCAGATATTTAAGGCTAAAAGGAA_ATG_

GATGAG.T     1:52/70  2:453/508  R:7.52345  BP:1.02391e-33
G.GATGAG.T   1:39/49  2:193/222  R:13.244   BP:2.49026e-33
AAAATTTT     1:63/77  2:833/911  R:4.95687  BP:5.02807e-32
TGAAAA.TTT   1:45/53  2:333/350  R:8.85687  BP:1.69905e-31
TG.AAA.TTT   1:53/61  2:538/570  R:6.45662  BP:3.24836e-31
TG.AAA.TTTT  1:40/43  2:254/260  R:10.3214  BP:3.84624e-30
TGAAA..TTT   1:54/65  2:608/645  R:5.82106  BP:1.0887e-29
...

GATGAG.T
TGAAA..TTT

Page 93:

Sequence patterns: the basis of SPEXS

A G A AT C GC C C

GCAT (4 positions)

GCATA (3 positions)

GCATA.C

Page 94:

SPEXS - substrings

• enqueue( Q, empty pattern (occurs everywhere) )

• while P = dequeue( Q )
  – check all positions of P.pos
  – for every character c in P.pos
    • create pattern Pc
    • advance all positions in Pc.pos by 1
    • enqueue( Q, Pc )
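The loop above can be sketched in Python. This is a minimal illustration, not the SPEXS implementation: the function name `frequent_substrings` and the parameters `min_count` and `max_len` are assumptions added here to keep the search finite.

```python
from collections import defaultdict, deque

def frequent_substrings(text, min_count=2, max_len=4):
    """Breadth-first enumeration of substrings via their position lists."""
    # For each occurrence we store the 0-based index of the NEXT character,
    # so the empty pattern "occurs" before every character of the text.
    queue = deque([("", list(range(len(text))))])
    found = {}
    while queue:
        pattern, pos = queue.popleft()
        if len(pattern) == max_len:
            continue
        # Check all positions of the pattern and group them by next character.
        ext = defaultdict(list)
        for p in pos:
            if p < len(text):
                ext[text[p]].append(p + 1)  # advance the position by 1
        for c, new_pos in sorted(ext.items()):
            if len(new_pos) >= min_count:   # keep only frequent extensions
                found[pattern + c] = len(new_pos)
                queue.append((pattern + c, new_pos))
    return found

counts = frequent_substrings("abracadabradadabraca", min_count=3, max_len=3)
```

Each pattern is extended only through its own position list, so the text is never rescanned from the start.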


Page 95:

SPEXS: count and memorize

i...v....x....v....x
abracadabradadabraca

a  {1,4,6,8,11,13,15,18,20}

   {2,5,7,9,12,14,16,19,21}
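The two position lists on this slide can be reproduced in a few lines. This is a small sketch using 1-based positions as on the slide; the variable names are illustrative only.

```python
text = "abracadabradadabraca"

# Count and memorize: 1-based positions where the pattern "a" occurs.
occ_a = [i + 1 for i, ch in enumerate(text) if ch == "a"]

# Advance every position by 1 to point at the character following each "a".
next_pos = [p + 1 for p in occ_a]

print(occ_a)     # [1, 4, 6, 8, 11, 13, 15, 18, 20]
print(next_pos)  # [2, 5, 7, 9, 12, 14, 16, 19, 21]
```

Position 21 lies past the end of the 20-character text, so that occurrence cannot be extended further.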

Page 96:

SPEXS: extend …

i...v....x....v....x
abracadabradadabraca

a  {2,5,7,9,12,14,16,19,21}

b {2,9,16}   c {5,19}   d {7,12,14}

Page 97:

SPEXS: find frequent first

i...v....x....v....x
abracadabradadabraca

a  {2,5,7,9,12,14,16,19,21}

b {2,9,16}   d {7,12,14}

Page 98:

SPEXS: group positions

i...v....x....v....x
abracadabradadabraca

a  {2,5,7,9,12,14,16,19,21}

[bd].

b d [bd]

{2,9,16} {7,12,14} {2,7,9,12,14,16}
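The extend, filter and group steps from the last three slides can be traced with a short sketch. The helper name `extend` and the frequency threshold of 3 are assumptions made here for illustration; positions are 1-based as on the slides.

```python
from collections import defaultdict

text = "abracadabradadabraca"
after_a = [2, 5, 7, 9, 12, 14, 16, 19, 21]  # positions following each "a"

def extend(text, positions):
    """Split a position list by the character found at each position."""
    by_char = defaultdict(list)
    for p in positions:
        if p <= len(text):               # position 21 is past the end
            by_char[text[p - 1]].append(p)
    return dict(by_char)

# Extend: one child per character that follows "a".
children = extend(text, after_a)
print(children)   # {'b': [2, 9, 16], 'c': [5, 19], 'd': [7, 12, 14]}

# Find frequent first: drop children with fewer than 3 occurrences.
frequent = {c: ps for c, ps in children.items() if len(ps) >= 3}

# Group positions: the character group [bd] is the union of both lists.
group_bd = sorted(frequent["b"] + frequent["d"])
print(group_bd)   # [2, 7, 9, 12, 14, 16]
```

Grouping needs no new scan of the text, because the group's position list is just the merged lists of its members.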

Page 99:

The wildcards

GCAT.{3,6}X

Page 100:

The wildcards

GCAT.*X
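The two wildcard forms read like regular expressions, so their difference can be shown with Python's standard `re` module. This is only an analogy: the helper `wildcard_spans` and the test sequence are made up here, and regex matching (leftmost, greedy, non-overlapping) is not necessarily how SPEXS counts occurrences.

```python
import re

def wildcard_spans(pattern, text):
    """Return 0-based (start, end) spans of non-overlapping regex matches."""
    return [m.span() for m in re.finditer(pattern, text)]

# Hypothetical sequence: X follows the first GCAT after 3 characters
# and the second GCAT after 8 characters.
seq = "GCATAAAX" + "GCAT" + "C" * 8 + "X"

bounded = wildcard_spans(r"GCAT.{3,6}X", seq)   # gap of 3 to 6 characters
unbounded = wildcard_spans(r"GCAT.*X", seq)     # gap of any length

print(bounded)    # [(0, 8)]
print(unbounded)  # [(0, 21)]
```

The bounded wildcard rejects the second GCAT because its gap of 8 exceeds the limit, while the unbounded form stretches across the whole sequence.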

Page 101:

The wildcards: not too many

w:0

aw:0

.{3.6}b1w:1w:0

Page 102:

SPEXS: general algorithm

1. S = input sequences ( ||S|| = n )
2. e = empty pattern, e.pos = {1,...,n}
3. enqueue( order, e )

4. while p = dequeue( order )
5.   generate all allowed extensions p’ of p (& p’.pos)
6.   enqueue( order, p’, priority(p’) )
7.   enqueue( output, p’, fitness(p’) )

8. while p = dequeue( output )
9.   output p
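The numbered steps can be sketched with two priority queues from Python's `heapq`. This is a simplified illustration restricted to plain substring extensions. The name `spexs_sketch`, the priority (shorter patterns first) and the fitness (number of supporting sequences, then pattern length) are assumptions chosen here, not the definitions from the technical report.

```python
import heapq
from collections import defaultdict

def spexs_sketch(sequences, min_seqs=2, max_len=6, top=5):
    # An occurrence is (sequence index, index of the next character).
    empty_pos = [(i, j) for i, s in enumerate(sequences) for j in range(len(s))]
    order = [(0, "", empty_pos)]   # priority queue of patterns to extend
    output = []                    # priority queue of patterns to report
    while order:
        _, p, pos = heapq.heappop(order)
        if len(p) >= max_len:
            continue
        # Generate all one-character extensions p' of p with their positions.
        ext = defaultdict(list)
        for i, j in pos:
            if j < len(sequences[i]):
                ext[sequences[i][j]].append((i, j + 1))
        for c, new_pos in ext.items():
            support = len({i for i, _ in new_pos})  # distinct sequences hit
            if support < min_seqs:
                continue                            # not an allowed extension
            q = p + c
            heapq.heappush(order, (len(q), q, new_pos))      # priority(p')
            heapq.heappush(output, (-support, -len(q), q))   # fitness(p')
    best = heapq.nsmallest(top, output)
    return [(q, -neg_support) for neg_support, _, q in best]

result = spexs_sketch(["ACGTGATGAG", "TTGATGAGCC", "GATGAGAAAA"],
                      min_seqs=3, max_len=6, top=1)
print(result)  # [('GATGAG', 3)]
```

Swapping the priority function changes the traversal order (breadth-first here), and swapping the fitness function changes which patterns are reported first.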

Jaak Vilo: Discovering Frequent Patterns from Strings. Technical Report C-1998-9 (pp. 20), May 1998. Department of Computer Science, University of Helsinki.

Applications in bioinformatics:

-Gene regulation (1998: 255+ citations, 2000: 73 cit)

Jaak Vilo: Pattern Discovery from Biosequences PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Report A-2002-3 Helsinki, November 2002, 149 pages

-Functional elements in proteins (2002: 32 cit)

Page 103:

SPEXS ‐ Sequence Pattern EXhaustive Search
Jaak Vilo, 1998, 2002

• User‐definable pattern language: substrings, character groups, wildcards, flexible wildcards (cf. PROSITE)

• Fast exhaustive search over pattern language

• “Lazy suffix tree construction”‐like algorithm (Kurtz, Giegerich)

• Analyze multiple sets of sequences simultaneously

• Restrict search to most frequent patterns only (in each set)

• Report most frequent patterns, patterns over‐ or underrepresented in selected subsets, or patterns significant by various statistical criteria, e.g. by binomial distribution
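As an illustration of the binomial criterion in the last bullet, the sketch below computes the probability of seeing a pattern in at least k of n sequences when each sequence contains it independently with background probability p. The function name and the example numbers are hypothetical, and the exact scoring used by SPEXS may differ.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical example: a pattern found in 8 of 20 cluster sequences,
# while only 10% of background sequences contain it.
p_value = binomial_tail(8, 20, 0.10)
print(f"{p_value:.3g}")
```

A small tail probability means the pattern is over-represented in the subset relative to the background.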

30min

Page 104:

Multiple data sets

D1 D2 D3

4/3 (6) 3/3 (12) 2/2 (9)

Page 105:

.G.GATGAG.T. 39 seq (vs 193) p = 2.5e-33

Page 106:

-1: .G.GATGAG.T. 61 seq (vs 1292) p = 1.4e-19

Page 107:

-2: .G.GATGAG.T. 91 seq (vs 5464)

Page 108:

-3: .G.GATGAG.T. 98 seq

Page 109:

Jaak Vilo: Pattern Discovery from Biosequences. PhD Thesis, Department of Computer Science, University of Helsinki, Finland. Series of Publications A, Report A-2002-3. Helsinki, November 2002, 149 pages

Page 110:

-2: .G.GATGAG.T. 91 seq

These hits result in a PWM:

Page 111:

PWM based on all previous hits; the highest-scoring occurrences are shown in blue

Page 112:

All against all approximate matching

For every subsequence of every sequence

Match approximately against all the sequences.

Approximate hits define PWM matrices (not all positions vary equally).

Look for ALL PWMs derived from the data that are enriched in the data set (vs. background).
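How approximate hits define a matrix can be sketched as follows. This sketch assumes mismatches only (Hamming distance) and raw counts rather than log-odds weights; the function names and the toy sequences are made up for illustration and are not the authors' tools.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def approximate_hits(query, sequences, max_mismatches=1):
    """Collect every window within max_mismatches of the query."""
    k = len(query)
    hits = []
    for s in sequences:
        for i in range(len(s) - k + 1):
            window = s[i:i + k]
            if hamming(query, window) <= max_mismatches:
                hits.append(window)
    return hits

def count_matrix(hits, alphabet="ACGT"):
    """Per-position character counts over the aligned hits."""
    k = len(hits[0])
    return {c: [sum(h[j] == c for h in hits) for j in range(k)]
            for c in alphabet}

seqs = ["TTGATGAGTT", "CCGATGCGCC", "AAGTTGAGAA"]
hits = approximate_hits("GATGAG", seqs)   # ['GATGAG', 'GATGCG', 'GTTGAG']
matrix = count_matrix(hits)
for c in "ACGT":
    print(c, matrix[c])
```

The counts show that positions 2 and 5 vary while the others are conserved, which is the "not all positions vary equally" point above.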

Hendrik Nigul, Jaak Vilo

Page 113:

Dynamic programming

• A small number of edit operations allows the search to be limited efficiently to a band around the main diagonal
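The band idea can be sketched as an edit-distance computation that fills only the cells with |i - j| <= k, since any alignment with at most k edit operations cannot leave that band. The function name and the convention of returning None when the distance exceeds k are choices made for this illustration.

```python
def banded_edit_distance(a, b, k):
    """Edit distance between a and b if it is <= k, otherwise None."""
    if abs(len(a) - len(b)) > k:
        return None
    INF = k + 1                        # any value above k means "too far"
    prev = [j if j <= k else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        lo, hi = max(1, i - k), min(len(b), i + k)
        for j in range(lo, hi + 1):    # only cells near the main diagonal
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # deletion from a
                         cur[j - 1] + 1,      # insertion into a
                         INF)
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None

print(banded_edit_distance("kitten", "sitting", 3))  # 3
print(banded_edit_distance("kitten", "sitting", 2))  # None
```

Each row touches at most 2k + 1 cells, so the work is about O(k · len(a)) instead of O(len(a) · len(b)).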

Page 114:

Suffix Tree

A C G T

G

G

T

{1:24,2:12,2:23…}
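The slide's picture, a path of characters ending in a list of sequence:position pairs, can be imitated with a depth-limited suffix trie. This is a simplification of a real suffix tree (no edge compression, no linear-time construction), and the names `build_suffix_trie`, `lookup` and `max_depth` are assumptions made for this sketch.

```python
def build_suffix_trie(sequences, max_depth=4):
    """Index every substring of length <= max_depth with its start positions."""
    root = {"children": {}, "pos": []}
    for s_idx, s in enumerate(sequences, start=1):
        for start in range(len(s)):
            node = root
            for ch in s[start:start + max_depth]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "pos": []})
                # Store (sequence number, 1-based start) like {1:24, 2:12, ...}
                node["pos"].append((s_idx, start + 1))
    return root

def lookup(root, pattern):
    """Follow the pattern from the root; return its occurrence list."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["pos"]

trie = build_suffix_trie(["ACGGT", "GGTAC"])
print(lookup(trie, "GGT"))  # [(1, 3), (2, 1)]
```

Patterns longer than `max_depth` are not indexed in this sketch and return an empty list.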

Page 115:

Trie-based all against all approximate matching

• trieindex

• trieagrep

• trieallagrep

• triematrix

Hendrik Nigul, Jaak Vilo