stream, time, sequence mining


Mining Stream, Time-Series, and Sequence Data
Md. Yasser Arafat
MS Student, Dept of CSE, DU
April 22, 2015

Topics Covered
- Methodologies for Stream Data Processing
- Association
- Tilted Time Frame
- Critical Layers
- Lossy Counting Algorithm
- Hoeffding Tree Algorithm
- VFDT (Very Fast Decision Tree learner)
- Categories of Time-Series Movements
- Estimation of Trend Curve
- Similarity Search in Time-Series Analysis

Methodologies for Stream Data Processing
- Random Sampling
- Sliding Windows
- Histograms
- Multiresolution Methods
- Sketches
- Randomized Algorithms

Tilted Time Frame
- Natural tilted time frame
- Time frame structured in multiple granularities based on the natural or usual time scale
- Example: Minimal: quarter, then 4 quarters → 1 hour, 24 hours → day, ...

Tilted Time Frame
- Logarithmic tilted time frame
- Time frame is structured in multiple granularities according to a logarithmic scale
- Example: Minimal: 1 minute, then 1, 2, 4, 8, 16, 32, ...
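A minimal sketch of the logarithmic scale above, assuming the slot widths follow the example (1, then 1, 2, 4, 8, ... time units) and that slots are added until a requested span is covered. The function name `log_slot_widths` is hypothetical and only used for illustration.

```python
def log_slot_widths(total_units):
    """Slot widths (in minimal time units) of a logarithmic tilted time frame.

    Widths follow the pattern 1, 1, 2, 4, 8, ... and slots are added
    until their combined width covers `total_units`.
    """
    if total_units < 1:
        raise ValueError("total_units must be at least 1")
    widths = [1]        # the minimal (most recent) slot
    covered = 1
    next_width = 1
    while covered < total_units:
        widths.append(next_width)
        covered += next_width
        next_width *= 2  # each older slot is twice as coarse
    return widths


if __name__ == "__main__":
    print(log_slot_widths(8))                   # widths covering 8 minutes
    print(len(log_slot_widths(365 * 24 * 60)))  # slots needed for one year of minutes
```

The point of the structure is that the number of slots grows only logarithmically with the length of history that is kept.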

Tilted Time Frame
- Progressive logarithmic tilted time frame
- Snapshots are stored at differing levels of granularity depending on the recency
- Example: Suppose there are 5 frames and each takes at most 3 snapshots. Given a snapshot number N, if N mod 2^d = 0, insert it into frame number d. If a frame then holds more than 3 snapshots, kick out the oldest one.
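The insertion rule in the example can be sketched as follows. This is an illustration under stated assumptions, not the deck's own code: a snapshot is placed only in the highest-numbered frame d for which N mod 2^d = 0, frames are numbered from 0 up to a hypothetical `max_frame`, and each frame keeps its newest `capacity` snapshots.

```python
from collections import deque


def frame_for(n, max_frame):
    """Largest d <= max_frame such that n mod 2**d == 0 (assumed tie-break)."""
    d = 0
    while d < max_frame and n % (2 ** (d + 1)) == 0:
        d += 1
    return d


def build_frames(last_snapshot, max_frame=5, capacity=3):
    """Insert snapshots 1..last_snapshot and return the content of each frame."""
    # deque(maxlen=...) silently drops the oldest entry when a frame is full.
    frames = [deque(maxlen=capacity) for _ in range(max_frame + 1)]
    for n in range(1, last_snapshot + 1):
        frames[frame_for(n, max_frame)].append(n)
    return [list(frame) for frame in frames]


if __name__ == "__main__":
    for d, content in enumerate(build_frames(70)):
        print(f"frame {d}: {content}")
```

Recent snapshots end up in the low-numbered frames at fine granularity, while the high-numbered frames hold a few widely spaced older ones.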

Critical Layers

Lossy Counting Algorithm
- User provides two input parameters:
  - Min support threshold, σ
  - Error bound, ε
- Incoming stream is conceptually divided into buckets of width w = ⌈1/ε⌉
- Each entry in the list keeps:
  - Approximate frequency count, f
  - Maximum possible error, Δ
- If a given item already exists, we simply increase its frequency count, f. Otherwise, we insert it into the list with a frequency count of 1.
- If the new item is from the bth bucket, we set Δ to be b − 1.

Lossy Counting Algorithm
- An item entry is deleted if, for that entry, f + Δ ≤ b

Hoeffding Tree Algorithm
- If G(Xa) − G(Xb) > ε > 0, then G(Xa) > G(Xb), where ε is the Hoeffding bound
- Xa is the best attribute to split with probability 1 − δ

Decision-Tree Induction with Data Streams
[Figure: a decision tree refined as the data stream arrives. Node tests shown: Packets > 10, Protocol = http, Protocol = ftp, Bytes > 60K, with yes/no branches.]
Ack. From Gehrke's SIGMOD tutorial slides

Hoeffding Tree: Strengths and Weaknesses
- Strengths
  - Scales better than traditional methods
    - Sublinear with sampling
    - Very small memory utilization
  - Incremental
    - Make class predictions in parallel
    - New examples are added as they come
- Weakness
  - Could spend a lot of time with ties
  - Memory used with tree expansion
  - Number of candidate attributes

VFDT (Very Fast Decision Tree learner)
- A learning system based on the Hoeffding tree algorithm
- Improvements
  - Breaking near ties
  - Computation of G()
  - Memory utilization
  - Dropping poor attributes
  - Initialization method
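Returning to the Lossy Counting Algorithm, its update and deletion rules can be sketched as below. This is a minimal illustration, assuming single items rather than itemsets and assuming the deletion check runs at each bucket boundary (the slides do not say when it runs). The names `lossy_count` and `frequent_items` are hypothetical, and the final reporting rule f ≥ (σ − ε)N comes from the usual description of the algorithm, not from the slides.

```python
import math


def lossy_count(stream, epsilon):
    """Return ({item: [f, delta]}, N) after one pass over `stream`."""
    width = math.ceil(1 / epsilon)   # bucket width w = ceil(1 / epsilon)
    entries = {}                     # item -> [f, delta]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)          # current bucket id b
        if item in entries:
            entries[item][0] += 1              # existing item: f += 1
        else:
            entries[item] = [1, bucket - 1]    # new item: f = 1, delta = b - 1
        if n % width == 0:                     # assumed: prune at bucket boundary
            stale = [k for k, (f, delta) in entries.items() if f + delta <= bucket]
            for key in stale:
                del entries[key]
    return entries, n


def frequent_items(entries, n, sigma, epsilon):
    """Report items with f >= (sigma - epsilon) * N (assumed output rule)."""
    return {item for item, (f, _) in entries.items() if f >= (sigma - epsilon) * n}


if __name__ == "__main__":
    entries, n = lossy_count("aabcaadeaafg", epsilon=0.25)
    print(entries, n)
    print(frequent_items(entries, n, sigma=0.5, epsilon=0.25))
```

Infrequent items are pruned at every boundary, so the list stays small while each surviving count is underestimated by at most Δ.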

Categories of Time-Series Movements
- Trend or long-term movements
  - General direction in which a time series is moving over a long interval of time
- Cyclic movements or cyclic variations
  - Long-term oscillations about a trend line or curve
  - e.g., business cycles; may or may not be periodic
- Seasonal movements or seasonal variations
  - Almost identical patterns that a time series appears to follow during corresponding months of successive years
- Irregular or random movements
- Time series analysis: decomposition of a time series into these four basic movements
  - Additive model: TS = T + C + S + I
  - Multiplicative model: TS = T × C × S × I

Estimation of Trend Curve
- Freehand method
  - Fit the curve by looking at the graph
  - Costly and barely reliable for large-scale data mining
- Least-squares method
  - Find the curve minimizing the sum of the squares of the deviations of points on the curve from the corresponding data points
- Moving average
  - Moving average of order n: (y1 + y2 + ... + yn)/n, (y2 + y3 + ... + yn+1)/n, (y3 + y4 + ... + yn+2)/n, ...
  - Smooths the data
  - Eliminates cyclic, seasonal and irregular movements
  - Loses the data at the beginning or end of a series
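A moving average of order n can be sketched as below. It is a minimal illustration; the function name `moving_average` and the sample series are chosen for this example only. Note that the result has n − 1 fewer points than the input, which is the loss of data at the ends of the series mentioned above.

```python
def moving_average(values, n):
    """Moving average of order n: means of each run of n consecutive values."""
    if n <= 0 or n > len(values):
        raise ValueError("order n must be between 1 and len(values)")
    window_sum = sum(values[:n])
    result = [window_sum / n]
    for i in range(n, len(values)):
        # Slide the window: add the new value, drop the one that left it.
        window_sum += values[i] - values[i - n]
        result.append(window_sum / n)
    return result


if __name__ == "__main__":
    series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
    print(moving_average(series, 3))
```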

Similarity Search in Time-Series Analysis
- Normal database query finds exact match
- Similarity search finds data sequences that differ only slightly from the given query sequence
- Two categories of similarity queries
  - Whole matching: find a sequence that is similar to the query sequence
  - Subsequence matching: find all pairs of similar sequences
- Typical Applications
  - Financial market
  - Market basket data analysis
  - Scientific databases
  - Medical diagnosis

Data Transformation
- Many techniques for signal analysis require the data to be in the frequency domain
- Reduction techniques
  - discrete Fourier transform (DFT)
  - discrete wavelet transform (DWT)
- The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain

Subsequence Matching
- Break each sequence into a set of pieces of window with length w
- Extract the features of the subsequence inside the window
- Map each sequence to a trail in the feature space
- Divide the trail of each sequence into subtrails and represent each of them with a minimum bounding rectangle
- Use a multi-piece assembly algorithm to search for longer sequence matches

Analysis of Similar Time Series
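The Data Transformation claim that time-domain and frequency-domain Euclidean distances agree can be checked numerically. The sketch below uses a naive DFT and assumes the coefficients are scaled by 1/sqrt(n), which is the scaling under which the two distances are equal. The helper names and the two sample sequences are made up for illustration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform, scaled by 1/sqrt(n) (assumed scaling)."""
    n = len(x)
    scale = 1 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length real or complex sequences."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2, 2, 4, 3, 5, 7, 6, 9]
    print(euclidean(x, y))                     # time domain
    print(euclidean(dft(x), dft(y)))           # frequency domain, all coefficients
    print(euclidean(dft(x)[:3], dft(y)[:3]))   # first 3 coefficients only
```

Keeping only the first few coefficients can only lower the computed distance, so a reduced representation may return extra candidates but does not miss sequences that are truly within the query range.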

Steps for Performing a Similarity Search
- Atomic matching
  - Find all pairs of gap-free windows of a small length that are similar
- Window stitching
  - Stitch similar windows to form pairs of large similar subsequences, allowing gaps between atomic matches
- Subsequence ordering
  - Linearly order the subsequence matches to determine whether enough similar pieces exist

Reference
- Chapter 6, Data Mining: Concepts and Techniques, Third Edition. By Jiawei Han, Micheline Kamber and Jian Pei.

Thank You
