
DATA STREAMS: MODELS AND ALGORITHMS

Edited by

CHARU C. AGGARWAL
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598

Kluwer Academic Publishers
Boston/Dordrecht/London


Contents

List of Figures

List of Tables

Preface

1. An Introduction to Data Streams
Charu C. Aggarwal
   1. Introduction
   2. Stream Mining Algorithms
   3. Conclusions and Summary
   References

2. On Clustering Massive Data Streams: A Summarization Paradigm
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
   1. Introduction
   2. The Micro-clustering Based Stream Mining Framework
   3. Clustering Evolving Data Streams: A Micro-clustering Approach
      3.1 Micro-clustering Challenges
      3.2 Online Micro-cluster Maintenance: The CluStream Algorithm
      3.3 High Dimensional Projected Stream Clustering
   4. Classification of Data Streams: A Micro-clustering Approach
      4.1 On-Demand Stream Classification
   5. Other Applications of Micro-clustering and Research Directions
   6. Performance Study and Experimental Results
   7. Discussion
   References

3. A Survey of Classification Methods in Data Streams
Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy
   1. Introduction
   2. Research Issues
   3. Solution Approaches
   4. Classification Techniques
      4.1 Ensemble Based Classification
      4.2 Very Fast Decision Trees (VFDT)
      4.3 On Demand Classification
      4.4 Online Information Network (OLIN)
      4.5 LWClass Algorithm
      4.6 ANNCAD Algorithm
      4.7 SCALLOP Algorithm
   5. Summary
   References

4. Frequent Pattern Mining in Data Streams
Ruoming Jin and Gagan Agrawal
   1. Introduction
   2. Overview
   3. New Algorithm
   4. Work on Other Related Problems
   5. Conclusions and Future Directions
   References

5. A Survey of Change Diagnosis Algorithms in Evolving Data Streams
Charu C. Aggarwal
   1. Introduction
   2. The Velocity Density Method
      2.1 Spatial Velocity Profiles
      2.2 Evolution Computations in High Dimensional Case
      2.3 On the use of clustering for characterizing stream evolution
   3. On the Effect of Evolution in Data Mining Algorithms
   4. Conclusions
   References

6. Multi-Dimensional Analysis of Data Streams Using Stream Cubes
Jiawei Han, Y. Dora Cai, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, and Jianyong Wang
   1. Introduction
   2. Problem Definition
   3. Architecture for On-line Analysis of Data Streams
      3.1 Tilted time frame
      3.2 Critical layers
      3.3 Partial materialization of stream cube
   4. Stream Data Cube Computation
      4.1 Algorithms for cube computation
   5. Performance Study
   6. Related Work
   7. Possible Extensions
   8. Conclusions
   References

7. Load Shedding in Data Stream Systems
Brian Babcock, Mayur Datar and Rajeev Motwani
   1. Load Shedding for Aggregation Queries
      1.1 Problem Formulation
      1.2 Load Shedding Algorithm
      1.3 Extensions
   2. Load Shedding in Aurora
   3. Load Shedding for Sliding Window Joins
   4. Load Shedding for Classification Queries
   5. Summary
   References

8. The Sliding-Window Computation Model and Results
Mayur Datar and Rajeev Motwani
   0.1 Motivation and Road Map
   1. A Solution to the BasicCounting Problem
      1.1 The Approximation Scheme
   2. Space Lower Bound for BasicCounting Problem
   3. Beyond 0's and 1's
   4. References and Related Work
   5. Conclusion
   References

9. A Survey of Synopsis Construction in Data Streams
Charu C. Aggarwal, Philip S. Yu
   1. Introduction
   2. Sampling Methods
      2.1 Random Sampling with a Reservoir
      2.2 Concise Sampling
   3. Wavelets
      3.1 Recent Research on Wavelet Decomposition in Data Streams
   4. Sketches
      4.1 Fixed Window Sketches for Massive Time Series
      4.2 Variable Window Sketches of Massive Time Series
      4.3 Sketches and their applications in Data Streams
      4.4 Sketches with p-stable distributions
      4.5 The Count-Min Sketch
      4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements
      4.7 Advantages and Limitations of Sketch Based Methods
   5. Histograms
      5.1 One Pass Construction of Equi-depth Histograms
      5.2 Constructing V-Optimal Histograms
      5.3 Wavelet Based Histograms for Query Answering
      5.4 Sketch Based Methods for Multi-dimensional Histograms
   6. Discussion and Challenges
   References

10. A Survey of Join Processing in Data Streams
Junyi Xie and Jun Yang
   1. Introduction
   2. Model and Semantics
   3. State Management for Stream Joins
      3.1 Exploiting Constraints
      3.2 Exploiting Statistical Properties
   4. Fundamental Algorithms for Stream Join Processing
   5. Optimizing Stream Joins
   6. Conclusion
   Acknowledgments
   References

11. Indexing and Querying Data Streams
Ahmet Bulut, Ambuj K. Singh
   1. Introduction
   2. Indexing Streams
      2.1 Preliminaries and definitions
      2.2 Feature extraction
      2.3 Index maintenance
      2.4 Discrete Wavelet Transform
   3. Querying Streams
      3.1 Monitoring an aggregate query
      3.2 Monitoring a pattern query
      3.3 Monitoring a correlation query
   4. Related Work
   5. Future Directions
      5.1 Distributed monitoring systems
      5.2 Probabilistic modeling of sensor networks
      5.3 Content distribution networks
   6. Chapter Summary
   References

12. Dimensionality Reduction and Forecasting on Streams
Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos
   1. Related work
   2. Principal component analysis (PCA)
   3. Auto-regressive models and recursive least squares
   4. MUSCLES
   5. Tracking correlations and hidden variables: SPIRIT
   6. Putting SPIRIT to work
   7. Experimental case studies
   8. Performance and accuracy
   9. Conclusion
   Acknowledgments
   References

13. A Survey of Distributed Mining of Data Streams
Srinivasan Parthasarathy, Amol Ghoting and Matthew Eric Otey
   1. Introduction
   2. Outlier and Anomaly Detection
   3. Clustering
   4. Frequent itemset mining
   5. Classification
   6. Summarization
   7. Mining Distributed Data Streams in Resource Constrained Environments
   8. Systems Support
   References

14. Algorithms for Distributed Data Stream Mining
Kanishka Bhaduri, Kamalika Das, Krishnamoorthy Sivakumar, Hillol Kargupta, Ran Wolff and Rong Chen
   1. Introduction
   2. Motivation: Why Distributed Data Stream Mining?
   3. Existing Distributed Data Stream Mining Algorithms
   4. A local algorithm for distributed data stream mining
      4.1 Local Algorithms: definition
      4.2 Algorithm details
      4.3 Experimental results
      4.4 Modifications and extensions
   5. Bayesian Network Learning from Distributed Data Streams
      5.1 Distributed Bayesian Network Learning Algorithm
      5.2 Selection of samples for transmission to global site
      5.3 Online Distributed Bayesian Network Learning
      5.4 Experimental Results
   6. Conclusion
   References

15. A Survey of Stream Processing Problems and Techniques in Sensor Networks
Sharmila Subramaniam, Dimitrios Gunopulos
   1. Challenges
   2. The Data Collection Model
   3. Data Communication
   4. Query Processing
      4.1 Aggregate Queries
      4.2 Join Queries
      4.3 Top-k Monitoring
      4.4 Continuous Queries
   5. Compression and Modeling
      5.1 Data Distribution Modeling
      5.2 Outlier Detection
   6. Application: Tracking of Objects using Sensor Networks
   7. Summary
   References

Index


List of Figures

2.1 Micro-clustering Examples
2.2 Some Simple Time Windows
2.3 Varying Horizons for the classification process
2.4 Quality comparison (Network Intrusion dataset, horizon=256, stream speed=200)
2.5 Quality comparison (Charitable Donation dataset, horizon=4, stream speed=200)
2.6 Accuracy comparison (Network Intrusion dataset, stream speed=80, buffer size=1600, kfit=80, init number=400)
2.7 Distribution of the (smallest) best horizon (Network Intrusion dataset, Time units=2500, buffer size=1600, kfit=80, init number=400)
2.8 Accuracy comparison (Synthetic dataset B300kC5D20, stream speed=100, buffer size=500, kfit=25, init number=400)
2.9 Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20, Time units=2000, buffer size=500, kfit=25, init number=400)
2.10 Stream Proc. Rate (Charit. Donation data, stream speed=2000)
2.11 Stream Proc. Rate (Ntwk. Intrusion data, stream speed=2000)
2.12 Scalability with Data Dimensionality (stream speed=2000)
2.13 Scalability with Number of Clusters (stream speed=2000)
3.1 The ensemble based classification method
3.2 VFDT Learning Systems
3.3 On Demand Classification
3.4 Online Information Network System
3.5 Algorithm Output Granularity
3.6 ANNCAD Framework
3.7 SCALLOP Process
4.1 Karp et al. Algorithm to Find Frequent Items
4.2 Improving Algorithm with An Accuracy Bound
4.3 StreamMining-Fixed: Algorithm Assuming Fixed Length Transactions
4.4 Subroutines Description
4.5 StreamMining-Bounded: Algorithm with a Bound on Accuracy
4.6 StreamMining: Final Algorithm
5.1 The Forward Time Slice Density Estimate
5.2 The Reverse Time Slice Density Estimate
5.3 The Temporal Velocity Profile
5.4 The Spatial Velocity Profile
6.1 A tilted time frame with natural time partition
6.2 A tilted time frame with logarithmic time partition
6.3 A tilted time frame with progressive logarithmic time partition
6.4 Two critical layers in the stream cube
6.5 Cube structure from the m-layer to the o-layer
6.6 H-tree structure for cube computation
6.7 Cube computation: time and memory usage vs. # tuples at the m-layer for the data set D5L3C10
6.8 Cube computation: time and space vs. # of dimensions for the data set L3C10T100K
6.9 Cube computation: time and space vs. # of levels for the data set D5C10T50K
7.1 Data Flow Diagram
7.2 Illustration of Example 7.1
7.3 Illustration of Observation 1.4
7.4 Procedure SetSamplingRate(x, R_x)
8.1 Sliding window model notation
8.2 An illustration of an Exponential Histogram (EH).
9.1 Illustration of the Wavelet Decomposition
9.2 The Error Tree from the Wavelet Decomposition
10.1 Drifting normal distributions.
10.2 Example ECBs.
10.3 ECBs for sliding-window joins under the frequency-based model.
10.4 ECBs under the age-based model.
11.1 The system architecture for a multi-resolution index structure consisting of 3 levels and stream-specific auto-regressive (AR) models for capturing multi-resolution trends in the data.
11.2 Exact feature extraction, update rate T = 1.
11.3 Incremental feature extraction, update rate T = 1.
11.4 Approximate feature extraction, update rate T = 1.
11.5 Incremental feature extraction, update rate T = 2.
11.6 Transforming an MBR using discrete wavelet transform. Transformation corresponds to rotating the axes (the rotation angle = 45 for Haar wavelets)
11.7 Aggregate query decomposition and approximation composition for a query window of size w = 26.
11.8 Subsequence query decomposition for a query window of size |Q| = 9.
12.1 Illustration of problem.
12.2 Illustration of updating w1 when a new point x_{t+1} arrives.
12.3 Chlorine dataset.
12.4 Mote dataset.
12.5 Critter dataset
12.6 Detail of forecasts on Critter with blanked values.
12.7 River data.
12.8 Wall-clock times (including time to update forecasting models).
12.9 Hidden variable tracking accuracy.
13.1 Centralized Stream Processing Architecture (left) Distributed Stream Processing Architecture (right)
14.1 (A) The area inside an ε circle. (B) Seven evenly spaced vectors u1 . . . u7. (C) The borders of the seven half-spaces ui · x ≥ ε define a polygon in which the circle is circumscribed. (D) The area between the circle and the union of half-spaces.
14.2 Quality of the algorithm with increasing number of nodes
14.3 Cost of the algorithm with increasing number of nodes
14.4 ASIA Model
14.5 Bayesian network for online distributed parameter learning
14.6 Simulation results for online Bayesian learning: (left) KL distance between the conditional probabilities for the networks Bol(k) and Bbe for three nodes (right) KL distance between the conditional probabilities for the networks Bol(k) and Bba for three nodes
15.1 An instance of dynamic cluster assignment in sensor system according to LEACH protocol. Sensor nodes of the same clusters are shown with same symbol and the cluster heads are marked with highlighted symbols.
15.2 Interest Propagation, gradient setup and path reinforcement for data propagation in directed-diffusion paradigm. Event is described in terms of attribute value pairs. The figure illustrates an event detected based on the location of the node and target detection.
15.3 Sensors aggregating the result for a MAX query in-network
15.4 Error filter assignments in tree topology. The nodes that are shown shaded are the passive nodes that take part only in routing the measurements. A sensor communicates a measurement only if it lies outside the interval of values specified by E_i, i.e., the maximum permitted error at the node. A sensor that receives partial results from its children aggregates the results and communicates them to its parent after checking against the error interval
15.5 Usage of duplicate-sensitive sketches to allow result propagation to multiple parents providing fault tolerance. The system is divided into levels during the query propagation phase. Partial results from a higher level (level 2 in the figure) is received at more than one node in the lower level (level 1 in the figure)
15.6 (a) Two dimensional Gaussian model of the measurements from sensors S1 and S2 (b) The marginal distribution of the values of sensor S1, given S2: New observations from one sensor is used to estimate the posterior density of the other sensors
15.7 Estimation of probability distribution of the measurements over sliding window
15.8 Trade-offs in modeling sensor data
15.9 Tracking a target. The leader nodes estimate the probability of the target's direction and determines the next monitoring region that the target is going to traverse. The leaders of the cells within the next monitoring region are alerted


List of Tables

2.1 An example of snapshots stored for α = 2 and l = 2
2.2 A geometric time window
3.1 Data Based Techniques
3.2 Task Based Techniques
3.3 Typical LWClass Training Results
3.4 Summary of Reviewed Techniques
4.1 Algorithms for Frequent Itemsets Mining over Data Streams
8.1 Summary of results for the sliding-window model.
9.1 An Example of Wavelet Coefficient Computation
12.1 Description of notation.
12.2 Description of datasets.
12.3 Reconstruction accuracy (mean squared error rate).


Preface

In recent years, the progress in hardware technology has made it possible for organizations to store and record large streams of transactional data. Such data sets which continuously and rapidly grow over time are referred to as data streams. In addition, the development of sensor technology has resulted in the possibility of monitoring many events in real time. While data mining has become a fairly well established field now, the data stream problem poses a number of unique challenges which are not easily solved by traditional data mining methods.

The topic of data streams is a very recent one. The first research papers on this topic appeared slightly under a decade ago, and since then this field has grown rapidly. There is a large volume of literature which has been published in this field over the past few years. The work is also of great interest to practitioners in the field who have to mine actionable insights from large volumes of continuously growing data. Because of the large volume of literature in the field, practitioners and researchers may often find it an arduous task to isolate the right literature for a given topic. In addition, from a practitioner's point of view, the use of research literature is even more difficult, since much of the relevant material is buried in publications. While handling a real problem, it may often be difficult to know where to look in order to solve the problem.

This book contains contributed chapters from a variety of well known researchers in the data mining field. While the chapters will be written by different researchers, the topics and content will be organized in such a way as to present the most important models, algorithms, and applications in the data mining field in a structured and concise way. In addition, the book is organized in order to make it more accessible to application driven practitioners. Given the lack of structurally organized information on the topic, the book will provide insights which are not easily accessible otherwise. In addition, the book will be a great help to researchers and graduate students interested in the topic. The popularity and current nature of the topic of data streams is likely to make it an important source of information for researchers interested in the topic. The data mining community has grown rapidly over the past few years, and the topic of data streams is one of the most relevant and current areas of interest to the community. This is because of the rapid advancement of the field of data streams in the past two to three years. While the data stream field clearly falls in the emerging category because of its recency, it is now beginning to reach a maturation and popularity point, where the development of an overview book on the topic becomes both possible and necessary. While this book attempts to provide an overview of the stream mining area, it also tries to discuss current topics of interest so as to be useful to students and researchers. It is hoped that this book will provide a reference to students, researchers and practitioners in both introducing the topic of data streams and understanding the practical and algorithmic aspects of the area.


Chapter 1

AN INTRODUCTION TO DATA STREAMS

Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY 10532

[email protected]

Abstract

In recent years, advances in hardware technology have facilitated new ways of collecting data continuously. In many applications such as network monitoring, the volume of such data is so large that it may be impossible to store the data on disk. Furthermore, even when the data can be stored, the volume of the incoming data may be so large that it may be impossible to process any particular record more than once. Therefore, many data mining and database operations such as classification, clustering, frequent pattern mining and indexing become significantly more challenging in this context.

In many cases, the data patterns may evolve continuously, as a result of which it is necessary to design the mining algorithms effectively in order to account for changes in the underlying structure of the data stream. This makes the solutions of the underlying problems even more difficult from an algorithmic and computational point of view. This book contains a number of chapters which are carefully chosen in order to discuss the broad research issues in data streams. The purpose of this chapter is to provide an overview of the organization of the stream processing and mining techniques which are covered in this book.

1. Introduction

In recent years, advances in hardware technology have facilitated the ability to collect data continuously. Simple transactions of everyday life such as using a credit card, a phone or browsing the web lead to automated data storage. Similarly, advances in information technology have led to large flows of data across IP networks. In many cases, these large volumes of data can be mined for interesting and relevant information in a wide variety of applications. When the volume of the underlying data is very large, it leads to a number of computational and mining challenges:

With increasing volume of the data, it is no longer possible to process the data efficiently by using multiple passes. Rather, one can process a data item at most once. This leads to constraints on the implementation of the underlying algorithms. Therefore, stream mining algorithms typically need to be designed so that the algorithms work with one pass of the data.

In most cases, there is an inherent temporal component to the stream mining process. This is because the data may evolve over time. This behavior of data streams is referred to as temporal locality. Therefore, a straightforward adaptation of one-pass mining algorithms may not be an effective solution to the task. Stream mining algorithms need to be carefully designed with a clear focus on the evolution of the underlying data.

Another important characteristic of data streams is that they are often mined in a distributed fashion. Furthermore, the individual processors may have limited processing and memory. Examples of such cases include sensor networks, in which it may be desirable to perform in-network processing of data streams with limited processing and memory [8, 19]. This book will also contain a number of chapters devoted to these topics.

This chapter will provide an overview of the different stream mining algorithms covered in this book. We will discuss the challenges associated with each kind of problem, and discuss an overview of the material in the corresponding chapter.

2. Stream Mining Algorithms

In this section, we will discuss the key stream mining problems and will discuss the challenges associated with each problem. We will also discuss an overview of the material covered in each chapter of this book. The broad topics covered in this book are as follows:

Data Stream Clustering. Clustering is a widely studied problem in the data mining literature. However, it is more difficult to adapt arbitrary clustering algorithms to data streams because of one-pass constraints on the data set. An interesting adaptation of the k-means algorithm has been discussed in [14] which uses a partitioning based approach on the entire data set. This approach uses an adaptation of a k-means technique in order to create clusters over the entire data stream. In the context of data streams, it may be more desirable to determine clusters in specific user defined horizons rather than on the entire data set. In Chapter 2, we discuss the micro-clustering technique [3] which determines clusters over the entire data set. We also discuss a variety of applications of micro-clustering which can perform effective summarization based analysis of the data set. For example, micro-clustering can be extended to the problem of classification on data streams [5]. In many cases, it can also be used for arbitrary data mining applications such as privacy preserving data mining or query estimation.

Data Stream Classification. The problem of classification is perhaps one of the most widely studied in the context of data stream mining. The problem of classification is made more difficult by the evolution of the underlying data stream. Therefore, effective algorithms need to be designed in order to take temporal locality into account. In Chapter 3, we discuss a survey of classification algorithms for data streams. A wide variety of data stream classification algorithms are covered in this chapter. Some of these algorithms are designed to be purely one-pass adaptations of conventional classification algorithms [12], whereas others (such as the methods in [5, 16]) are more effective in accounting for the evolution of the underlying data stream. Chapter 3 discusses the different kinds of algorithms and the relative advantages of each.

Frequent Pattern Mining. The problem of frequent pattern mining was first introduced in [6], and was extensively analyzed for the conventional case of disk resident data sets. In the case of data streams, one may wish to find the frequent itemsets either over a sliding window or the entire data stream [15, 17]. In Chapter 4, we discuss an overview of the different frequent pattern mining algorithms, and also provide a detailed discussion of some interesting recent algorithms on the topic.

Change Detection in Data Streams. As discussed earlier, the patterns in a data stream may evolve over time. In many cases, it is desirable to track and analyze the nature of these changes over time. In [1, 11, 18], a number of methods have been discussed for change detection of data streams. In addition, data stream evolution can also affect the behavior of the underlying data mining algorithms since the results can become stale over time. Therefore, in Chapter 5, we have discussed the different methods for change detection in data streams. We have also discussed the effect of evolution on data stream mining algorithms.

Stream Cube Analysis of Multi-dimensional Streams. Much of stream data resides at a multi-dimensional space and at rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic changes in some combination of dimensions. To discover high-level dynamic and evolving characteristics, one may need to perform multi-level, multi-dimensional on-line analytical processing (OLAP) of stream data. Such necessity calls for the investigation of new architectures that may facilitate on-line analytical processing of multi-dimensional stream data [7, 10].

In Chapter 6, an interesting stream cube architecture is discussed that effectively performs on-line partial aggregation of multi-dimensional stream data, captures the essential dynamic and evolving characteristics of data streams, and facilitates fast OLAP on stream data. The stream cube architecture facilitates online analytical processing of stream data. It also forms a preliminary structure for online stream mining. The impact of the design and implementation of the stream cube in the context of stream mining is also discussed in the chapter.

Load Shedding in Data Streams. Since data streams are generated by processes which are extraneous to the stream processing application, it is not possible to control the incoming stream rate. As a result, it is necessary for the system to have the ability to quickly adjust to varying incoming stream processing rates. Chapter 7 discusses one particular type of adaptivity: the ability to gracefully degrade performance via "load shedding" (dropping unprocessed tuples to reduce system load) when the demands placed on the system cannot be met in full given available resources. Focusing on aggregation queries, the chapter presents algorithms that determine at what points in a query plan load shedding should be performed and what amount of load should be shed at each point in order to minimize the degree of inaccuracy introduced into query answers.

Sliding Window Computations in Data Streams. Many of the synopsis structures discussed use the entire data stream in order to construct the corresponding synopsis structure. The sliding-window model of computation is motivated by the assumption that it is more important to use recent data in data stream computation [9]. Therefore, the processing and analysis is only done on a fixed history of the data stream. Chapter 8 formalizes this model of computation and answers questions about how much space and computation time is required to solve certain problems under the sliding-window model.

Synopsis Construction in Data Streams. The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generate approximate solutions for such problems. In recent years a number of synopsis structures have been developed, which can be used in conjunction with a variety of mining and query processing techniques [13]. Some key synopsis methods include those of sampling, wavelets, sketches and histograms. In Chapter 9, a survey of the key synopsis techniques is provided, along with the mining techniques supported by such methods. The chapter discusses the challenges and tradeoffs associated with using different kinds of techniques, and the important research directions for synopsis construction.

Join Processing in Data Streams. Stream join is a fundamental operation for relating information from different streams. This is especially useful in many applications such as sensor networks in which the streams arriving from different sources may need to be related with one another. In the stream setting, input tuples arrive continuously, and result tuples need to be produced continuously as well. We cannot assume that the input data is already stored or indexed, or that the input rate can be controlled by the query plan. Standard join algorithms that use blocking operations, e.g., sorting, no longer work. Conventional methods for cost estimation and query optimization are also inappropriate, because they assume finite input. Moreover, the long-running nature of stream queries calls for more adaptive processing strategies that can react to changes and fluctuations in data and stream characteristics. The "stateful" nature of stream joins adds another dimension to the challenge. In general, in order to compute the complete result of a stream join, we need to retain all past arrivals as part of the processing state, because a new tuple may join with an arbitrarily old tuple that arrived in the past. This problem is exacerbated by unbounded input streams, limited processing resources, and high performance requirements, as it is impossible in the long run to keep all past history in fast memory. Chapter 10 provides an overview of research problems, recent advances, and future research directions in stream join processing.

Indexing Data Streams. The problem of indexing data streams attempts to create an indexed representation, so that it is possible to efficiently answer different kinds of queries such as aggregation queries or trend based queries. This is especially important in the data stream case because of the huge volume of the underlying data. Chapter 11 explores the problem of indexing and querying data streams.

Dimensionality Reduction and Forecasting in Data Streams. Because of the inherent temporal nature of data streams, the problems of dimensionality reduction and forecasting are particularly important. When there are a large number of simultaneous data streams, we can use the correlations between different data streams in order to make effective predictions [20, 21] on the future behavior of the data stream. In Chapter 12, an overview of dimensionality reduction and forecasting methods is provided for the problem of data streams. In particular, the well known MUSCLES method [21] has been discussed, and its application to data streams has been explored. In addition, the chapter presents the SPIRIT algorithm, which explores the relationship between dimensionality reduction and forecasting in data streams. In particular, the chapter explores the use of a compact number of hidden variables to comprehensively describe the data stream. This compact representation can also be used for effective forecasting of the data streams.

Distributed Mining of Data Streams. In many instances, streams are generated at multiple distributed computing nodes. Analyzing and monitoring data in such environments requires data mining technology that can optimize a variety of criteria such as communication costs across different nodes, as well as computational, memory or storage requirements at each node. A comprehensive survey of the adaptation of different conventional mining algorithms to the distributed case is provided in Chapter 13. In particular, the clustering, classification, outlier detection, frequent pattern mining, and summarization problems are discussed. In Chapter 14, some recent advances in stream mining algorithms are discussed.

Stream Mining in Sensor Networks. With recent advances in hardware technology, it has become possible to track large amounts of data in a distributed fashion with the use of sensor technology. The large amounts of data collected by the sensor nodes makes the problem of monitoring a challenging one from many technological standpoints. Sensor nodes have limited local storage, computational power, and battery life, as a result of which it is desirable to minimize the storage, processing and communication from these nodes. The problem is further magnified by the fact that a given network may have millions of sensor nodes and therefore it is very expensive to localize all the data at a given global node for analysis both from a storage and communication point of view. In Chapter 15, we discuss an overview of a number of stream mining issues in the context of sensor networks. This topic is closely related to distributed stream mining, and a number of concepts related to sensor mining have also been discussed in Chapters 13 and 14.

3. Conclusions and Summary

Data streams are a computational challenge to data mining problems because of the additional algorithmic constraints created by the large volume of data. In addition, the problem of temporal locality leads to a number of unique mining challenges in the data stream case. This chapter provides an overview of the different mining algorithms which are covered in this book. We discussed the different problems and the challenges which are associated with each problem. We also provided an overview of the material in each chapter of the book.


References

[1] Aggarwal C. (2003). A Framework for Diagnosing Changes in Evolving Data Streams. ACM SIGMOD Conference.

[2] Aggarwal C. (2002). An Intuitive Framework for Understanding Changes in Evolving Data Streams. IEEE ICDE Conference.

[3] Aggarwal C., Han J., Wang J., Yu P. (2003). A Framework for Clustering Evolving Data Streams. VLDB Conference.

[4] Aggarwal C., Han J., Wang J., Yu P. (2004). A Framework for High Dimensional Projected Clustering of Data Streams. VLDB Conference.

[5] Aggarwal C., Han J., Wang J., Yu P. (2004). On-Demand Classification of Data Streams. ACM KDD Conference.

[6] Agrawal R., Imielinski T., Swami A. (1993). Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD Conference.

[7] Chen Y., Dong G., Han J., Wah B. W., Wang J. (2002). Multi-dimensional regression analysis of time-series data streams. VLDB Conference.

[8] Cormode G., Garofalakis M. (2005). Sketching Streams Through the Net: Distributed Approximate Query Tracking. VLDB Conference.

[9] Datar M., Gionis A., Indyk P., Motwani R. (2002). Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6):1794–1813.

[10] Dong G., Han J., Lam J., Pei J., Wang K. (2001). Mining multi-dimensional constrained gradients in data cubes. VLDB Conference.

[11] Dasu T., Krishnan S., Venkatasubramaniam S., Yi K. (2005). An Information-Theoretic Approach to Detecting Changes in Multi-dimensional Data Streams. Duke University Technical Report CS-2005-06.

[12] Domingos P. and Hulten G. (2000). Mining High-Speed Data Streams. In Proceedings of the ACM KDD Conference.

[13] Garofalakis M., Gehrke J., Rastogi R. (2002). Querying and mining data streams: you only get one look (a tutorial). SIGMOD Conference.

[14] Guha S., Mishra N., Motwani R., O'Callaghan L. (2000). Clustering Data Streams. IEEE FOCS Conference.

[15] Giannella C., Han J., Pei J., Yan X., and Yu P. (2002). Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Proceedings of the NSF Workshop on Next Generation Data Mining.

[16] Hulten G., Spencer L., Domingos P. (2001). Mining Time Changing Data Streams. ACM KDD Conference.

[17] Jin R., Agrawal G. (2005). An algorithm for in-core frequent itemset mining on streaming data. ICDM Conference.

[18] Kifer D., David S.-B., Gehrke J. (2004). Detecting Change in Data Streams. VLDB Conference.

[19] Kollios G., Byers J., Considine J., Hadjielefttheriou M., Li F. (2005). Robust Aggregation in Sensor Networks. IEEE Data Engineering Bulletin.

[20] Sakurai Y., Papadimitriou S., Faloutsos C. (2005). BRAID: Stream mining through group lag correlations. ACM SIGMOD Conference.

[21] Yi B.-K., Sidiropoulos N.D., Johnson T., Jagadish H. V., Faloutsos C., Biliris A. (2000). Online data mining for co-evolving time sequences. ICDE Conference.


Chapter 2

ON CLUSTERING MASSIVE DATA STREAMS: A SUMMARIZATION PARADIGM

Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY 10532

[email protected]

Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, IL

[email protected]

Jianyong Wang
University of Illinois at Urbana-Champaign
Urbana, IL

[email protected]

Philip S. Yu
IBM T. J. Watson Research Center
Hawthorne, NY 10532

[email protected]

Abstract

In recent years, data streams have become ubiquitous because of the large number of applications which generate huge volumes of data in an automated way. Many existing data mining methods cannot be applied directly on data streams because of the fact that the data needs to be mined in one pass. Furthermore, data streams show a considerable amount of temporal locality because of which a direct application of the existing methods may lead to misleading results. In this paper, we develop an efficient and effective approach for mining fast evolving data streams, which integrates the micro-clustering technique with the high-level data mining process, and discovers data evolution regularities as well. Our analysis and experiments demonstrate that two important data mining problems, namely stream clustering and stream classification, can be performed effectively using this approach, with high quality mining results. We discuss the use of micro-clustering as a general summarization technology to solve data mining problems on streams. Our discussion illustrates the importance of our approach for a variety of mining problems in the data stream domain.

1. Introduction

In recent years, advances in hardware technology have allowed us to automatically record transactions and other pieces of information of everyday life at a rapid rate. Such processes generate huge amounts of online data which grow at an unlimited rate. These kinds of online data are referred to as data streams. The issues on management and analysis of data streams have been researched extensively in recent years because of its emerging, imminent, and broad applications [11, 14, 17, 23].

Many important problems such as clustering and classification have been widely studied in the data mining community. However, a majority of such methods may not be working effectively on data streams. Data streams pose special challenges to a number of data mining algorithms, not only because of the huge volume of the online data streams, but also because of the fact that the data in the streams may show temporal correlations. Such temporal correlations may help disclose important data evolution characteristics, and they can also be used to develop efficient and effective mining algorithms. Moreover, data streams require online mining, in which we wish to mine the data in a continuous fashion. Furthermore, the system needs to have the capability to perform an offline analysis as well, based on the user interests. This is similar to an online analytical processing (OLAP) framework which uses the paradigm of pre-processing once, querying many times.

Based on the above considerations, we propose a new stream mining framework, which adopts a tilted time window framework, takes micro-clustering as a preprocessing process, and integrates the preprocessing with the incremental, dynamic mining process. Micro-clustering preprocessing effectively compresses the data, preserves the general temporal locality of data, and facilitates both online and offline analysis, as well as the analysis of current data and data evolution regularities.

In this study, we primarily concentrate on the application of this technique to two problems: (1) stream clustering, and (2) stream classification. The heart of the approach is to use an online summarization approach which is efficient and also allows for effective processing of the data streams. We also discuss a number of research directions, in which we show how the approach can be adapted to a variety of other problems.

Figure 2.1. Micro-clustering Examples

Figure 2.2. Some Simple Time Windows

This paper is organized as follows. In the next section, we will present our micro-clustering based stream mining framework. In Section 3, we discuss the stream clustering problem. The classification methods are developed in Section 4. In Section 5, we discuss a number of other problems which can be solved with the micro-clustering approach, and other possible research directions. In Section 6, we will discuss some empirical results for the clustering and classification problems. In Section 7 we discuss the issues related to our proposed stream mining methodology and compare it with other related work. Section 8 concludes our study.

Page 29: DATA STREAMS: MODELS AND ALGORITHMS - Charu C. Aggarwal

12 DATA STREAMS: MODELS AND ALGORITHMS

2. The Micro-clustering Based Stream Mining Framework

In order to apply our technique to a variety of data mining algorithms, we utilize a micro-clustering based stream mining framework. This framework is designed by capturing summary information about the nature of the data stream. This summary information is defined by the following structures:

• Micro-clusters: We maintain statistical information about the data locality in terms of micro-clusters. These micro-clusters are defined as a temporal extension of the cluster feature vector [24]. The additivity property of the micro-clusters makes them a natural choice for the data stream problem.

• Pyramidal Time Frame: The micro-clusters are stored at snapshots in time which follow a pyramidal pattern. This pattern provides an effective trade-off between the storage requirements and the ability to recall summary statistics from different time horizons.

The summary information in the micro-clusters is used by an offline component which is dependent upon a wide variety of user inputs such as the time horizon or the granularity of clustering. In order to define the micro-clusters, we will introduce a few concepts. It is assumed that the data stream consists of a set of multi-dimensional records X_1 . . . X_k . . . arriving at time stamps T_1 . . . T_k . . .. Each X_i is a multi-dimensional record containing d dimensions which are denoted by X_i = (x_i^1 . . . x_i^d).

We will first begin by defining the concept of micro-clusters and pyramidal time frame more precisely.

Definition 2.1 A micro-cluster for a set of d-dimensional points X_{i1} . . . X_{in} with time stamps T_{i1} . . . T_{in} is the (2·d + 3) tuple (CF2^x, CF1^x, CF2^t, CF1^t, n), wherein CF2^x and CF1^x each correspond to a vector of d entries. The definition of each of these entries is as follows:

• For each dimension, the sum of the squares of the data values is maintained in CF2^x. Thus, CF2^x contains d values. The p-th entry of CF2^x is equal to Σ_{j=1}^{n} (x_{ij}^p)^2.

• For each dimension, the sum of the data values is maintained in CF1^x. Thus, CF1^x contains d values. The p-th entry of CF1^x is equal to Σ_{j=1}^{n} x_{ij}^p.

• The sum of the squares of the time stamps T_{i1} . . . T_{in} is maintained in CF2^t.

• The sum of the time stamps T_{i1} . . . T_{in} is maintained in CF1^t.

• The number of data points is maintained in n.

We note that the above definition of micro-cluster maintains similar summary information as the cluster feature vector of [24], except for the additional information about time stamps. We will refer to this temporal extension of the cluster feature vector for a set of points C by CFT(C). As in [24], this summary information can be expressed in an additive way over the different data points. This makes it a natural choice for use in data stream algorithms.
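As a concrete illustration of this additivity, the following is a minimal sketch (all class and method names are our own, not taken from the chapter) of a structure that maintains the (2·d + 3) statistics of Definition 2.1 and merges two micro-clusters by entry-wise addition.

```python
# Illustrative sketch of the temporal cluster feature vector of Definition 2.1.
class MicroCluster:
    def __init__(self, d):
        self.cf2x = [0.0] * d   # per-dimension sum of squared data values
        self.cf1x = [0.0] * d   # per-dimension sum of data values
        self.cf2t = 0.0         # sum of squared time stamps
        self.cf1t = 0.0         # sum of time stamps
        self.n = 0              # number of points

    def add_point(self, x, t):
        """Absorb a d-dimensional point x arriving at time stamp t."""
        for p, value in enumerate(x):
            self.cf2x[p] += value * value
            self.cf1x[p] += value
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def merge(self, other):
        """Additivity: the summary of the union of two point sets is the
        entry-wise sum of the two summaries."""
        for p in range(len(self.cf1x)):
            self.cf2x[p] += other.cf2x[p]
            self.cf1x[p] += other.cf1x[p]
        self.cf2t += other.cf2t
        self.cf1t += other.cf1t
        self.n += other.n

    def centroid(self):
        return [s / self.n for s in self.cf1x]
```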

We note that the maintenance of a large number of micro-clusters is essential in the ability to maintain more detailed information about the micro-clustering process. For example, Figure 2.1 forms 3 clusters, which are denoted by a, b, c. At a later stage, evolution forms 3 different figures a1, a2, bc, with a split into a1 and a2, whereas b and c merged into bc. If we keep micro-clusters (each point represents a micro-cluster), such evolution can be easily captured. However, if we keep only 3 cluster centers a, b, c, it is impossible to derive later a1, a2, bc clusters since the information of more detailed points is already lost.

The data stream clustering algorithm discussed in this paper can generate approximate clusters in any user-specified length of history from the current instant. This is achieved by storing the micro-clusters at particular moments in the stream which are referred to as snapshots. At the same time, the current snapshot of micro-clusters is always maintained by the algorithm. The macro-clustering algorithm discussed at a later stage in this paper will use these finer level micro-clusters in order to create higher level clusters which can be more easily understood by the user. Consider for example, the case when the current clock time is t_c and the user wishes to find clusters in the stream based on a history of length h. Then, the macro-clustering algorithm discussed in this paper will use some of the additive properties of the micro-clusters stored at snapshots t_c and (t_c − h) in order to find the higher level clusters in a history or time horizon of length h. Of course, since it is not possible to store the snapshots at each and every moment in time, it is important to choose particular instants of time at which it is possible to store the state of the micro-clusters so that clusters in any user specified time horizon (t_c − h, t_c) can be approximated.
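Continuing the illustrative sketch above, the subtractive counterpart of the additive property is what this horizon computation relies on: the statistics of the points that arrived in the window (t_c − h, t_c) are approximated by subtracting, entry by entry, the snapshot stored at or before t_c − h from the current snapshot. The helper below is hypothetical and only meant to make that step explicit.

```python
def horizon_statistics(current, older):
    """Approximate micro-cluster statistics for the points that arrived between
    the two snapshot times (current minus older, entry by entry)."""
    diff = MicroCluster(len(current.cf1x))
    diff.cf2x = [a - b for a, b in zip(current.cf2x, older.cf2x)]
    diff.cf1x = [a - b for a, b in zip(current.cf1x, older.cf1x)]
    diff.cf2t = current.cf2t - older.cf2t
    diff.cf1t = current.cf1t - older.cf1t
    diff.n = current.n - older.n
    return diff
```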

We note that some examples of time frames used for the clustering process are the natural time frame (Figure 2.2(a) and (b)), and the logarithmic time frame (Figure 2.2(c)). In the natural time frame the snapshots are stored at regular intervals. We note that the scale of the natural time frame could be based on the application requirements. For example, we could choose days, months or years depending upon the level of granularity required in the analysis. A more flexible approach is to use the logarithmic time frame in which different variations of the time interval can be stored. As illustrated in Figure 2.2(c), we store snapshots at times of t, 2·t, 4·t . . .. The danger of this is that we may jump too far between successive levels of granularity. We need an intermediate solution which provides a good balance between the storage requirements and the accuracy with which a user-specified horizon can be approximated.

In order to achieve this, we will introduce the concept of a pyramidal time frame. In this technique, the snapshots are stored at differing levels of granularity depending upon the recency. Snapshots are classified into different orders which can vary from 1 to log(T), where T is the clock time elapsed since the beginning of the stream. The order of a particular class of snapshots defines the level of granularity in time at which the snapshots are maintained. The snapshots of different order are maintained as follows:

• Snapshots of the i-th order occur at time intervals of α^i, where α is an integer and α ≥ 1. Specifically, each snapshot of the i-th order is taken at a moment in time when the clock value from the beginning of the stream is exactly divisible by α^i.

• At any given moment in time, only the last α + 1 snapshots of order i are stored.

We note that the above definition allows for considerable redundancy in storage of snapshots. For example, the clock time of 8 is divisible by 2^0, 2^1, 2^2, and 2^3 (where α = 2). Therefore, the state of the micro-clusters at a clock time of 8 simultaneously corresponds to order 0, order 1, order 2 and order 3 snapshots. From an implementation point of view, a snapshot needs to be maintained only once. We make the following observations:

• For a data stream, the maximum order of any snapshot stored at T time units since the beginning of the stream mining process is log_α(T).

• For a data stream, the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (α + 1) · log_α(T).

• For any user specified time window of h, at least one stored snapshot can be found within 2·h units of the current time.

While the first two results are quite easy to see, the last one needs to be proven formally.

Lemma 2.2 Let h be a user-specified time window, t_c be the current time, and t_s be the time of the last stored snapshot of any order just before the time t_c − h. Then t_c − t_s ≤ 2 · h.

Proof: Let r be the smallest integer such that α^r ≥ h. Therefore, we know that α^(r−1) < h. Since we know that there are α + 1 snapshots of order (r − 1), at least one snapshot of order r − 1 must always exist before t_c − h. Let t_s be the snapshot of order r − 1 which occurs just before t_c − h. Then (t_c − h) − t_s ≤ α^(r−1). Therefore, we have t_c − t_s ≤ h + α^(r−1) < 2 · h.

Thus, in this case, it is possible to find a snapshot within a factor of 2 of any user-specified time window. Furthermore, the total number of snapshots which need to be maintained is relatively modest. For example, for a data stream running for 100 years with a clock time granularity of 1 second, the total number of snapshots which need to be maintained is given by (2 + 1) · log_2(100 ∗ 365 ∗ 24 ∗ 60 ∗ 60) ≈ 95. This is quite a modest requirement given the fact that a snapshot within a factor of 2 can always be found within any user specified time window.

It is possible to improve the accuracy of time horizon approximation at a modest additional cost. In order to achieve this, we save the α^l + 1 snapshots of order r for l > 1. In this case, the storage requirement of the technique corresponds to (α^l + 1) · log_α(T) snapshots. On the other hand, the accuracy of time horizon approximation also increases substantially. In this case, any time horizon can be approximated to a factor of (1 + 1/α^(l−1)). We summarize this result as follows:

Lemma 2.3 Let h be a user specified time horizon, t_c be the current time, and t_s be the time of the last stored snapshot of any order just before the time t_c − h. Then t_c − t_s ≤ (1 + 1/α^(l−1)) · h.

Proof: Similar to previous case.

For larger values of l, the time horizon can be approximated as closely as desired. For example, by choosing l = 10, it is possible to approximate any time horizon within 0.2%, while a total of only (2^10 + 1) · log_2(100 ∗ 365 ∗ 24 ∗ 60 ∗ 60) ≈ 32343 snapshots are required for 100 years. Since historical snapshots can be stored on disk and only the current snapshot needs to be maintained in main memory, this requirement is quite feasible from a practical point of view. It is also possible to specify the pyramidal time window in accordance with user preferences corresponding to particular moments in time such as beginning of calendar years, months, and days. While the storage requirements and horizon estimation possibilities of such a scheme are different, all the algorithmic descriptions of this paper are directly applicable.
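The storage and accuracy figures quoted above are easy to reproduce. The short computation below is only an illustrative check (the function name and parameterization are ours): it evaluates the worst-case snapshot count (α^l + 1) · log_α(T) and the horizon approximation error 1/α^(l−1) for α = 2 over a 100-year stream at one-second granularity, recovering the ≈95 and ≈32343 snapshot counts and the factor-of-2 and 0.2% approximation figures.

```python
import math

def pyramidal_storage(alpha, l, T):
    """Worst-case number of stored snapshots and horizon approximation error
    when the last alpha**l + 1 snapshots of each order are retained."""
    snapshots = (alpha ** l + 1) * math.log(T, alpha)
    error = 1.0 / alpha ** (l - 1)   # horizons approximated within a factor (1 + error)
    return snapshots, error

T = 100 * 365 * 24 * 60 * 60         # 100 years at 1-second clock granularity
print(pyramidal_storage(2, 1, T))    # ~95 snapshots, error 1.0 (i.e., a factor of 2)
print(pyramidal_storage(2, 10, T))   # ~32343 snapshots, error ~0.002 (about 0.2%)
```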

In order to clarify the way in which snapshots are stored, let us consider the case when the stream has been running starting at a clock-time of 1, and a use of α = 2 and l = 2. Therefore 2^2 + 1 = 5 snapshots of each order are stored. Then, at a clock time of 55, snapshots at the clock times illustrated in Table 2.1 are stored.

Order of Snapshots    Clock Times (Last 5 Snapshots)
0                     55 54 53 52 51
1                     54 52 50 48 46
2                     52 48 44 40 36
3                     48 40 32 24 16
4                     48 32 16
5                     32

Table 2.1. An example of snapshots stored for α = 2 and l = 2

We note that a large number of snapshots are common among different orders. From an implementation point of view, the states of the micro-clusters at times of 16, 24, 32, 36, 40, 44, 46, 48, 50, 51, 52, 53, 54, and 55 are stored. It is easy to see that for more recent clock times, there is less distance between successive snapshots (better granularity). We also note that the storage requirements estimated in this section do not take this redundancy into account. Therefore, the requirements which have been presented so far are actually worst-case requirements.
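The pattern in Table 2.1 can be reproduced by simulating the storage rule directly. The sketch below (illustrative only; the names are ours) advances the clock from 1 to 55, records each tick under every order i for which α^i divides the clock value, and keeps the last α^l + 1 entries per order. Printing the result for α = 2 and l = 2 yields exactly the rows of Table 2.1; as noted above, a clock time that belongs to several orders would be stored only once in a real implementation.

```python
from collections import defaultdict, deque

def pyramidal_snapshots(alpha, l, t_now):
    """Snapshot clock times retained per order at time t_now (stream starts at 1)."""
    capacity = alpha ** l + 1
    frames = defaultdict(lambda: deque(maxlen=capacity))
    for t in range(1, t_now + 1):
        order = 0
        while t % (alpha ** (order + 1)) == 0:   # highest i such that alpha**i divides t
            order += 1
        for i in range(order + 1):               # t is a snapshot of every order up to that i
            frames[i].append(t)                  # deque drops the oldest beyond capacity
    return frames

for order, times in sorted(pyramidal_snapshots(2, 2, 55).items()):
    print(order, list(reversed(times)))          # most recent snapshot first, as in Table 2.1
```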

These redundancies can be eliminated by using a systematic rule described in [6], or by using a more sophisticated geometric time frame. In this technique, snapshots are classified into different frame numbers which can vary from 0 to a value no larger than log_2(T), where T is the maximum length of the stream. The frame number of a particular class of snapshots defines the level of granularity in time at which the snapshots are maintained. Specifically, snapshots of frame number i are stored at clock times which are divisible by 2^i, but not by 2^(i+1). Therefore, snapshots of frame number 0 are stored only at odd clock times. It is assumed that for each frame number, at most max_capacity snapshots are stored.

We note that for a data stream, the maximum frame number of any snapshot stored at T time units since the beginning of the stream mining process is log_2(T). Since at most max_capacity snapshots of any order are stored, this also means that the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (max_capacity) · log_2(T). One interesting characteristic of the geometric time window is that for any user-specified time window of h, at least one stored snapshot can be found within a factor of 2 of the specified horizon. This ensures that sufficient granularity is available for analyzing the behavior of the data stream over different time horizons. We will formalize this result in the lemma below.

Lemma 2.4 Let h be a user-specified time window, and t_c be the current time. Let us also assume that max_capacity ≥ 2. Then a snapshot exists at time t_s, such that h/2 ≤ t_c − t_s ≤ 2 · h.

Proof: Let r be the smallest integer such that h < 2^(r+1). Since r is the smallest such integer, it also means that h ≥ 2^r. This means that for any interval (tc − h, tc) of length h, at least one integer t′ ∈ (tc − h, tc) must exist which satisfies the property that t′ mod 2^(r−1) = 0 and t′ mod 2^r ≠ 0. Let t′ be the time stamp of the last (most current) such snapshot. This also means the following:

h/2 ≤ tc − t′ < h (2.1)

Then, if max_capacity is at least 2, the second last snapshot of order (r − 1) is also stored and has a time stamp value of t′ − 2^r. Let us pick the time ts = t′ − 2^r. By substituting the value of ts, we get:

tc − ts = tc − t′ + 2^r (2.2)

Since (tc − t′) ≥ 0 and 2^r > h/2, it easily follows from Equation 2.2 that tc − ts > h/2.


Frame no.    Snapshots (by clock time)
0            69 67 65
1            70 66 62
2            68 60 52
3            56 40 24
4            48 16
5            64 32

Table 2.2. A geometric time window

Since t′ is the position of the latest snapshot of frame (r − 1) occurring before the current time tc, it follows that (tc − t′) ≤ 2^r. Substituting this inequality in Equation 2.2, we get tc − ts ≤ 2^r + 2^r ≤ h + h = 2 · h. Thus, we have:

h/2 ≤ tc − ts ≤ 2 · h (2.3)

The above result ensures that every possible horizon can be closely approximated within a modest level of accuracy. While the geometric time frame shares a number of conceptual similarities with the pyramidal time frame [6], it is actually quite different and also much more efficient. This is because it eliminates the double counting of the snapshots over different frame numbers, as is the case with the pyramidal time frame [6]. In Table 2.2, we present an example of a frame table illustrating snapshots of different frame numbers. The rules for insertion of a snapshot t (at time t) into the snapshot frame table are defined as follows: (1) if (t mod 2^i) = 0 but (t mod 2^(i+1)) ≠ 0, t is inserted into frame number i; (2) each slot has a max_capacity (which is 3 in our example). At the insertion of t into frame number i, if the slot has already reached its max_capacity, the oldest snapshot in this frame is removed and the new snapshot is inserted. For example, at time 70, since (70 mod 2^1) = 0 but (70 mod 2^2) ≠ 0, 70 is inserted into frame number 1, which knocks out the oldest snapshot 58 if the slot capacity is 3. Following this rule, when the slot capacity is 3, the following snapshots are stored in the geometric time window table: 16, 24, 32, 40, 48, 52, 56, 60, 62, 64, 65, 66, 67, 68, 69, 70, as shown in Table 2.2. From the table, one can see that the closer to the current time, the denser are the snapshots stored.
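The insertion rule above is straightforward to implement. The sketch below (hypothetical helper names; max_capacity = 3 as in the example) maintains the frame table for clock times 1 through 70 and reproduces the stored snapshot set listed above:

```python
from collections import defaultdict

def frame_number(t):
    # the frame number of time t: the largest i with t divisible by 2^i but not 2^(i+1)
    i = 0
    while t % (2 ** (i + 1)) == 0:
        i += 1
    return i

def insert_snapshot(frames, t, max_capacity=3):
    i = frame_number(t)
    frames[i].append(t)
    if len(frames[i]) > max_capacity:
        frames[i].pop(0)                 # knock out the oldest snapshot in this frame

frames = defaultdict(list)
for t in range(1, 71):
    insert_snapshot(frames, t)

stored = sorted(s for snaps in frames.values() for s in snaps)
print(stored)   # [16, 24, 32, 40, 48, 52, 56, 60, 62, 64, 65, 66, 67, 68, 69, 70]
```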

3. Clustering Evolving Data Streams: A Micro-clustering Approach

The clustering problem is defined as follows: for a given set of data points, we wish to partition them into one or more groups of similar objects. The similarity of the objects with one another is typically defined with the use of some distance measure or objective function. The clustering problem has been widely researched in the database, data mining and statistics communities [12, 18, 22, 20, 21, 24] because of its use in a wide range of applications. Recently, the clustering problem has also been studied in the context of the data stream environment [17, 23].

A previous algorithm called STREAM [23] assumes that the clusters are to be computed over the entire data stream. While such a task may be useful in many applications, a clustering problem may often be defined only over a portion of a data stream. This is because a data stream should be viewed as an infinite process consisting of data which continuously evolves with time. As a result, the underlying clusters may also change considerably with time. The nature of the clusters may vary with both the moment at which they are computed as well as the time horizon over which they are measured. For example, a data analyst may wish to examine clusters occurring in the last month, last year, or last decade. Such clusters may be considerably different. Therefore, we assume that one of the inputs to the clustering algorithm is a time horizon over which the clusters are found. Next, we will discuss CluStream, the online algorithm used for clustering data streams.

3.1 Micro-clustering Challenges

We note that since stream data naturally imposes a one-pass constraint on the design of the algorithms, it becomes more difficult to provide such flexibility in computing clusters over different kinds of time horizons using conventional algorithms. For example, a direct extension of the stream-based k-means algorithm in [23] to such a case would require simultaneous maintenance of the intermediate results of clustering algorithms over all possible time horizons. Such a computational burden increases with progression of the data stream and can rapidly become a bottleneck for online implementation. Furthermore, in many cases, an analyst may wish to determine the clusters at a previous moment in time, and compare them to the current clusters. This requires even greater book-keeping and can rapidly become unwieldy for fast data streams.

Since a data stream cannot be revisited over the course of the computation, the clustering algorithm needs to maintain a substantial amount of information so that important details are not lost. For example, the algorithm in [23] is implemented as a continuous version of the k-means algorithm, which continues to maintain a number of cluster centers which change or merge as necessary throughout the execution of the algorithm. Such an approach is especially risky when the characteristics of the stream change over time. This is because the amount of information maintained by a k-means type approach is too approximate in granularity, and once two cluster centers are joined, there is no way to informatively split the clusters when required by the changes in the stream at a later stage.


Therefore a natural design for stream clustering is to separate out the process into an online micro-clustering component and an offline macro-clustering component. The online micro-clustering component requires a very efficient process for storage of appropriate summary statistics in a fast data stream. The offline component uses these summary statistics in conjunction with other user input in order to provide the user with a quick understanding of the clusters whenever required. Since the offline component requires only the summary statistics as input, it turns out to be very efficient in practice. This leads to several challenges:
• What is the nature of the summary information which can be stored efficiently in a continuous data stream? The summary statistics should provide sufficient temporal and spatial information for a horizon-specific offline clustering process, while being amenable to an efficient (online) update process.
• At what moments in time should the summary information be stored away on disk? How can an effective trade-off be achieved between the storage requirements of such a periodic process and the ability to cluster for a specific time horizon to within a desired level of approximation?
• How can the periodic summary statistics be used to provide clustering and evolution insights over user-specified time horizons?

3.2 Online Micro-cluster Maintenance: The CluStream Algorithm

The micro-clustering phase is the online statistical data collection portion of the algorithm. This process is not dependent on any user input such as the time horizon or the required granularity of the clustering process. The aim is to maintain statistics at a sufficiently high level of (temporal and spatial) granularity so that they can be effectively used by the offline components such as horizon-specific macro-clustering as well as evolution analysis. The basic concept of the micro-cluster maintenance algorithm derives ideas from the k-means and nearest neighbor algorithms. The algorithm works in an iterative fashion, by always maintaining a current set of micro-clusters. It is assumed that a total of q micro-clusters are stored at any moment by the algorithm. We will denote these micro-clusters by M1 . . . Mq. Associated with each micro-cluster i, we create a unique id whenever it is first created. If two micro-clusters are merged (as will become evident from the details of our maintenance algorithm), a list of ids is created in order to identify the constituent micro-clusters. The value of q is determined by the amount of main memory available in order to store the micro-clusters. Therefore, typical values of q are significantly larger than the natural number of clusters in the data but are also significantly smaller than the number of data points arriving in a long period of time for a massive data stream. These micro-clusters represent the current snapshot of clusters which change over the course of the stream as new points arrive. Their status is stored away on disk whenever the clock time is divisible by α^i for any integer i. At the same time any micro-clusters of order r which were stored at a time in the past more remote than α^(l+r) units are deleted by the algorithm.
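A minimal sketch of such a micro-cluster structure is shown below (Python; the field names are hypothetical but follow the cluster feature vector notation used in this chapter: per-dimension linear and squared sums, the corresponding timestamp sums, a point count, and an id list):

```python
class MicroCluster:
    """Summary statistics for one micro-cluster (additive cluster feature vector)."""

    def __init__(self, point, timestamp, cluster_id):
        self.cf1x = list(point)                    # per-dimension linear sums
        self.cf2x = [v * v for v in point]         # per-dimension squared sums
        self.cf1t = timestamp                      # linear sum of timestamps
        self.cf2t = timestamp * timestamp          # squared sum of timestamps
        self.n = 1                                 # number of points absorbed
        self.id_list = [cluster_id]                # ids of constituent micro-clusters

    def absorb(self, point, timestamp):
        # CF additivity: statistics are updated by simple addition
        for j, v in enumerate(point):
            self.cf1x[j] += v
            self.cf2x[j] += v * v
        self.cf1t += timestamp
        self.cf2t += timestamp * timestamp
        self.n += 1

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def merge(self, other):
        # merging unions the id lists so that constituent clusters remain identifiable
        for j in range(len(self.cf1x)):
            self.cf1x[j] += other.cf1x[j]
            self.cf2x[j] += other.cf2x[j]
        self.cf1t += other.cf1t
        self.cf2t += other.cf2t
        self.n += other.n
        self.id_list += other.id_list
```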

We first need to create the initial q micro-clusters. This is done using an offline process at the very beginning of the data stream computation process. At the very beginning of the data stream, we store the first InitNumber points on disk and use a standard k-means clustering algorithm in order to create the q initial micro-clusters. The value of InitNumber is chosen to be as large as permitted by the computational complexity of a k-means algorithm creating q clusters.

Once these initial micro-clusters have been established, the online process of updating the micro-clusters is initiated. Whenever a new data point Xik arrives, the micro-clusters are updated in order to reflect the changes. Each data point either needs to be absorbed by a micro-cluster, or it needs to be put in a cluster of its own. The first preference is to absorb the data point into a currently existing micro-cluster. We first find the distance of each data point to the micro-cluster centroids M1 . . . Mq. Let us denote this distance value of the data point Xik to the centroid of the micro-cluster Mj by dist(Mj, Xik). Since the centroid of the micro-cluster is available in the cluster feature vector, this value can be computed relatively easily.

We find the closest cluster Mp to the data point Xik. We note that in many cases, the point Xik does not naturally belong to the cluster Mp. These cases are as follows:
• The data point Xik corresponds to an outlier.
• The data point Xik corresponds to the beginning of a new cluster because of evolution of the data stream.
While the two cases above cannot be distinguished until more data points arrive, the data point Xik needs to be assigned a (new) micro-cluster of its own with a unique id. How do we decide whether a completely new cluster should be created? In order to make this decision, we use the cluster feature vector of Mp to decide if this data point falls within the maximum boundary of the micro-cluster Mp. If so, then the data point Xik is added to the micro-cluster Mp using the CF additivity property. The maximum boundary of the micro-cluster Mp is defined as a factor of t of the RMS deviation of the data points in Mp from the centroid. We define this as the maximal boundary factor. We note that the RMS deviation can only be defined for a cluster with more than 1 point. For a cluster with only 1 previous point, the maximum boundary is defined in a heuristic way. Specifically, we choose it to be r times that of the next closest cluster.
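The following is a hedged sketch of this absorption test; it builds on the MicroCluster sketch above and treats the maximal boundary factor t and the single-point heuristic factor r as free parameters:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rms_deviation(mc):
    # root-mean-square deviation of the points in mc from its centroid
    var = sum(mc.cf2x[j] / mc.n - (mc.cf1x[j] / mc.n) ** 2
              for j in range(len(mc.cf1x)))
    return math.sqrt(max(var, 0.0))

def maximum_boundary(clusters, p, t_factor=2.0, r_factor=2.0):
    mc = clusters[p]
    if mc.n > 1:
        return t_factor * rms_deviation(mc)
    # heuristic for a single-point cluster: r times the distance to the
    # centroid of the next closest micro-cluster
    return r_factor * min(dist(mc.centroid(), c.centroid())
                          for i, c in enumerate(clusters) if i != p)

def try_absorb(clusters, point, timestamp):
    # find the closest micro-cluster; absorb the point if it falls within the
    # maximum boundary, otherwise signal that a new micro-cluster is needed
    p = min(range(len(clusters)), key=lambda i: dist(point, clusters[i].centroid()))
    if dist(point, clusters[p].centroid()) <= maximum_boundary(clusters, p):
        clusters[p].absorb(point, timestamp)
        return True
    return False
```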

If the data point does not lie within the maximum boundary of the nearest micro-cluster, then a new micro-cluster must be created containing the data point Xik. This newly created micro-cluster is assigned a new id which can identify it uniquely at any future stage of the data stream process. However, in order to create this new micro-cluster, the number of other clusters must be reduced by one in order to create memory space. This can be achieved by either deleting an old cluster or joining two of the old clusters. Our maintenance algorithm first determines if it is safe to delete any of the current micro-clusters as outliers. If not, then a merge of two micro-clusters is initiated.

The first step is to identify if any of the old micro-clusters are possibly outliers which can be safely deleted by the algorithm. While it might be tempting to simply pick the micro-cluster with the fewest number of points as the micro-cluster to be deleted, this may often lead to misleading results. In many cases, a given micro-cluster might correspond to a point of considerable cluster presence in the past history of the stream, but may no longer be an active cluster in the recent stream activity. Such a micro-cluster can be considered an outlier from the current point of view. An ideal goal would be to estimate the average timestamp of the last m arrivals in each micro-cluster2, and delete the micro-cluster with the least recent timestamp. While the above estimation can be achieved by simply storing the last m points in each micro-cluster, this increases the memory requirements of a micro-cluster by a factor of m. Such a requirement reduces the number of micro-clusters that can be stored by the available memory and therefore reduces the effectiveness of the algorithm.

We will find a way to approximate the average timestamp of the last m data points of the cluster M. This will be achieved by using the data about the timestamps stored in the micro-cluster M. We note that the timestamp data allows us to calculate the mean and standard deviation3 of the arrival times of points in a given micro-cluster M. Let these values be denoted by µM and σM respectively. Then, we find the time of arrival of the m/(2·n)-th percentile of the points in M assuming that the timestamps are normally distributed. This timestamp is used as the approximate value of the recency. We shall call this value the relevance stamp of cluster M. When the least relevance stamp of any micro-cluster is below a user-defined threshold δ, it can be eliminated and a new micro-cluster can be created with a unique id corresponding to the newly arrived data point Xik.
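A sketch of the relevance-stamp computation is given below. It assumes the MicroCluster fields introduced earlier, uses the footnoted fallback for clusters with fewer than 2·m points, and interprets the percentile as being measured from the most recent arrivals (i.e., the (1 − m/(2n)) quantile of the fitted normal distribution):

```python
import math
from statistics import NormalDist

def relevance_stamp(mc, m):
    """Approximate average timestamp of the last m points absorbed by mc."""
    mu = mc.cf1t / mc.n
    if mc.n < 2 * m:
        # fewer than 2m points: simply use the average timestamp of all points
        return mu
    sigma = math.sqrt(max(mc.cf2t / mc.n - mu * mu, 0.0))
    if sigma == 0.0:
        return mu
    # timestamp of the m/(2n)-th percentile, counted from the most recent end
    return NormalDist(mu, sigma).inv_cdf(1.0 - m / (2.0 * mc.n))

def deletion_candidate(clusters, m):
    """Index of the micro-cluster with the least recent relevance stamp."""
    return min(range(len(clusters)), key=lambda i: relevance_stamp(clusters[i], m))
```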

In some cases, none of the micro-clusters can be readily eliminated. This happens when all relevance stamps are sufficiently recent and lie above the user-defined threshold δ. In such a case, two of the micro-clusters need to be merged. We merge the two micro-clusters which are closest to one another. The new micro-cluster no longer corresponds to one id. Instead, an idlist is created which is a union of the ids in the individual micro-clusters. Thus, any micro-cluster which is the result of one or more merging operations can be identified in terms of the individual micro-clusters merged into it.


While the above process of updating is executed at the arrival of each data point, an additional process is executed at each clock time which is divisible by α^i for any integer i. At each such time, we store away the current set of micro-clusters (possibly on disk) together with their id list, indexed by their time of storage. We also delete the least recent snapshot of order i, if α^l + 1 snapshots of such order have already been stored on disk, and if the clock time for this snapshot is not divisible by α^(i+1). (In the latter case, the snapshot continues to be a viable snapshot of order (i + 1).) These micro-clusters can then be used to form higher level clusters or an evolution analysis of the data stream.
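The bookkeeping at each clock tick can be summarized by the following sketch of the pyramidal storage routine (persist and delete_snapshot are hypothetical disk-I/O hooks; snapshots maps each order to the clock times of its stored snapshots):

```python
def orders_to_store(t, alpha):
    # all orders i such that the clock time t is divisible by alpha^i
    i, orders = 0, []
    while t % (alpha ** i) == 0:
        orders.append(i)
        i += 1
    return orders

def store_snapshots(snapshots, t, clusters, alpha, l, persist, delete_snapshot):
    orders = orders_to_store(t, alpha)
    if orders:
        persist(t, clusters)                 # the snapshot itself is written once
    for i in orders:
        snapshots.setdefault(i, []).append(t)
        if len(snapshots[i]) > alpha ** l + 1:
            oldest = snapshots[i].pop(0)
            # delete only if the snapshot is not also viable at order i + 1
            if oldest % (alpha ** (i + 1)) != 0:
                delete_snapshot(oldest)
```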

3.3 High Dimensional Projected Stream Clustering

The method can also be extended to the case of high dimensional projected stream clustering. The algorithm is referred to as HPSTREAM. The high-dimensional case presents a special challenge to clustering algorithms even in the traditional domain of static data sets. This is because of the sparsity of the data in the high-dimensional case. In high-dimensional space, all pairs of points tend to be almost equidistant from one another. As a result, it is often unrealistic to define distance-based clusters in a meaningful way. Some recent work on high-dimensional data uses techniques for projected clustering which can determine clusters for a specific subset of dimensions [1, 4]. In these methods, the definitions of the clusters are such that each cluster is specific to a particular group of dimensions. This alleviates the sparsity problem in high-dimensional space to some extent. Even though a cluster may not be meaningfully defined on all the dimensions because of the sparsity of the data, some subset of the dimensions can always be found on which particular subsets of points form high quality and meaningful clusters. Of course, these subsets of dimensions may vary over the different clusters. Such clusters are referred to as projected clusters [1].

In [8], we have discussed methods for high dimensional projected clustering of data streams. The basic idea is to use an (incremental) algorithm in which we associate a set of dimensions with each cluster. The set of dimensions is represented as a d-dimensional bit vector B(Ci) for each cluster structure in FCS. This bit vector contains a 1 bit for each dimension which is included in cluster Ci. In addition, the maximum number of clusters k and the average cluster dimensionality l are used as input parameters. The average cluster dimensionality l represents the average number of dimensions used in the cluster projection. An iterative approach is used in which the dimensions are used to update the clusters and vice-versa. The structure in FCS uses a decay-based mechanism in order to adjust for evolution in the underlying data stream. Details are discussed in [8].
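As a toy illustration only (not the actual iterative procedure of [8]), the dimension bit vector B(Ci) could be refreshed by keeping, for each cluster, the l dimensions along which its summary statistics show the smallest spread:

```python
def dimension_bit_vector(mc, l):
    """Toy refresh of B(Ci): keep the l dimensions with the smallest spread."""
    d = len(mc.cf1x)
    spread = [mc.cf2x[j] / mc.n - (mc.cf1x[j] / mc.n) ** 2 for j in range(d)]
    keep = sorted(range(d), key=lambda j: spread[j])[:l]
    bits = [0] * d
    for j in keep:
        bits[j] = 1
    return bits
```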


Figure 2.3. Varying Horizons for the classification process (feature value versus time, at times t1 and t2)

4. Classification of Data Streams: A Micro-clustering Approach

One important data mining problem which has been studied in the context of data streams is that of stream classification [15]. The main thrust on data stream mining in the context of classification has been that of one-pass mining [14, 19]. In general, the use of one-pass mining does not recognize the changes which have occurred in the model since the beginning of the stream construction process [5]. While the work in [19] works on time changing data streams, the focus is on providing effective methods for incremental updating of the classification model. We note that the accuracy of such a model cannot be greater than the best sliding window model on a data stream. For example, in the case illustrated in Figure 2.3, we have illustrated two classes (labeled by 'x' and '-') whose distribution changes over time. Correspondingly, the best horizon at times t1 and t2 will also be different. As our empirical results will show, the true behavior of the data stream is captured in a temporal model which is sensitive to the level of evolution of the data stream.

The classification process may require simultaneous model construction and testing in an environment which constantly evolves over time. We assume that the testing process is performed concurrently with the training process. This is often the case in many practical applications, in which only a portion of the data is labeled, whereas the remainder is not. Therefore, such data can be separated out into the (labeled) training stream, and the (unlabeled) testing stream. The main difference in the construction of the micro-clusters is that the micro-clusters are associated with a class label; therefore an incoming data point in the training stream can only be added to a micro-cluster belonging to the same class. Therefore, we construct micro-clusters in almost the same way as in the unsupervised algorithm, with an additional class-label restriction.

From the testing perspective, the important point to be noted is that the most effective classification model does not stay constant over time, but varies with progression of the data stream. If a static classification model were used for an evolving test stream, the accuracy of the underlying classification process is likely to drop suddenly when there is a sudden burst of records belonging to a particular class. In such a case, a classification model which is constructed using a smaller history of data is likely to provide better accuracy. In other cases, a longer history of training provides greater robustness.

In the classification process of an evolving data stream, either the short term or long term behavior of the stream may be more important, and it often cannot be known a priori which one is more important. How do we decide the window or horizon of the training data to use so as to obtain the best classification accuracy? While techniques such as decision trees are useful for one-pass mining of data streams [14, 19], these cannot be easily used in the context of an on-demand classifier in an evolving environment. This is because such a classifier requires rapid variation in the horizon selection process due to data stream evolution. Furthermore, it is too expensive to keep track of the entire history of the data in its original fine granularity. Therefore, the on-demand classification process still requires the appropriate machinery for efficient statistical data collection in order to perform the classification process.

4.1 On-Demand Stream Classification

We use the micro-clusters to perform an On Demand Stream Classification Process. In order to perform effective classification of the stream, it is important to find the correct time horizon which should be used for classification. How do we find the most effective horizon for classification at a given moment in time? In order to do so, a small portion of the training stream is not used for the creation of the micro-clusters. This portion of the training stream is referred to as the horizon fitting stream segment. The number of points in the stream used for horizon fitting is denoted by kfit. The remaining portion of the training stream is used for the creation and maintenance of the class-specific micro-clusters as discussed in the previous section.

Since the micro-clusters are based on the entire history of the stream, they cannot directly be used to test the effectiveness of the classification process over different time horizons. This is essential, since we would like to find the time horizon which provides the greatest accuracy during the classification process. We will denote the set of micro-clusters at time tc and horizon h by N(tc, h). This set of micro-clusters is determined by subtracting out the micro-clusters at time tc − h from the micro-clusters at time tc. The subtraction operation is naturally defined for the micro-clustering approach. The essential idea is to match the micro-clusters at time tc to the micro-clusters at time tc − h, and subtract out the corresponding statistics. The additive property of micro-clusters ensures that the resulting clusters correspond to the horizon (tc − h, tc). More details can be found in [6].
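A sketch of this subtraction is given below, using the additive MicroCluster statistics from the earlier sketch. The matching rule shown here (pairing clusters through shared ids and keeping unmatched clusters intact) is a simplification of the procedure in [6]:

```python
import copy

def subtract(current, past):
    """Statistics of a micro-cluster restricted to the horizon between two snapshots."""
    result = copy.deepcopy(current)
    result.cf1x = [a - b for a, b in zip(current.cf1x, past.cf1x)]
    result.cf2x = [a - b for a, b in zip(current.cf2x, past.cf2x)]
    result.cf1t = current.cf1t - past.cf1t
    result.cf2t = current.cf2t - past.cf2t
    result.n = current.n - past.n
    return result

def horizon_clusters(snapshot_now, snapshot_past):
    """Approximate N(tc, h) from the snapshots at tc and at tc - h."""
    past_by_id = {i: mc for mc in snapshot_past for i in mc.id_list}
    result = []
    for mc in snapshot_now:
        match = next((past_by_id[i] for i in mc.id_list if i in past_by_id), None)
        result.append(subtract(mc, match) if match else mc)
    return result
```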

Once the micro-clusters for a particular time horizon have been determined, they are utilized to determine the classification accuracy of that particular horizon. This process is executed periodically in order to adjust for the changes which have occurred in the stream in recent time periods. For this purpose, we use the horizon fitting stream segment. The last kfit points which have arrived in the horizon fitting stream segment are utilized in order to test the classification accuracy of that particular horizon. The value of kfit is chosen while taking into consideration the computational complexity of the horizon accuracy estimation. In addition, the value of kfit should be small enough so that the points in it reflect the immediate locality of tc. Typically, the value of kfit should be chosen in such a way that the least recent point is no more than a pre-specified number of time units from the current time tc. Let us denote this set of points by Qfit. Note that since Qfit is a part of the training stream, the class labels are known a priori.

In order to test the classification accuracy of the process, each point X ∈ Qfit is used in the following nearest neighbor classification procedure:
• We find the closest micro-cluster in N(tc, h) to X.
• We determine the class label of this micro-cluster and compare it to the true class label of X.
The accuracy over all the points in Qfit is then determined. This provides the accuracy over that particular time horizon.

The accuracy of all the time horizons which are tracked by the geometric time frame is determined. The p time horizons which provide the greatest dynamic classification accuracy (using the last kfit points) are selected for the classification of the stream. Let us denote the corresponding horizon values by H = h1 . . . hp. We note that since kfit represents only a small locality of the points within the current time period tc, it would seem at first sight that the system would always pick the smallest possible horizons in order to maximize the accuracy of classification. However, this is often not the case for evolving data streams. Consider for example a data stream in which the records for a given class arrive for a period, and then subsequently start arriving again after a time interval in which the records for another class have arrived. In such a case, the horizon which includes previous occurrences of the same class is likely to provide higher accuracy than shorter horizons. Thus, such a system dynamically adapts to the most effective horizon for classification of data streams. In addition, for a stable stream the system is also likely to pick larger horizons because of the greater accuracy resulting from the use of larger data sizes.


The classification of the test stream is a separate process which is executed continuously throughout the algorithm. For each given test instance Xt, the above described nearest neighbor classification process is applied using each hi ∈ H. It is often possible that in the case of a rapidly evolving data stream, different horizons may result in the determination of different class labels. The majority class among these p class labels is reported as the relevant class. More details on the technique may be found in [7].
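A condensed sketch of the overall on-demand classification logic follows; class_label is a hypothetical per-micro-cluster attribute recording the class associated with each supervised micro-cluster, and horizon_cluster_sets maps each tracked horizon h to N(tc, h):

```python
import math
from collections import Counter

def classify(point, clusters):
    """Nearest-neighbor rule: report the class label of the closest micro-cluster."""
    return min(clusters, key=lambda mc: math.dist(point, mc.centroid())).class_label

def best_horizons(q_fit, horizon_cluster_sets, p):
    """Rank the tracked horizons by their accuracy on the horizon-fitting segment."""
    acc = {h: sum(classify(x, cs) == y for x, y in q_fit) / len(q_fit)
           for h, cs in horizon_cluster_sets.items()}
    return sorted(acc, key=acc.get, reverse=True)[:p]

def on_demand_label(x_test, horizons, horizon_cluster_sets):
    """Majority vote over the class labels reported by the p selected horizons."""
    votes = [classify(x_test, horizon_cluster_sets[h]) for h in horizons]
    return Counter(votes).most_common(1)[0][0]
```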

5. Other Applications of Micro-clustering and Research Directions

While this paper discusses two applications of micro-clustering, we note that a number of other problems can be handled with the micro-clustering approach. This is because the process of micro-clustering creates a summary of the data which can be leveraged in a variety of ways for other problems in data mining. Some examples of such problems are as follows:

Privacy Preserving Data Mining: In the problem of privacy preserving data mining, we create condensed representations [3] of the data which show k-anonymity. These condensed representations are like micro-clusters, except that each cluster has a minimum cardinality threshold on the number of data points in it. Thus, each cluster contains at least k data points, and we ensure that each record in the data cannot be distinguished from at least k other records. For this purpose, we only maintain the summary statistics for the data points in the clusters as opposed to the individual data points themselves. In addition to the first and second order moments we also maintain the covariance matrix for the data in each cluster. We note that the covariance matrix provides a complete overview of the distribution of the data. This covariance matrix can be used in order to generate the pseudo-points which match the distribution behavior of the data in each micro-cluster. For relatively small micro-clusters, it is possible to match the probabilistic distribution in the data fairly closely. The pseudo-points can be used as a surrogate for the actual data points in the clusters in order to generate the relevant data mining results. Since the pseudo-points match the original distribution quite closely, they can be used for the purpose of a variety of data mining algorithms. In [3], we have illustrated the use of the privacy-preserving technique in the context of the classification problem. Our results show that the classification accuracy is not significantly reduced because of the use of pseudo-points instead of the individual data points.

Query Estimation: Since micro-clusters encode summary information about the data, they can also be used for query estimation. A typical example of such a technique is that of estimating the selectivity of queries.


In such cases, the summary statistics of micro-clusters can be used in order to estimate the number of data points which lie within a certain interval, such as a range query. Such an approach can be very efficient in a variety of applications, since voluminous data streams are difficult to use directly for query estimation. However, the micro-clustering approach can condense the data into summary statistics, so that it is possible to efficiently use it for various kinds of queries. We note that the technique is quite flexible, since it can be used for different kinds of queries. An example of such a technique is illustrated in [9], in which we use the micro-clustering technique (with some modifications on the tracked statistics) for futuristic query processing in data streams.

Statistical Forecasting: Since micro-clusters contain temporal and condensed information, they can be used for methods such as statistical forecasting of streams. While it can be computationally intensive to use standard forecasting methods with large volumes of data points, the micro-clustering approach provides a methodology in which the condensed data can be used as a surrogate for the original data points. For example, for a standard regression problem, it is possible to use the centroids of different micro-clusters over the various temporal time frames in order to estimate the values of the data points. These values can then be used for making aggregate statistical observations about the future. We note that this is a useful approach in many applications since it is often not possible to effectively make forecasts about the future using the large volume of the data in the stream. In [9], it has been shown how to use the technique for querying and analysis of future behavior of data streams.

In addition, we believe that the micro-clustering approach is powerful enough to accommodate a wide variety of problems which require information about the summary distribution of the data. In general, since many new data mining problems require summary information about the data, it is conceivable that the micro-clustering approach can be used as a methodology to store condensed statistics for general data mining and exploration applications.

6. Performance Study and Experimental Results

All of our experiments were conducted on a PC with an Intel Pentium III processor and 512 MB of memory, running the Windows XP Professional operating system. For testing the accuracy and efficiency of the CluStream algorithm, we compare CluStream with the STREAM algorithm [17, 23], the best algorithm reported so far for clustering data streams. CluStream is implemented according to the description in this paper, and STREAM K-means is implemented strictly according to [23], which shows better accuracy than BIRCH [24]. To make the comparison fair, both CluStream and STREAM K-means use the same amount of memory.


Specifically, they use the same stream incoming speed, the same amount of memory to store intermediate clusters (called micro-clusters in CluStream), and the same amount of memory to store the final clusters (called macro-clusters in CluStream).

Because the synthetic datasets can be generated by controlling the number of data points, the dimensionality, and the number of clusters, with different distribution or evolution characteristics, they are used to evaluate the scalability in our experiments. However, since synthetic datasets are usually rather different from real ones, we will mainly use real datasets to test accuracy, cluster evolution, and outlier detection.
Real datasets. First, we need to find some real datasets that evolve significantly over time in order to test the effectiveness of CluStream. A good candidate for such testing is the KDD-CUP'99 Network Intrusion Detection stream data set which has been used earlier [23] to evaluate STREAM accuracy with respect to BIRCH. This data set corresponds to the important problem of automatic and real-time detection of cyber attacks. This is also a challenging problem for dynamic stream clustering in its own right. Offline clustering algorithms cannot detect such intrusions in real time. Even the recently proposed stream clustering algorithms such as BIRCH and STREAM cannot be very effective, because the clusters reported by these algorithms are all generated from the entire history of the data stream, whereas the current cases may have evolved significantly.

The Network Intrusion Detection dataset consists of a series of TCP connection records from two weeks of LAN network traffic managed by MIT Lincoln Labs. Each record can either correspond to a normal connection, or an intrusion or attack. The attacks fall into four main categories: DOS (i.e., denial-of-service), R2L (i.e., unauthorized access from a remote machine), U2R (i.e., unauthorized access to local superuser privileges), and PROBING (i.e., surveillance and other probing). As a result, the data contains a total of five clusters including the class for "normal connections". The attack types are further classified into one of 24 types, such as buffer-overflow, guess-passwd, neptune, portsweep, rootkit, smurf, warezclient, spy, and so on. It is evident that each specific attack type can be treated as a sub-cluster. Most of the connections in this dataset are normal, but occasionally there could be a burst of attacks at certain times. Also, each connection record in this dataset contains 42 attributes, such as the duration of the connection, the number of data bytes transmitted from source to destination (and vice versa), the percentile of connections that have "SYN" errors, the number of "root" accesses, etc. As in [23], all 34 continuous attributes are used for clustering and one outlier point has been removed.

Second, besides testing on the rapidly evolving network intrusion data stream, we also test our method over relatively stable streams. Since previously reported stream clustering algorithms work on the entire history of stream data, we believe that they should perform effectively for some data sets with a stable distribution over time. An example of such a data set is the KDD-CUP'98 Charitable Donation data set. We will show that even for such datasets, CluStream can consistently beat the STREAM algorithm.

The KDD-CUP'98 Charitable Donation data set has also been used in evaluating several one-scan clustering algorithms, such as [16]. This data set contains 95412 records of information about people who have made charitable donations in response to direct mailing requests, and clustering can be used to group donors showing similar donation behavior. As in [16], we will only use 56 fields which can be extracted from the total 481 fields of each record. This data set is converted into a data stream by taking the data input order as the order of streaming and assuming that the records flow in at a uniform speed.
Synthetic datasets. To test the scalability of CluStream, we generate some synthetic datasets by varying the base size from 100K to 1000K points, the number of clusters from 4 to 64, and the dimensionality in the range of 10 to 100. Because we know the true cluster distribution a priori, we can compare the clusters found with the true clusters. The data points of each synthetic dataset follow a series of Gaussian distributions, and to reflect the evolution of the stream data over time, we change the mean and variance of the current Gaussian distribution every 10K points during synthetic data generation.

The quality of clustering on the real data sets was measured using the sum of square distance (SSQ), defined as follows. Assume that there are a total of N points in the past horizon at current time Tc. For each point pi, we find the centroid Cpi of its closest macro-cluster, and compute d(pi, Cpi), the distance between pi and Cpi. Then the SSQ at time Tc with horizon H (denoted as SSQ(Tc, H)) is equal to the sum of d^2(pi, Cpi) for all the N points within the previous horizon H. Unless otherwise mentioned, the algorithm parameters were set at α = 2, l = 10, InitNumber = 2000, and t = 2.
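For clarity, the SSQ measure can be written as the following short routine (a sketch; macro_centroids are the centroids of the macro-clusters reported for the horizon, and points are the N stream points falling inside it):

```python
import math

def ssq(points, macro_centroids):
    """SSQ(Tc, H): sum of squared distances from each point in the horizon
    to the centroid of its closest macro-cluster."""
    total = 0.0
    for p in points:
        d = min(math.dist(p, c) for c in macro_centroids)
        total += d * d
    return total
```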

We compare the clustering quality of CluStream with that of STREAM for different horizons at different times using the Network Intrusion dataset and the Charitable Donation data set. The results are illustrated in Figures 2.4 and 2.5. We ran each algorithm 5 times and computed their average SSQs. The results show that CluStream is almost always better than STREAM. All experiments for these datasets have shown that CluStream has substantially higher quality than STREAM. However, the Network Intrusion data set showed significantly better results than the Charitable Donation data set, because the network intrusion data set is highly evolving. For such cases, the evolution-sensitive CluStream algorithm was much more effective than the STREAM algorithm.

We also tested the accuracy of the On-Demand Stream Classifier. The first test was performed on the Network Intrusion Data Set.


Figure 2.4. Quality comparison (Network Intrusion dataset, horizon=256, stream speed=200)

Figure 2.5. Quality comparison (Charitable Donation dataset, horizon=4, stream speed=200)


Figure 2.6. Accuracy comparison (Network Intrusion dataset, stream speed=80, buffer size=1600, kfit=80, init number=400)

Figure 2.7. Distribution of the (smallest) best horizon (Network Intrusion dataset, time units=2500, buffer size=1600, kfit=80, init number=400)

Figure 2.8. Accuracy comparison (Synthetic dataset B300kC5D20, stream speed=100, buffer size=500, kfit=25, init number=400)


Figure 2.9. Distribution of the (smallest) best horizon (Synthetic dataset B300kC5D20, time units=2000, buffer size=500, kfit=25, init number=400)

The first experiment was conducted with a stream speed of 80 connections per time unit (i.e., there are 40 training stream points and 40 test stream points per time unit). We set the buffer size at 1600 points, which means that upon receiving 1600 points (including both training and test stream points) we use a small set of the training data points (in this case kfit = 80) to choose the best horizon. We compared the accuracy of the On-Demand-Stream classifier with two simple one-pass stream classifiers, one over the entire data stream and one over a selected sliding window (i.e., sliding window H = 8). Figure 2.6 shows the accuracy comparison among the three algorithms. We can see that the On-Demand-Stream classifier consistently beats the two simple one-pass classifiers. For example, at time unit 2000, the On-Demand-Stream classifier's accuracy is about 4% higher than the classifier with the fixed sliding window, and is about 2% higher than the classifier using the entire dataset. Because the class distribution of this dataset evolves significantly over time, neither the entire dataset nor a fixed sliding window may always capture the underlying stream evolution. As a result, they always have worse accuracy than the On-Demand-Stream classifier, which dynamically chooses the best horizon for classification.

Figure 2.7 shows the distribution of the best horizons (they are the smallest ones if there exist several best horizons at the same time). Although about 78.4% of the (smallest) best horizons have a value of 1/4, there do exist about 21.6% best horizons ranging from 1/2 to 32 (e.g., about 6.4% of the best horizons have a value of 32). This also illustrates that there is no fixed sliding window that can achieve the best accuracy, and explains why the On-Demand-Stream classifier can outperform the simple one-pass classifiers over either the entire dataset or a fixed sliding window.

We have also generated one synthetic dataset B300kC5D20 to test the classification accuracy of these algorithms. This dataset contains 5 class labels and 300K data points with 20 dimensions.


Figure 2.10. Stream Processing Rate (Charitable Donation dataset, stream speed=2000)

Figure 2.11. Stream Processing Rate (Network Intrusion dataset, stream speed=2000)

We first set the stream speed at 100 points per time unit. Figure 2.8 shows the accuracy comparison among the three algorithms: the On-Demand-Stream classifier always has much better accuracy than the other two classifiers. Figure 2.9 shows the distribution of the (smallest) best horizons, which explains very well why the On-Demand-Stream classifier has better accuracy.

We also tested the efficiency of the micro-cluster maintenance algorithm with respect to STREAM on the real data sets. We note that this maintenance process needs to be performed both for the clustering and classification algorithms with minor differences. Therefore, we present the results for the case of clustering.


Figure 2.12. Scalability with Data Dimensionality (stream speed=2000)

Figure 2.13. Scalability with Number of Clusters (stream speed=2000)


By setting the number of micro-clusters to 10 times the number of natural clusters, Figures 2.10 and 2.11 show the stream processing rate (i.e., the number of points processed per second) against the running time for the two real data sets. Since CluStream requires some time to compute the initial set of micro-clusters, its processing rate is lower than that of STREAM at the very beginning. However, once steady state is reached, CluStream becomes faster than STREAM in spite of the fact that it needs to store the snapshots to disk periodically. This is because STREAM takes a few iterations to make the k-means clustering converge, whereas CluStream just needs to judge whether a set of points will be absorbed by the existing micro-clusters and insert them appropriately.

The key to the success of micro-cluster maintenance is high scalability. This is because this process is exposed to a potentially large volume of incoming data and needs to be implemented in an efficient and online fashion. The most time-consuming and frequent operation during micro-cluster maintenance is that of finding the closest micro-cluster for each newly arrived data point. It is clear that the complexity of this operation increases linearly with the number of micro-clusters. It is also evident that the number of micro-clusters maintained should be sufficiently larger than the number of input clusters in the data in order to obtain a high quality clustering. While the number of input clusters cannot be known a priori, it is instructive to examine the scalability behavior when the number of micro-clusters is fixed at a constant large factor of the number of input clusters. Therefore, for all the experiments in this section, we fix the number of micro-clusters to 10 times the number of input clusters. We present the scalability behavior of the CluStream algorithm with data dimensionality and with the number of natural clusters.

The first series of data sets was generated by varying the dimensionality from 10 to 80, while fixing the number of points and input clusters. The first data set series B100C5 indicates that it contains 100K points and 5 clusters. The same notational convention is used for the second data set series B200C10 and the third one B400C20. Figure 2.12 shows the experimental results, from which one can see that CluStream has linear scalability with data dimensionality. For example, for dataset series B400C20, when the dimensionality increases from 10 to 80, the running time increases less than 8 times, from 55 seconds to 396 seconds.

Another three series of datasets were generated to test the scalability against the number of clusters by varying the number of input clusters from 5 to 40, while fixing the stream size and dimensionality. For example, the first data set series B100D10 indicates that it contains 100K points and 10 dimensions. The same convention is used for the other data sets. Figure 2.13 demonstrates that CluStream has linear scalability with the number of input clusters.


7. Discussion

In this paper, we have discussed effective and efficient methods for clustering and classification of data streams. The techniques discussed in this paper utilize a micro-clustering approach in conjunction with a pyramidal time window. The technique can be used to cluster different kinds of data streams, as well as to create a classifier for the data. The methods have clear advantages over recent techniques which try to cluster the whole stream at one time rather than viewing the stream as a changing process over time. The CluStream model provides a wide variety of functionality in characterizing data stream clusters over different time horizons in an evolving environment.

This is achieved through a careful division of labor between the online statistical data collection component and an offline analytical component. Thus, the process provides considerable flexibility to an analyst in a real-time and changing environment. In order to achieve these goals, we needed to design the statistical storage process of the online component very carefully. The use of a pyramidal time window assures that the essential statistics of evolving data streams can be captured without sacrificing the underlying space- and time-efficiency of the stream clustering process.

The essential idea behind the CluStream model is to perform effective data summarization so that the underlying summary data can be used for a host of tasks such as clustering and classification. Therefore, the technique provides a framework upon which many other data mining tasks can be built.

Notes
1. Without loss of generality, we can assume that one unit of clock time is the smallest level of granularity. Thus, the 0-th order snapshots measure the time intervals at the smallest level of granularity.
2. If the micro-cluster contains fewer than 2 · m points, then we simply find the average timestamp of all points in the cluster.
3. The mean is equal to CF1t/n. The standard deviation is equal to √(CF2t/n − (CF1t/n)^2).

References

[1] Aggarwal C., Procopiuc C., Wolf J., Yu P., Park J.-S. (1999). Fast Algorithms for Projected Clustering. ACM SIGMOD Conference.

[2] Aggarwal C., Yu P. (2000). Finding Generalized Projected Clusters in High Dimensional Spaces. ACM SIGMOD Conference.

[3] Aggarwal C., Yu P. (2004). A Condensation Approach to Privacy Preserving Data Mining. EDBT Conference.

[4] Agrawal R., Gehrke J., Gunopulos D., Raghavan P. (1998). Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. ACM SIGMOD Conference.

[5] Aggarwal C. (2003). A Framework for Diagnosing Changes in Evolving Data Streams. ACM SIGMOD Conference.

[6] Aggarwal C., Han J., Wang J., Yu P. (2003). A Framework for Clustering Evolving Data Streams. VLDB Conference.

[7] Aggarwal C., Han J., Wang J., Yu P. (2004). On-Demand Classification of Evolving Data Streams. ACM KDD Conference.

[8] Aggarwal C., Han J., Wang J., Yu P. (2004). A Framework for Projected Clustering of High Dimensional Data Streams. VLDB Conference.

[9] Aggarwal C. (2006). On Futuristic Query Processing in Data Streams. EDBT Conference.

[10] Ankerst M., Breunig M., Kriegel H.-P., Sander J. (1999). OPTICS: Ordering Points To Identify the Clustering Structure. ACM SIGMOD Conference.

[11] Babcock B., Babu S., Datar M., Motwani R., Widom J. (2002). Models and Issues in Data Stream Systems. ACM PODS Conference.

[12] Bradley P., Fayyad U., Reina C. (1998). Scaling Clustering Algorithms to Large Databases. SIGKDD Conference.

[13] Cortes C., Fisher K., Pregibon D., Rogers A., Smith F. (2000). Hancock: A Language for Extracting Signatures from Data Streams. ACM SIGKDD Conference.

[14] Domingos P., Hulten G. (2000). Mining High-Speed Data Streams. ACM SIGKDD Conference.

[15] Duda R., Hart P. (1973). Pattern Classification and Scene Analysis, Wiley, New York.

[16] Farnstrom F., Lewis J., Elkan C. (2000). Scalability for Clustering Algorithms Revisited. SIGKDD Explorations, 2(1): pp. 51–57.

[17] Guha S., Mishra N., Motwani R., O'Callaghan L. (2000). Clustering Data Streams. IEEE FOCS Conference.

[18] Guha S., Rastogi R., Shim K. (1998). CURE: An Efficient Clustering Algorithm for Large Databases. ACM SIGMOD Conference.

[19] Hulten G., Spencer L., Domingos P. (2001). Mining Time Changing Data Streams. ACM KDD Conference.

[20] Jain A., Dubes R. (1998). Algorithms for Clustering Data, Prentice Hall, New Jersey.

[21] Kaufman L., Rousseeuw P. (1990). Finding Groups in Data - An Introduction to Cluster Analysis. Wiley Series in Probability and Math. Sciences.

[22] Ng R., Han J. (1994). Efficient and Effective Clustering Methods for Spatial Data Mining. Very Large Data Bases Conference.

[23] O'Callaghan L., Mishra N., Meyerson A., Guha S., Motwani R. (2002). Streaming-Data Algorithms For High-Quality Clustering. ICDE Conference.

[24] Zhang T., Ramakrishnan R., Livny M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM SIGMOD Conference.


Chapter 3

A SURVEY OF CLASSIFICATION METHODS IN DATA STREAMS

Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy
Caulfield School of Information Technology
Monash University, 900 Dandenong Rd, Caulfield East, Melbourne VIC 3145, Australia

Mohamed.Medhat.Gaber, Arkady.Zaslavsky, [email protected]

Abstract With the advance in both hardware and software technologies, automated data generation and storage has become faster than ever. Such data is referred to as data streams. Streaming data is ubiquitous today and it is often a challenging task to store, analyze and visualize such rapid large volumes of data. Most conventional data mining techniques have to be adapted to run in a streaming environment, because of the underlying resource constraints in terms of memory and running time. Furthermore, the data stream may often show concept drift, because of which adaptation of conventional algorithms becomes more challenging. One such important conventional data mining problem is that of classification. In the classification problem, we attempt to model the class variable on the basis of one or more feature variables. While this problem has been extensively studied from a conventional mining perspective, it is a much more challenging problem in the data stream domain. In this chapter, we will re-visit the problem of classification from the data stream perspective. The techniques for this problem need to be thoroughly re-designed to address the issues of resource constraints and concept drift. This chapter reviews the state-of-the-art techniques in the literature along with their corresponding advantages and disadvantages.

1. Introduction

Classification problems [19, 20] have been studied thoroughly as a major category of the data analysis tasks in machine learning, statistical inference [18] and data mining. Classification methods represent the set of supervised learning techniques where a set of dependent variables needs to be predicted based on another set of input attributes. There are two main distinctive approaches under the supervised learning category: classification and regression. Classification is mainly concerned with categorical attributes as dependent variables, whereas regression is concerned with numerical attributes as its output. The classification process is divided into two phases: model building and model testing. In model building, a learning algorithm runs over a data set to induce a model that could be used in estimating an output. The quality of this estimation is assessed in the model testing phase. The model building process is also referred to as training. Classification techniques [19, 20] have attracted the attention of researchers due to the significance of their applications. A variety of methods such as decision trees, rule based methods, and neural networks are used for the classification problem. Many of these techniques have been designed to build classification models from static data sets where several passes over the stored data are possible. This is not possible in the case of data streams, in which it is necessary to process the entire data set in one pass. Furthermore, the classification problem needs to be re-designed in the context of concept drift, a unique problem in the case of data streams.

The applications of data stream classification can vary from critical astronomical and geophysical applications [6] to real-time decision support in business and industrial applications [24, 25]. There are several potential scenarios for such applications. For example, classification and analysis of biosensor measurements around a city for security reasons is an important emerging application. The analysis of simulation results and on-board sensor readings in scientific applications has its potential in changing the mission plan or the experimental settings in real time. Web log and clickstream analysis is an important application in the electronic commerce domain. The classification of data streams generated from the marketplace, such as stock market streaming information, is another appealing application. Decision trees created from stock market data in a distributed streaming environment have been used in MobiMine [24, 25].

The process of adapting classification models to many of the above applications is often non-trivial. The most important challenge with regard to classification is that of concept drift in evolving data streams. The process of concept drift results from the natural tendency of the underlying data to evolve over time. The classifier is most likely to be outdated after a time window due to the continuous change of the streaming information on a temporal basis. We discuss this issue along with a number of other challenges for the classification problem. Solution approaches used in addressing these issues are summarized in order to emphasize their advantages and drawbacks. This summarization also provides some insight for other stream mining techniques due to the shared research issues across different applications. A thorough discussion of classification techniques in data streams is given as a guide to researchers as well as practitioners. The techniques are presented in an accessible way, with illustrative figures depicting each algorithm in diagrammatic form.

The chapter is organized as follows. Research issues with regard to the stream classification problem are discussed in section 2. Section 3 presents the approaches proposed as solutions to address these research issues. Section 4 provides a survey of the classification techniques in the stream mining literature. Techniques surveyed include Ensemble-based Classification [30], Very Fast Decision Trees (VFDT) [9] with its extensions [22], [23], On-Demand Classification [3], On-Line Information Network (OLIN) [26], Lightweight Classification (LWClass) [14], the Scalable Classification Algorithm by Learning decisiOn Patterns (SCALLOP) [12] and Adaptive Nearest Neighbor Classification for Data-streams (ANNCAD) [27]. This selection of techniques is based on the soundness of the techniques and how well they address important research challenges. Finally, the chapter is concluded with a summary in section 5.

2. Research Issues

In this section, we will address the primary research issues encountered in the context of stream mining. While many of these issues are shared across different stream mining applications, we discuss these issues with a special emphasis on the problem of classification [4, 9, 10, 13, 14, 16, 17, 21, 28].

High Speed Nature of Data Streams: The inherent characteristic of data streams is their high speed. The algorithm should be able to adapt to the high speed nature of streaming information. The rate of building a classification model should be higher than the data rate. Furthermore, it is not possible to scan the data more than once. This is referred to as the one-pass constraint.

Unbounded Memory Requirements: Classification techniques require data to be resident in memory for building the model. The huge amounts of data streams generated rapidly dictate the need for unbounded memory. This challenge has been addressed using load shedding, sampling, aggregation, and the creation of data synopses. The memory issue is an important motivation behind many of the techniques developed in the area.

Concept Drifting: Concept drift changes the classifier results over time. This is because of the change in the underlying data patterns. It is also referred to as data stream evolution [1]. This results in the model becoming stale and less relevant over time. The capture of such changes would help in updating the classifier model effectively. The use of an outdated model could lead to a very low classification accuracy.


Tradeoff between Accuracy and Efficiency: The main tradeoff in data stream mining algorithms is between the accuracy of the output with regard to the application and the time and space complexity. In many cases, approximation algorithms can guarantee error bounds, while maintaining a high level of efficiency.

Challenges in Distributed Applications: A significant number of data stream applications run in mobile environments with limited bandwidth, such as sensor networks and handheld devices. Thus knowledge structure representation is an important issue. After extracting models and patterns locally from data stream generators or receivers, it is important to transfer the data mining output to the user. The user could be a mobile user or a stationary one getting the results from mobile nodes. This is often a challenge because of the bandwidth limits in transferring data. Kargupta et al. [24] have addressed this problem by using Fourier transformations to efficiently represent decision trees for the purpose of transmission over limited bandwidth links.

Visualization of data stream mining results: Visualization of traditional data mining results on a desktop has been a research issue for more than a decade. Visualization of mining results on the small screen of a Personal Digital Assistant (PDA), for example, is a real challenge and an open research problem. Consider the scenario of a businessman on the move whose data are being streamed and analyzed on his PDA. The results of this analysis should be efficiently visualized in a way that allows him to take a quick decision. Pioneering work on the representation of decision trees on a mobile device has been suggested by Kargupta et al. [24].

Modelling change of mining results over time: In some cases, the user is not interested in the mining results themselves, but in how these results change over time. The classification changes could help in understanding the change in data streams over time.

Interactive Mining environment to satisfy user requirements: Mining data streams is a highly application oriented field. For example, the user should be able to change the classification parameters to serve the special needs of the user under the current context. The fast nature of data streams often makes it more difficult to incorporate user interaction.

The integration of data stream management systems and data stream mining approaches: The integration among storage, querying, mining and reasoning over the incoming stream would realize robust streaming systems that could be used in different applications [5, 7]. Along this line, current database management systems have achieved this goal over static stored data sets. However, this goal has not been fully realized for the case of data streams. An important future research issue is to integrate the stream mining algorithms with known stream management systems in order to design complete systems for stream processing.

Hardware and other Technological Issues: The technological issues of mining data streams are important ones. How do we represent the data in such an environment in a compressed way? Which platforms are best suited for such special real-time applications? Hardware issues are of special concern. Small devices generating data streams are not designed for complex computations. Currently emulators are used for such tasks, and this is a real burden for data stream mining applications which run in resource-constrained environments. Novel hardware solutions are required to address this issue.

Real time accuracy evaluation and formalization: In many cases, resource constrained methods work with a trade-off between the accuracy and efficiency of the designed method. Therefore, we need feedback on the currently achieved accuracy in relation to the available resources. This is needed to adjust the algorithm parameters according to the available resources. This formalization would also help in making decisions about the reliability of the output.

Among the above-mentioned issues, the first three are of special significance. Thus, we will use them as the basis for comparing different stream classification techniques in this chapter. We also note that many of these issues are shared among all mining techniques in a streaming environment. The following section concisely summarizes the approaches that are used as solutions to the above issues.

3. Solution Approaches

Many of the afore-mentioned issues can be solved using well-established statistical and computational approaches. While specific methods for stream classification will be discussed later, it is useful to understand the broad characteristics of the different methods which are used to adapt conventional classification techniques to the case of data streams. We can categorize these solutions as data-based and task-based ones. In data-based solutions, the idea is to examine only a subset of the whole data set or to transform the data vertically or horizontally to an approximate, smaller-size data representation. Such an approach allows us to apply many known data mining techniques to the case of data streams. On the other hand, in task-based solutions, standard algorithmic modification techniques can be used to achieve time and space efficient solutions [13]. Table 3.1 shows the data-based techniques, while Table 3.2 shows the task-based techniques.


Technique            Definition                            Pros                        Cons

Sampling             Choosing a data subset for analysis   Error bounds guaranteed     Poor for anomaly detection
Load Shedding        Ignoring a chunk of data              Efficient for queries       Very poor for anomaly detection
Sketching            Random projection on feature set      Extremely efficient         May ignore relevant features
Synopsis Structure   Quick transformation                  Analysis task independent   Not sufficient for very fast streams
Aggregation          Compiling summary statistics          Analysis task independent   May ignore relevant features

Table 3.1. Data Based Techniques

Technique                      Definition                                  Pros        Cons

Approximation Algorithms       Algorithms with error bounds                Efficient   Resource adaptivity with data rates not always possible
Sliding Window                 Analyzing most recent streams               General     Ignores part of stream
Algorithm Output Granularity   Highly resource-aware technique coping      General     Cost overhead of resource-aware component
                               with memory and fluctuating data rates

Table 3.2. Task Based Techniques

Each table provides a definition of each technique, along with its advantages and disadvantages.

While the methods in Tables 3.1 and 3.2 provide an overview of the broad methods which can be used to adapt conventional methods to classification, it is more useful to study specific techniques which are expressly designed for the purpose of classification. In the next section, we will provide a review of these methods.

4. Classification Techniques

This section reviews the state-of-the-art in data stream classification techniques. We provide an overview of some of the key methods and discuss how well they address the research problems outlined earlier.


4.1 Ensemble Based Classification

Wang et al. [30] have proposed a generic framework for mining concept drifting data streams. The framework is based on the observation that many data stream mining algorithms do not address the issue of concept drift in the evolving data. The idea is based on using an ensemble of classification models, such as decision trees built with C4.5, RIPPER, naïve Bayesian classifiers and others, which vote for the classification output in order to increase the accuracy of the predicted output.

This framework was developed to address three research challenges in data stream classification:

1. Concept Drift: The accuracy of the output of many classifiers is very sensitive to concept drift in evolving streams. At the same time, one does not want to remove excessive parts of the stream when there is no concept drift. Therefore, a method needs to be designed to decide which part of the stream should be used for the classification process.

2. Efficiency: The process of building classifiers is a computationally complex task, and the update of the model due to concept drift is a complicated process. This is especially relevant in the case of high speed data streams.

3. Robustness: Ensemble based classification has traditionally been used in order to improve robustness. The key idea is to avoid the problem of overfitting of individual classifiers. However, it is often a challenging task to use the ensemble effectively because of the high speed nature of the data streams.

An important motivation behind the framework is to deal with the expiration of old data streams. The idea of using only the most recent data to build and apply the classifiers may not be valid for most applications. Although old streams can affect the accuracy of the classification model in a negative way, it is still important to keep track of this data in the current model. The work in [30] shows that it is possible to use weighted ensemble classifiers in order to achieve this goal.

The work in [30] weights the classifiers in the ensemble according to the current accuracy of each classifier. The weight of each classifier is calculated and used in predicting the final output. The weight of each classifier may vary as the data stream evolves, and a given classifier may become more or less important on a particular sequential chunk of the data. The framework has experimentally outperformed single classifiers. This is partly because of the greater robustness of the ensemble, and partly because of more effective tracking of the change in the underlying structure of the data. More interesting variations of similar concepts may be found in [11]. Figure 3.1 depicts the proposed framework.
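As a concrete illustration of this idea, the following minimal sketch trains one classifier per arriving chunk, re-weights each existing classifier by its accuracy on the most recent chunk, and predicts by weighted voting. The chunking scheme, the choice of base learner (a scikit-learn decision tree) and all class and function names are illustrative assumptions, not the exact procedure of [30].

    # Hedged sketch of a chunk-based weighted ensemble (assumed details).
    from sklearn.tree import DecisionTreeClassifier
    import numpy as np

    class WeightedEnsemble:
        def __init__(self, max_members=10):
            self.members = []                  # list of (classifier, weight) pairs
            self.max_members = max_members

        def update(self, X_chunk, y_chunk):
            # Re-weight existing members by their accuracy on the newest chunk.
            self.members = [(clf, clf.score(X_chunk, y_chunk)) for clf, _ in self.members]
            # Train a new member on the newest chunk and give it full weight.
            clf = DecisionTreeClassifier().fit(X_chunk, y_chunk)
            self.members.append((clf, 1.0))
            # Keep only the highest-weighted members.
            self.members = sorted(self.members, key=lambda m: m[1], reverse=True)[:self.max_members]

        def predict(self, X):
            # Weighted vote over the ensemble members.
            votes = {}
            for clf, w in self.members:
                for i, label in enumerate(clf.predict(X)):
                    votes.setdefault(i, {}).setdefault(label, 0.0)
                    votes[i][label] += w
            return np.array([max(v, key=v.get) for _, v in sorted(votes.items())])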


4.2 Very Fast Decision Trees (VFDT)

Domingos and Hulten [9, 22] have developed a decision tree approach which is referred to as Very Fast Decision Trees (VFDT). It is a decision tree learning system based on Hoeffding trees. It splits the tree using the current best attribute, taking into consideration that the number of examples used satisfies the Hoeffding bound. Such a technique has the property that its output is (asymptotically) nearly identical to that of a conventional learner. VFDT is an extended version of such a method which can address the research issues of data streams. These research issues are:

Ties of attributes: Such ties occur when two or more attributes have close values of the splitting criterion, such as information gain or the gini index. We note that at such a moment of the decision tree growth phase, one must make a decision between two or more attributes based on only the set of records received so far. While it is undesirable to delay such split decisions indefinitely, we would like to make them at a point when the errors are acceptable.

Bounded memory: The tree can grow until the algorithm runs out of memory. This results in a number of issues related to effective maintenance of the tree.

Efficiency and Accuracy: This is an inherent characteristic of all data stream algorithms.

The extension of Hoeffding trees in VFDT has been done using the following techniques.

The key question during the construction of the decision tree is the choice of attributes to be used for splits. Approximate ties on attributes are broken using a user-specified threshold of acceptable error measure for the output. By using this approach, a crisp criterion can be determined on when a split (based on the inherently incomplete information from the current data stream) provides acceptable error. In particular, the Hoeffding inequality provides the necessary bound on the correctness of the choice of the split variable. It can be shown, for any small value of δ, that a particular choice of the split variable is the correct choice (the same as a conventional learner) with probability at least 1 − δ, if a sufficient number of stream records has been processed. This "sufficient number" increases at the relatively modest rate of log(1/δ). The bound on the accuracy of each split can then be extrapolated to the behavior of the entire decision tree. We note that the stream decision tree will provide the same result as the conventional decision tree, if for every node along the path for a given test instance, the same choice of split is used. This can be used to show that the behavior of the stream decision tree for a particular test instance differs from the conventional decision tree with probability at most δ/p, where p is the probability that a record is assigned to a leaf at each level.

Bounded memory has been addressed by de-activating the least promising leaves and ignoring the poor attributes. The identification of these poor attributes is done through the difference between the splitting criterion values of the highest and lowest attributes. If the difference is greater than a pre-specified value, the attribute with the lowest splitting measure is freed from memory.

The VFDT system is inherently I/O bound; in other words, the time for processing an example is lower than the time required to read it from disk. This is because of the Hoeffding tree-based approach with a crisp criterion for tree growth and splits. Such an approach can make clear decisions at various points of the tree construction algorithm without having to re-scan the data. Furthermore, the computation of the splitting criteria is done in a batch processing mode rather than online processing. This significantly saves the time of recalculating the criteria for all the attributes with each incoming record of the stream. The accuracy of the output can be further improved using multiple scans in the case of low data rates.

All of the above improvements have been tested using special synthetic data sets. The experiments have demonstrated the efficiency of these improvements. Figure 3.2 depicts the VFDT learning system. VFDT has been extended to address the problem of concept drift in evolving data streams. The new framework has been termed CVFDT [22]. It runs VFDT over fixed sliding windows in order to have the most up-to-date classifier. A change occurs when the splitting criterion changes significantly among the input attributes.
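The split decision at the heart of Hoeffding-tree learners can be illustrated with a small sketch. Assuming the splitting criterion (e.g., information gain) has range R and that n records have been observed at a leaf, the Hoeffding bound states that the true mean of the criterion lies within ε = sqrt(R² ln(1/δ) / (2n)) of its observed mean with probability at least 1 − δ; a split is therefore made once the gap between the best and second-best attribute exceeds ε. The helper below only illustrates this test; the function names and the tie-breaking threshold τ are assumptions rather than part of the published VFDT code.

    import math

    def hoeffding_bound(value_range, delta, n):
        # epsilon such that the true mean lies within epsilon of the
        # observed mean with probability at least 1 - delta.
        return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

    def should_split(best_gain, second_gain, value_range, delta, n, tau=0.05):
        eps = hoeffding_bound(value_range, delta, n)
        # Split if the best attribute is clearly better, or if the two are
        # so close (a "tie") that waiting longer is unlikely to help.
        return (best_gain - second_gain > eps) or (eps < tau)

    # Example: information gain in [0, 1], 2000 records at the leaf.
    print(should_split(0.32, 0.25, 1.0, delta=1e-6, n=2000))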

Jin and Agrawal [23] have extended the VFDT algorithm to efficiently process numerical attributes and to reduce the sample size calculated using the Hoeffding bound. The former objective has been addressed using their Numerical Interval Pruning (NIP) technique. The pruning is done by first creating a histogram for each interval of numbers. The least promising intervals to be branched are pruned to reduce the memory space. The experimental results show an average of 39% space reduction by using NIP. The reduction of the sample size is achieved by using properties of information gain functions. The method, derived using the multivariate delta method, guarantees a reduction of the sample size over the Hoeffding inequality with the same accuracy. The experiments show a reduction of 37% in the sample size by using the proposed method.


4.3 On Demand Classification

Aggarwal et al. have adopted the idea of micro-clusters introduced in CluStream [2] for On-Demand classification in [3]. The on-demand classification method divides the classification approach into two components. One component continuously stores summarized statistics about the data streams, and the second one continuously uses the summary statistics to perform the classification. The summary statistics are represented in the form of class-label specific micro-clusters. This means that each micro-cluster is associated with a specific class label which defines the class label of the points in it. We note that both components of the approach can be used in an online fashion, and therefore the approach is referred to as an on-demand classification method. This is because the set of test instances could arrive in the form of a data stream and can be classified efficiently on demand. At the same time, the summary statistics (and therefore the training model) can be efficiently updated whenever new data arrives. The great flexibility of such an approach can be very useful in a variety of applications.

At any given moment in time, the current set of micro-clusters can be used to perform the classification. The main motivation behind the technique is that the classification model should be defined over a time horizon which depends on the nature of the concept drift and data evolution. When there is less concept drift, we need a larger time horizon in order to ensure robustness. In the event of greater concept drift, we require smaller time horizons. One key property of micro-clusters (referred to as the subtractive property) ensures that it is possible to compute horizon-specific statistics. As a result, it is possible to perform the classification over a wide variety of time horizons. A hold-out training stream is used to decide the size of the horizon on which the classification is performed. By using a well-chosen horizon, it is possible to achieve a high level of classification accuracy. Figure 3.3 depicts the classification on demand framework.
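The subtractive property can be illustrated with a small sketch. If additive cluster statistics (counts, linear sums and squared sums of the points) are stored in periodic snapshots, then the statistics over a recent horizon h can be obtained by subtracting the snapshot taken at time t − h from the snapshot at time t. The snapshot layout and field names below are assumptions chosen for illustration, not the exact CluStream data structure.

    import numpy as np

    class MicroCluster:
        # Additive statistics of the points absorbed by this cluster.
        def __init__(self, dim):
            self.n = 0                    # number of points
            self.ls = np.zeros(dim)       # linear sum of points
            self.ss = np.zeros(dim)       # sum of squares of points

        def absorb(self, x):
            self.n += 1
            self.ls += x
            self.ss += x * x

        def subtract(self, older):
            # Subtractive property: statistics over the horizon between
            # the older snapshot and this one.
            diff = MicroCluster(len(self.ls))
            diff.n = self.n - older.n
            diff.ls = self.ls - older.ls
            diff.ss = self.ss - older.ss
            return diff

        def centroid(self):
            return self.ls / max(self.n, 1)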

4.4 Online Information Network (OLIN)

Last [26] has proposed an online classification system which can adapt to concept drift. The system re-builds the classification model with the most recent examples. By using the error rate as a guide to concept drift, the frequency of model building and the window size are adjusted over time.

The system uses info-fuzzy techniques for building a tree-like classification model. It uses information theory to calculate the window size. The main idea behind the system is to change the sliding window of the model reconstruction according to the classification error rate. If the model is stable, the window size increases, and thus the frequency of model building decreases. The info-fuzzy technique for building a tree-like classification model is referred to as the Info-Fuzzy Network (IFN).


A1          A2          ...   An          Class              Weight

Value(A1)   Value(A2)   ...   Value(An)   Class Category X   Weight = # items contributing
...         ...         ...   ...         ...                ...

Table 3.3. Typical LWClass Training Results

The tree is different from conventional decision trees in that each level of the tree represents only one attribute, except the root node layer. The nodes represent the different values of the attribute. The process of inducing the class label is similar to that of conventional decision trees. The process of constructing this tree has been termed Information Network (IN). The IN technique uses a procedure similar to that of building conventional decision trees, by determining whether the split on an attribute would decrease the entropy or not. The measure used is the mutual conditional information, which assesses the dependency between the current input attribute under examination and the output attribute. At each iteration, the algorithm chooses the attribute with the maximum mutual information and adds a layer in which each node represents a different value of this attribute. The iterations stop once there is no increase in the mutual information measure for any of the remaining attributes that have not been considered in the tree. The OLIN system repeatedly uses the IN algorithm for building a new classification model. The system uses information theory to calculate the window size (which refers to the number of examples). It uses a less conservative measure than the Hoeffding bound used in VFDT [9, 22], reviewed earlier in this chapter. This measure is derived from the mutual conditional information in the IN algorithm by applying the likelihood ratio test to assess the statistical significance of the mutual information. Subsequently, the window size of the model reconstruction is changed according to the classification error rate. The error rate is calculated by measuring the difference between the error rate during training on the one hand and the error rate during model validation on the other hand. A significant increase in the error rate indicates a high probability of a concept drift. The window size changes according to the value of this increase. Figure 3.4 shows a simple flow chart of the OLIN system.
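A minimal sketch of this adaptive-window idea is given below: the window grows while the validation error stays close to the training error, and shrinks when the gap becomes significant. The specific thresholds, growth factors and function name are illustrative assumptions, not the actual OLIN formulas, which are based on the likelihood ratio test over mutual information.

    def adjust_window(window_size, train_error, valid_error,
                      drift_threshold=0.05, min_size=100, max_size=10000):
        # Grow the window under a stable concept, shrink it when the
        # validation error rises well above the training error.
        gap = valid_error - train_error
        if gap > drift_threshold:          # likely concept drift
            window_size = max(min_size, window_size // 2)
        else:                              # stable model
            window_size = min(max_size, int(window_size * 1.5))
        return window_size

    # Example: stable concept, then a drift.
    w = 1000
    w = adjust_window(w, train_error=0.10, valid_error=0.11)   # grows
    w = adjust_window(w, train_error=0.10, valid_error=0.25)   # shrinks
    print(w)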

4.5 LWClass Algorithm

Gaber et al. [14] have proposed lightweight classification techniques termed LWClass. LWClass is based on Algorithm Output Granularity. The algorithm output granularity (AOG) introduces the first resource-aware data analysis approach that can cope with fluctuating data rates according to the available memory and the processing speed. The AOG performs the local data analysis on resource constrained devices that generate or receive streams of information. AOG has three stages of mining, adaptation and knowledge integration, as shown in Figure 3.5 [14].

LWClass starts by determining the number of instances that could be resident in memory according to the available space. Once a classified data record arrives, the algorithm searches for the nearest instance already stored in main memory. This is done using a pre-specified distance threshold. This threshold represents the similarity measure acceptable by the algorithm to consider two or more data records as a single entry in a matrix. This matrix is a summarized version of the original data set. If the algorithm finds a nearest neighbor, it checks the class label. If the class label is the same, it increases the weight of this instance by one; otherwise it decrements the weight by one. If the weight is decremented down to zero, this entry is released from memory, conserving the limited memory of streaming applications. The algorithm output granularity is controlled by the distance threshold value, which changes over time to cope with the high speed of the incoming data elements. The algorithm procedure can be described as follows:

1. Each record in the data stream contains attribute values for attributes a1, a2, ..., an and the class category.

2. According to the data rate and the available memory, the algorithm output granularity is applied as follows:

2.1 Measure the distance between the new record and the stored ones.

2.2 If the distance is less than a threshold, store the average of these two records and increase the weight of this average entry by 1. (The threshold value determines the algorithm accuracy and is chosen according to the available memory and the data rate that determines the algorithm rate.) This is the case where both items have the same class category. If they have different class categories, the weight is decreased by 1, and the entry is released from memory if the weight reaches zero.

2.3 After a time threshold for the training, we obtain a matrix of the form shown in Table 3.3.

3. Using Table 3.3, the unlabeled data records can be classified as follows. According to the available time for the classification process, we choose the nearest K table entries; the number of entries is variable according to the time available to the process.

4. Find the majority class category, taking into account the calculated weights of the K entries. This is the output of the classification task. A simplified sketch of this procedure is given after this list.
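The following minimal sketch illustrates the weighted summary matrix and the K-entry majority vote described above. The distance function, data layout and parameter names are assumptions made for illustration; they are not the exact LWClass implementation of [14].

    import numpy as np

    class LWClassSketch:
        def __init__(self, threshold):
            self.threshold = threshold
            self.entries = []                    # list of [vector, label, weight]

        def train_one(self, x, label):
            x = np.asarray(x, dtype=float)
            if self.entries:
                dists = [np.linalg.norm(x - e[0]) for e in self.entries]
                i = int(np.argmin(dists))
                if dists[i] < self.threshold:
                    e = self.entries[i]
                    if e[1] == label:
                        e[0] = (e[0] + x) / 2.0   # store the average of the two records
                        e[2] += 1
                    else:
                        e[2] -= 1
                        if e[2] <= 0:
                            self.entries.pop(i)   # release the entry from memory
                    return
            self.entries.append([x, label, 1])

        def classify(self, x, k=3):
            x = np.asarray(x, dtype=float)
            nearest = sorted(self.entries, key=lambda e: np.linalg.norm(x - e[0]))[:k]
            votes = {}
            for _, label, w in nearest:
                votes[label] = votes.get(label, 0) + w
            return max(votes, key=votes.get)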


4.6 ANNCAD Algorithm

Law et al. [27] have proposed an incremental classification algorithm termed Adaptive Nearest Neighbor Classification for Data-streams (ANNCAD). The algorithm uses the Haar wavelet transformation for multi-resolution data representation. A grid-based representation at each level is used.

The process of classification starts with attempting to classify the data record according to the majority of its nearest neighbors at the finer levels. If the finer levels are unable to differentiate between the classes with a pre-specified threshold, the coarser levels are used in a hierarchical way. To address the concept drift problem of evolving data streams, an exponential fade factor is used to decrease the weight of old data in the classification process. Ensemble classifiers are used to overcome the errors of the initial quantization of the data. Figure 3.6 depicts the ANNCAD framework.

Experimental results over real data sets have demonstrated improved accuracy over VFDT and CVFDT, discussed earlier in this section. The drawback of this technique is its inability to deal with sudden concept drifts, as the exponential fade factor takes a while to have its effect felt. In fact, the choice of the exponential fade factor is an inherent flexibility which could lead to over-estimation or under-estimation of the rate of concept drift. Both errors would result in a reduction in accuracy.

4.7 SCALLOP Algorithm

Ferrer-Troyano et al. [12] have proposed a scalable classification algorithm for numerical data streams. This is one of the few rule-based classifiers for data streams. It is inherently difficult to construct rule-based classifiers for data streams, because of the difficulty in maintaining the underlying rule statistics. The algorithm has been termed Scalable Classification Algorithm by Learning decisiOn Patterns (SCALLOP).

The algorithm starts by reading a number of user-specified labeled records. A number of rules are created for each class from these records. Subsequently, the key issue is to effectively maintain the rule set after the arrival of each new record. On the arrival of a new record, there are three cases:

a) Positive covering: This is the case of a new record that strengthens a currently discovered rule.

b) Possible expansion: This is the case of a new record that is associated with at least one rule, but is not covered by any currently discovered rule.

c) Negative covering: This is the case of a new record that weakens a currently discovered rule.

For each of the above cases, a different procedure is used as follows:


a) Positive covering: The positive support and confidence of the existing rule are re-calculated.

b) Possible expansion: In this case, the rule is extended if it satisfies two conditions:

– It is bounded within user-specified growth bounds, to avoid a possibly wrong expansion of the rule.

– There is no intersection between the expanded rule and any already discovered rule associated with the same class label.

c) Negative covering: In this case, the negative support and confidence are re-calculated. If the confidence is less than a minimum user-specified threshold, a new rule is added.

After reading a pre-defined number of records, a process of rule refinement is performed. Rules of the same class that are within a user-defined acceptable distance measure are merged. At the same time, care is taken to ensure that these rules do not intersect with rules associated with other class labels. The resulting hypercube of the merged rules should also be within certain growth bounds. The algorithm also has a refinement stage, which releases the uninteresting rules from the current model. In particular, the rules that have less than the minimum positive support are released. Furthermore, the rules that are not covered by at least one of the records among the last user-defined number of received records are released. Figure 3.7 shows an illustration of the basic process.

Finally, a voting-based classification technique is used to classify the unlabeled records. If there is a rule that covers the current record, the label associated with that rule is used as the classifier output. Otherwise, a vote over the current rules within the growth bounds is used to infer the class label. A simplified sketch of this voting step is given below.
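The following small sketch illustrates only the final voting step, under the simplifying assumption that each rule is a hyper-rectangle over the numerical attributes. The rule representation, the closeness-weighted vote and all names here are illustrative assumptions rather than the SCALLOP data structures of [12].

    def covers(rule, x):
        # rule: dict with 'lower' and 'upper' bounds per attribute and a 'label'.
        return all(lo <= xi <= hi for xi, lo, hi in zip(x, rule["lower"], rule["upper"]))

    def classify(rules, x):
        # Direct cover: use the covering rule's label.
        for rule in rules:
            if covers(rule, x):
                return rule["label"]
        # Otherwise, vote among the rules, weighted (as a simplifying assumption)
        # by how close the record is to each rule's bounding box.
        votes = {}
        for rule in rules:
            dist = sum(max(lo - xi, 0, xi - hi)
                       for xi, lo, hi in zip(x, rule["lower"], rule["upper"]))
            votes[rule["label"]] = votes.get(rule["label"], 0.0) + 1.0 / (1.0 + dist)
        return max(votes, key=votes.get)

    rules = [{"lower": [0, 0], "upper": [5, 5], "label": "A"},
             {"lower": [6, 6], "upper": [9, 9], "label": "B"}]
    print(classify(rules, [7, 8]))   # covered by the second rule -> "B"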

5. Summary

Stream classification techniques have several important applications in business, industry and science. This chapter reviews the research problems in data stream classification. Several approaches in the literature have been summarized along with their advantages and drawbacks. While the selection of the techniques is based on their performance and the quality with which they address the research challenges, there are a number of other methods [11, 8, 15, 22, 31] which we have not discussed in greater detail in this chapter. Many of these techniques are developed along similar lines to one or more of the techniques presented in this chapter.

The major research challenges in data stream classification are represented by concept drift, resource adaptivity, high data rates, and unbounded memory requirements.


Method                          Concept Drift   High Speed   Memory Req.

Ensemble-based Classification         X
VFDT                                                  X            X
On-Demand Classification              X               X            X
Online Information Network            X
LWClass                                               X            X
ANNCAD                                X
SCALLOP                               X

Table 3.4. Summary of Reviewed Techniques

Figure 3.1. The ensemble based classification method

While many methods have been proposed to address some of these issues, they are often unable to address all of them simultaneously. Table 3.4 summarizes the previously reviewed techniques in terms of how they address the above challenges. The area of data stream classification is still in its infancy. A number of open challenges remain in stream classification algorithms, particularly with respect to concept drift and resource adaptive classification.

References

[1] Aggarwal C. (2003) A Framework for Diagnosing Changes in Evolving Data Streams. Proceedings of the ACM SIGMOD Conference.


Figure 3.2. VFDT Learning Systems

Figure 3.3. On Demand Classification


Figure 3.4. Online Information Network System

Figure 3.5. Algorithm Output Granularity


Figure 3.6. ANNCAD Framework

Figure 3.7. SCALLOP Process


[2] Aggarwal C., Han J., Wang J., Yu P. S. (2003) A Framework for Clustering Evolving Data Streams, Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB'03), Berlin, Germany, Sept. 2003.

[3] Aggarwal C., Han J., Wang J., Yu P. S. (2004) On Demand Classification of Data Streams, Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA.

[4] Babcock B., Babu S., Datar M., Motwani R., and Widom J. (2002) Models and issues in data stream systems. In Proceedings of PODS.

[5] Babcock B., Datar M., and Motwani R. (2003) Load Shedding Techniques for Data Stream Systems (short paper). In Proc. of the 2003 Workshop on Management and Processing of Data Streams (MPDS 2003).

[6] Burl M., Fowlkes C., Roden J., Stechert A., and Mukhtar S. (1999) Diamond Eye: A distributed architecture for image data mining, in SPIE DMKD, Orlando.

[7] Cai Y. D., Clutter D., Pape G., Han J., Welge M., Auvil L. (2004) MAIDS: Mining Alarming Incidents from Data Streams. Proceedings of the 23rd ACM SIGMOD (International Conference on Management of Data).

[8] Ding Q., Ding Q., and Perrizo W. (2002) Decision Tree Classification of Spatial Data Streams Using Peano Count Trees, Proceedings of the ACM Symposium on Applied Computing, Madrid, Spain, pp. 413–417.

[9] Domingos P. and Hulten G. (2000) Mining High-Speed Data Streams. In Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining.

[10] Dong G., Han J., Lakshmanan L. V. S., Pei J., Wang H. and Yu P. S. (2003) Online mining of changes from data streams: Research problems and preliminary results. In Proceedings of the 2003 ACM SIGMOD Workshop on Management and Processing of Data Streams.

[11] Fan W. (2004) Systematic data selection to mine concept-drifting data streams. ACM KDD Conference, pp. 128-137.

[12] Ferrer-Troyano F. J., Aguilar-Ruiz J. S. and Riquelme J. C. (2004) Discovering Decision Rules from Numerical Data Streams, ACM Symposium on Applied Computing, pp. 649-653.

[13] Gaber M. M., Zaslavsky A., and Krishnaswamy S. (2005) Mining Data Streams: A Review. ACM SIGMOD Record, Vol. 34, No. 1, June 2005, ISSN: 0163-5808.

[14] Gaber M. M., Krishnaswamy S., and Zaslavsky A. (2005) On-board Mining of Data Streams in Sensor Networks, accepted as a chapter in the forthcoming book Advanced Methods of Knowledge Discovery from Complex Data, (Eds.) Sanghamitra Badhyopadhyay, Ujjwal Maulik, Lawrence Holder and Diane Cook, Springer Verlag, to appear.


[15] Gama J., Rocha R. and Medas P. (2003) Accurate Decision Trees for Mining High-Speed Data Streams, Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining.

[16] Garofalakis M., Gehrke J., Rastogi R. (2002) Querying and mining data streams: you only get one look (a tutorial). SIGMOD Conference, 635.

[17] Golab L. and Ozsu T. M. (2003) Issues in Data Stream Management. In SIGMOD Record, Volume 32, Number 2, pp. 5–14.

[18] Hand D. J. (1999) Statistics and Data Mining: Intersecting Disciplines. ACM SIGKDD Explorations, 1, 1, pp. 16-19.

[19] Hand D. J., Mannila H., and Smyth P. (2001) Principles of data mining, MIT Press.

[20] Hastie T., Tibshirani R., Friedman J. (2001) The elements of statistical learning: data mining, inference, and prediction, New York: Springer.

[21] Henzinger M., Raghavan P. and Rajagopalan S. (1998) Computing on data streams, Technical Note 1998-011, Digital Systems Research Center, Palo Alto, CA.

[22] Hulten G., Spencer L., and Domingos P. (2001) Mining Time-Changing Data Streams. ACM SIGKDD Conference.

[23] Jin R. and Agrawal G. (2003) Efficient Decision Tree Construction on Streaming Data, in Proceedings of ACM SIGKDD Conference.

[24] Kargupta H., Park B., Pittie S., Liu L., Kushraj D. and Sarkar K. (2002) MobiMine: Monitoring the Stock Market from a PDA. ACM SIGKDD Explorations, Volume 3, Issue 2, pp. 37–46. ACM Press.

[25] Kargupta H., Bhargava R., Liu K., Powers M., Blair S., Bushra S., Dull J., Sarkar K., Klein M., Vasa M., and Handy D. (2004) VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring. Proceedings of the SIAM International Conference on Data Mining.

[26] Last M. (2002) Online Classification of Nonstationary Data Streams, Intelligent Data Analysis, Vol. 6, No. 2, pp. 129-147.

[27] Law Y., Zaniolo C. (2005) An Adaptive Nearest Neighbor Classification Algorithm for Data Streams, Proceedings of the 9th European Conference on the Principles and Practice of Knowledge Discovery in Databases, Springer Verlag, Porto, Portugal.

[28] Muthukrishnan S. (2003) Data streams: algorithms and applications. Proceedings of the fourteenth annual ACM-SIAM symposium on discrete algorithms.

[29] Park B. and Kargupta H. (2002) Distributed Data Mining: Algorithms, Systems, and Applications. To be published in the Data Mining Handbook, Editor: Nong Ye.


[30] Wang H., Fan W., Yu P. and Han J. (2003) Mining Concept-Drifting Data Streams using Ensemble Classifiers, in the 9th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington DC, USA.

[31] Wang K., Zhou S., Fu A., Yu J. (2003) Mining changes of classification by correspondence tracing. SIAM International Conference on Data Mining.


Chapter 4

FREQUENT PATTERN MINING IN DATA STREAMS

Ruoming Jin
Computer Science Department
Kent State University

[email protected]

Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University

[email protected]

Abstract Frequent pattern mining is a core data mining operation and has been extensively studied over the last decade. Recently, mining frequent patterns over data streams has attracted a lot of research interest. Compared with other streaming queries, frequent pattern mining poses great challenges due to high memory and computational costs, and the accuracy requirements on the mining results.

In this chapter, we overview the state-of-the-art techniques to mine frequent patterns over data streams. We also introduce a new approach for this problem, which makes two major contributions. First, this one-pass algorithm for frequent itemset mining has deterministic bounds on the accuracy, and does not require any out-of-core summary structure. Second, because the one-pass algorithm does not produce any false negatives, it can be easily extended to a two-pass accurate algorithm. The two-pass algorithm is very memory efficient.

1. Introduction

Frequent pattern mining focuses on discovering frequently occurring patterns from different types of datasets, including unstructured ones, such as transaction and text datasets, semi-structured ones, such as XML datasets, and structured ones, such as graph datasets. The patterns can be itemsets, sequences, subtrees, or subgraphs, etc., depending on the mining tasks and the targeted datasets. Frequent patterns can not only effectively summarize the underlying datasets, providing key insights into the data, but also serve as the basic tool for many other data mining tasks, including association rule mining, classification, clustering, and change detection among others [21, 37, 20, 24].

Many efficient frequent pattern algorithms have been developed in the last decade [1, 17, 18, 35, 26, 33, 36]. These algorithms typically require datasets to be stored in persistent storage and involve two or more passes over the dataset. Recently, there has been much interest in data arriving in the form of continuous and infinite data streams. In a streaming environment, a mining algorithm must take only a single pass over the data [4]. Such algorithms can only guarantee an approximate result.

Compared with other stream processing tasks, the unique challenges in discovering frequent patterns are three-fold. First, frequent pattern mining needs to search a space with an exponential number of patterns. The cardinality of the answering set itself, which contains all frequent patterns, can be very large too. In particular, it can cost much more space to generate an approximate answering set for frequent patterns in a streaming environment. Therefore, the mining algorithm needs to be very memory-efficient. Second, frequent pattern mining relies on the downward closure property to prune infrequent patterns and generate the frequent ones. This process (even without the streaming constraint) is very compute-intensive. Consequently, keeping pace with high-speed data streams can be very hard for a frequent pattern-mining task. Given these challenges, an even more important issue is the quality of the approximate mining results. More accurate results usually require more memory and computation. What should be the acceptable mining results to a data miner? To deal with this problem, a mining algorithm needs to provide users the flexibility to control the accuracy of the final mining results.

In the last several years, several new mining algorithms have been proposed to find frequent patterns over data streams. In the next section, we will overview these new algorithms.

2. Overview

2.1 Frequent Pattern Mining: Problem Definition

Let the dataset D be a collection of objects, i.e., D = {o_1, o_2, ..., o_{|D|}}. Let P be the set of all possible (interesting) patterns occurring in D, and let g be the counting function g : P × O → N, where O is the set of objects and N is the set of nonnegative integers. Given parameters p ∈ P and o ∈ O, g(p, o) returns the number of times p occurs in o. The support of a pattern p ∈ P in the dataset D is defined as

    supp(p) = \sum_{j=1}^{|D|} I(g(p, o_j))


where I is an indicator function: if g(p, o_j) > 0, then I(g(p, o_j)) = 1; otherwise, I(g(p, o_j)) = 0. Given a support level θ, the frequent patterns of P in D are the set of patterns in P which have support greater than or equal to θ.

The first and arguably the most important frequent pattern-mining task is frequent itemset mining, proposed by Rakesh Agrawal et al. in 1993 [2]. In this setting, the objects in the dataset D are transactions, or sets of items. Let Item be the set of all possible items in the dataset D. Then the dataset D can be represented as D = {I_1, ..., I_{|D|}}, where I_j ⊆ Item, ∀j, 1 ≤ j ≤ |D|. The set of all possible patterns P is the power-set of Item. Note that the set of all possible objects O is the same as P in this setting. The counting function g is defined based on the set containment (⊆) relationship. In other words, if the itemset p is contained in I_j (p ⊆ I_j), the function g(p, I_j) returns 1; otherwise, it returns 0. For instance, given a dataset D = {{A,B,D,E}, {B,C,E}, {A,B,E}, {A,B,C}, {A,C}, {B,C}} and a support level θ = 50%, the frequent patterns are {A}, {B}, {C}, and {B,C}.
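A small sketch may help make the definitions concrete. The code below counts the support of candidate itemsets over a list of transactions and keeps those whose relative support reaches θ; it simply enumerates the candidates by brute force rather than implementing a full Apriori-style search, and the function name, the size limit and the toy dataset in the usage example are illustrative assumptions.

    from itertools import combinations

    def frequent_itemsets(transactions, theta, max_size=2):
        n = len(transactions)
        items = sorted({i for t in transactions for i in t})
        result = {}
        # Enumerate all candidate itemsets up to max_size (brute force for illustration).
        for k in range(1, max_size + 1):
            for cand in combinations(items, k):
                support = sum(1 for t in transactions if set(cand) <= t)
                if support / n >= theta:
                    result[cand] = support
        return result

    D = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}, {"B"}]
    print(frequent_itemsets(D, theta=0.5))
    # {('A',): 2, ('B',): 4, ('C',): 2, ('A', 'B'): 2, ('B', 'C'): 2}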

The majority of the work in mining frequent patterns over data streams focuses on frequent itemset mining. Many techniques developed in this setting can serve as a basis for other, more complicated pattern mining tasks, such as graph mining [21]. To simplify the discussion, this chapter will focus on mining frequent itemsets over data streams.

2.2 Data Streams

In a data stream, transactions arrive continuously and the volume of transactions can be potentially infinite. Formally, a data stream D can be defined as a sequence of transactions, D = (t_1, t_2, ..., t_i, ...), where t_i is the i-th arrived transaction. To process and mine data streams, different window models are often used. A window is a subsequence between the i-th and j-th arrived transactions, denoted as W[i, j] = (t_i, t_{i+1}, ..., t_j), i ≤ j. A user can ask different types of frequent pattern-mining questions over different types of window models.

Landmark window: In this model, we are interested in the frequent itemsets from a starting timepoint i to the current timepoint t. In other words, we are trying to find the frequent itemsets over the window W[i, t]. A special case of the landmark window is when i = 1. In this case, we are interested in the frequent itemsets over the entire data stream. Clearly, the difficulty in solving the special case is essentially the same as in the more general cases, and all of them require an efficient single-pass mining algorithm. For simplicity, we will focus on the case where the Entire Data Stream is the target.

Note that in this model, we treat each time-point after the starting point as equally important. However, in many cases, we are more interested in the recent time-points. The following two models focus on such cases:


Sliding window: Given the length of the sliding window w and the current timepoint t, we are interested in the frequent patterns in the window W[t − w + 1, t]. As time passes, the window keeps its size and moves along with the current time point. In this model, we are not interested in the data which arrived before the timepoint t − w + 1.

Damped window model: This model assigns more weight to recently arrived transactions. A simple way to do that is to define a decay rate [7], and use this rate to update (by multiplication) the weights of the previously arrived transactions as a new transaction arrives. Correspondingly, the count of an itemset is also defined based on the weight of each transaction.

In the next subsection, we will overview the algorithms for mining frequent itemsets under these three different window models over data streams. In addition, we would like to point out that besides the three windows introduced above, Jiawei Han et al. proposed another model called the tilted-time window model. In this model, we are interested in frequent itemsets over a set of windows. Each window corresponds to a different time granularity based on its recency. For example, we may be interested in every minute of the last hour, every five minutes of the previous hour, every ten minutes of the hour before that, and so on. Moreover, the transactions inside each window are also weighted. Such a model allows us to pose more complicated queries over a data stream. Giannella et al. have developed a variant of the FP-tree, called FP-stream, for dynamically updating frequent patterns on streaming data and answering approximate frequent itemset queries even for arbitrary time intervals [15].

2.3 Mining Algorithms

                       All                                Closed

Entire Data Stream     Lossy Counting [28], FPDM [34]     –
Sliding Window         –                                  Moment [11]
Damped Window          estDec [7]                         –

Closed: Closed frequent itemsets

Table 4.1. Algorithms for Frequent Itemsets Mining over Data Streams

Table 4.1 lists the algorithms which have been proposed for mining frequent itemsets over data streams in the last several years. Note that All indicates finding all of the frequent itemsets at a given support level θ. Closed frequent itemsets are itemsets that are frequent but have higher frequency than all of their supersets. (If an itemset p is a subset of an itemset q, then q is called a superset of p.) In the following, we will briefly introduce these algorithms and their basic ideas.

Mining Algorithms for the Entire Data Stream: Manku and Motwani proposed the first one-pass algorithm, Lossy Counting, to find all frequent itemsets over a data stream [28]. Their algorithm is false-positive oriented in the sense that it does not allow false negatives, and has a provable bound on false positives. It uses a user-defined error parameter ǫ to control the quality of the answering set for a given support level θ. More precisely, its answering set is guaranteed to contain all itemsets whose frequency exceeds θ, and to contain no itemsets whose true frequency is less than θ − ǫ. In other words, the itemsets whose frequency is between θ − ǫ and θ may possibly appear in the answering set, and these are the false positives.

Recently, Yu and his colleagues proposed FPDM, which is a false-negative oriented approach for mining frequent itemsets over data streams [34]. Their algorithm does not allow false positives, and has a high probability of finding the itemsets which are truly frequent. In particular, they use a user-defined parameter δ to control the probability of finding the frequent itemsets at support level θ. Specifically, the answering set does not include any itemsets whose frequency is less than θ, and includes any itemset whose frequency exceeds θ with probability of at least 1 − δ. It utilizes the Chernoff bound to achieve such quality control for the answering set.

Both algorithms logically partition the data stream into equally sized segments, and find the potentially frequent itemsets for each segment. They aggregate these locally frequent itemsets and further prune the infrequent ones. However, the number of transactions in each segment, as well as the method used to define potentially frequent itemsets, is different for the two methods. In Lossy Counting, the number of transactions in a segment is k × ⌈1/ǫ⌉, and an itemset which occurs more than k times in a segment is potentially frequent. In FPDM, the number of transactions in a segment is k × n_0, where n_0 is the required number of observations in order to achieve the Chernoff bound with the user-defined parameter δ. In this case, an itemset is potentially frequent if its frequency exceeds θ − ǫ, where ǫ is computed by the Chernoff bound in terms of δ and the number of observations (k × n_0). Note that k is a parameter (batch size) to control the size of each segment.

To theoretically estimate the space requirement of both algorithms, we consider the case where each transaction includes only a single item, and the number of transactions in the entire data stream is |D|. Lossy Counting will take O((1/ǫ) log(ǫ|D|)) space to find frequent items (1-itemsets), and FPDM-1 (the simple version of FPDM for finding frequent items) will need O((2 + 2 ln(2/δ))/θ) space.
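To make the flavor of these guarantees concrete, the following minimal sketch implements Lossy Counting restricted to single items (the 1-itemset case mentioned above), maintaining (count, error) pairs over buckets of width ⌈1/ǫ⌉. The structure and names are a simplified illustration of the algorithm in [28], not the full itemset version.

    import math

    def lossy_count_items(stream, theta, eps):
        width = math.ceil(1.0 / eps)          # bucket width
        counts = {}                           # item -> [count, max_error]
        n = 0
        for x in stream:
            n += 1
            bucket = math.ceil(n / width)
            if x in counts:
                counts[x][0] += 1
            else:
                counts[x] = [1, bucket - 1]
            if n % width == 0:                # end of a bucket: prune low counts
                for item in list(counts):
                    c, err = counts[item]
                    if c + err <= bucket:
                        del counts[item]
        # Output items whose recorded frequency exceeds (theta - eps) * n.
        return {item: c for item, (c, err) in counts.items() if c >= (theta - eps) * n}

    stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a", "b"]
    print(lossy_count_items(stream, theta=0.3, eps=0.1))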

Note that different approaches have different advantages and disadvantages. For instance, for the false-positive approach, if a second pass is allowed, we can easily eliminate the false positives. For the false-negative approach, we obtain a small answering set which contains almost all the frequent itemsets, but might miss some of them (with a very small probability controlled by δ).

Sliding Window: Chi et al. have studied the problem of mining closed frequent itemsets over a sliding window of a data stream [12]. In particular, they assume that the width of the sliding window is not very large, so that the transactions of each sliding window can be held in main memory. Clearly, such an assumption is very close to the problem setting of incremental association rule mining [10]. But their focus is on how to maintain the closed frequent itemsets in an efficient way.

To deal with this problem, they proposed a new mining algorithm, called MOMENT. It utilizes the heuristic that, in most cases, the sets of frequent itemsets are relatively stable for consecutive sliding windows in the data stream. Specifically, such stability can be expressed as the fact that the boundary between the frequent itemsets and the infrequent itemsets, and the boundary between the closed frequent itemsets and the rest of the itemsets, move very slowly. Therefore, instead of generating all closed frequent itemsets for each window, they focus on monitoring such boundaries. As the key of this algorithm, an in-memory data structure, the closed enumeration tree (CET), is developed to efficiently monitor closed frequent itemsets as well as itemsets that form the boundary between the closed frequent itemsets and the rest of the itemsets. An efficient mechanism has been proposed to update the CET as the sliding window moves, so that the boundary is maintained for each sliding window.

Damped Window Model: Chang and Lee studied the problem of finding recently frequent itemsets over data streams using the damped window model. Specifically, in their model, the weight of an existing transaction in the data stream is reduced by a decay factor d as a new transaction arrives. For example, a newly arrived transaction has an initial weight of 1, and after another transaction arrives, its weight is reduced to d = (1 × d).

To keep track of the frequent itemsets in such a setting, they propose a new algorithm, estDec, which processes the transactions one by one. It maintains a lattice for recording the potentially frequent itemsets and their counts, and updates the lattice for each new transaction correspondingly. Note that, theoretically, the count of each itemset in the lattice changes as a new transaction arrives. But by recording one additional piece of information for each itemset p, namely the time-point of the most recent transaction that contains p, the algorithm only needs to update the counts of the itemsets which are subsets of the newly arrived transaction. It reduces their counts using the constant factor d, and then increases them by one. Further, it inserts the subsets of the current transaction which are potentially frequent into the lattice. It uses a method similar to Carma [19] to estimate the frequency of these newly inserted itemsets.
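The lazy decay trick described above can be sketched as follows: instead of decaying every count on every arrival, each entry stores the time of its last update, and the pending decay d^(Δt) is applied only when the entry is touched again. The dictionary layout and names below are illustrative assumptions, not the actual estDec data structures of [7].

    from itertools import combinations

    class DecayedCounts:
        def __init__(self, d):
            self.d = d                 # decay factor, 0 < d < 1
            self.table = {}            # itemset -> [count, last_update_time]
            self.t = 0                 # number of transactions seen so far

        def _decayed(self, itemset):
            count, last = self.table[itemset]
            return count * (self.d ** (self.t - last))

        def add_transaction(self, transaction):
            self.t += 1
            for itemset in self._subsets(transaction):
                if itemset in self.table:
                    # Apply the pending decay lazily, then add the new occurrence.
                    self.table[itemset] = [self._decayed(itemset) + 1, self.t]
                else:
                    self.table[itemset] = [1.0, self.t]

        @staticmethod
        def _subsets(transaction):
            # For illustration only: all non-empty subsets (exponential in general).
            items = sorted(transaction)
            for k in range(1, len(items) + 1):
                for c in combinations(items, k):
                    yield c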

Discussion: Among the above three problem settings, we can see that the first one, finding the frequent itemsets over the entire data stream, is the most challenging and fundamentally important one. It can often serve as the basis for solving the latter two. For example, the current sliding window model studied in MOMENT is very similar to incremental data mining. A more difficult problem is the case when the data in a sliding window cannot be held in main memory. Clearly, in such a case, we need single-pass algorithms even for a single sliding window. The main difference between the damped window model and the entire data stream is that, in the damped window model, the counts of itemsets need to be adjusted for each newly arrived transaction even when the itemsets do not appear in that transaction.

In the next section, we will introduce a new mining algorithm, StreamMining, which we proposed recently to find frequent itemsets over the entire data stream. It is a false-positive approach (similar to Lossy Counting). It has provable (user-defined) deterministic bounds on accuracy and is very memory efficient. In Section 4, we will review research works closely related to frequent pattern mining over data streams. Finally, we will conclude this chapter and discuss directions for future work (Section 5).

3. New Algorithm

This section describes our new algorithm for mining frequent itemsets in a stream. Initially, we discuss an approach for finding frequent items due to Karp et al. [25]. We then discuss the challenges in extending this idea to frequent itemset mining, and finally outline our ideas for addressing these issues.

3.1 KPS’s algorithm

Our work is derived from the recent work by Karp, Papadimitriou and Shenker on finding frequent elements (or 1-itemsets) [25]. Formally, given a sequence of length N and a threshold θ (0 < θ < 1), the goal of their work is to determine the elements that occur with frequency greater than Nθ.

A trivial algorithm for this will involve counting the frequency of all distinct elements, and checking if any of them has the desired frequency. If there are n distinct elements, this will require O(n) memory.

Their approach requires only O(1/θ) memory. Their approach can be viewed as a generalization of the following simple algorithm for finding the majority element in a sequence. A majority element is an element that appears more than half the time in the entire sequence. We find two distinct elements and eliminate them from the sequence. We repeat this process until only one distinct element remains in the sequence. If a majority element exists in the sequence, it will be left after this elimination. At the same time, any element remaining in the sequence is not necessarily the majority element. We can take another pass over the original sequence and check if the frequency of the remaining element is greater than N/2.


FindingFrequentItems(Sequence S, θ)
  global Set P;                 // set of potentially frequent items
  P ← ∅;
  foreach (s ∈ S)               // each item in S
    if s ∈ P
      s.count++;
    else
      P ← s ∪ P;
      s.count = 1;
    if |P| ≥ ⌈1/θ⌉
      foreach (p ∈ P)
        p.count−−;
        if p.count = 0
          P ← P − p;
  Output(P);

Figure 4.1. Karp et al. Algorithm to Find Frequent Items

The idea can be generalized to an arbitrary θ. We can proceed as follows. We pick any 1/θ distinct elements in the sequence and eliminate them together. This can be repeated until no more than 1/θ distinct elements remain in the sequence. It can be claimed that any element appearing more than Nθ times will be left in the sequence. The reason is that the elimination can be performed at most N/(1/θ) = Nθ times. During each such elimination, any distinct element is removed at most once. Hence, for each distinct element, the total number of eliminations during the entire process is at most Nθ. Any element appearing more than Nθ times will remain in the sequence. Note, however, that the elements left in the sequence do not necessarily appear with frequency greater than Nθ. Thus, this approach will provide a superset of the elements which occur more than Nθ times.

Such processing can be performed so as to take only a single pass over the sequence, as we show in Figure 4.1. P is the set of potentially frequent items. We maintain a count for each item in the set P. This set is initially empty. As we process a new item from the sequence, we check if it is in the set P. If yes, its count is incremented; otherwise, it is inserted with a count of 1. When the size of the set P reaches ⌈1/θ⌉, we decrement the count of each item in P, and eliminate any item whose count has now become 0. This processing is equivalent to the eliminations we described earlier. Note that this algorithm requires only O(1/θ) space. It computes a superset of the frequent items. To find the precise set of frequent items, another pass can be taken over the sequence, and the frequency of all remaining elements can be counted.
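To make the bookkeeping concrete, the following is a minimal Python sketch of this counting scheme; the function name, the use of a plain dictionary for P, and the toy example are illustrative choices rather than part of the original description.

  from math import ceil

  def find_frequent_items(stream, theta):
      """One-pass superset of the items occurring more than N*theta times (in the style of Figure 4.1)."""
      counts = {}                      # the set P, with a counter per retained item
      bound = ceil(1.0 / theta)        # at most ceil(1/theta) counters are ever kept
      for s in stream:
          counts[s] = counts.get(s, 0) + 1
          if len(counts) >= bound:
              # an elimination: decrement every counter and drop items that reach zero
              for item in list(counts):
                  counts[item] -= 1
                  if counts[item] == 0:
                      del counts[item]
      return counts                    # superset of the frequent items; a second pass can verify exact counts

For instance, find_frequent_items("abracadabra", 0.3) retains only the item 'a', the only character occurring in more than 30% of the positions.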


3.2 Issues In Frequent Itemset Mining

In this chapter, we build a frequent itemset mining algorithm using the above basic idea. There are three main challenges when we apply this idea to mining frequent itemsets, which we summarize below.

1 Dealing with Transaction Sequences: The algorithm from Karp et al. assumes that a sequence is comprised of elements, i.e., each transaction in the sequence contains only one item. In frequent itemset mining, each transaction has a number of items, and the length of each transaction can also be different.

2 Dealing with k-itemsets: Karp et al.'s algorithm only finds the frequent items, or 1-itemsets. In a frequent itemset mining algorithm, we need to find all k-itemsets, k ≥ 1, in a single pass.

Note that their algorithm can be directly extended to find i-itemsets in the case where each transaction has a fixed length, l. This can be done by eliminating a group of (1/θ) × (l choose i) different i-itemsets together. This, however, requires Ω((1/θ) × (l choose i)) space, which becomes extremely high when l and i are large. Furthermore, in our problem, we have to find all i-itemsets, i ≥ 1, in a single pass.

3 Providing an Accuracy Bound: Karp et al.'s algorithm can provably find a superset of the frequent items. However, no accuracy bound is provided for the item(set)s in the superset, which we call the potential frequent item(set)s. For example, even if an item appears just a single time, it can still possibly appear in the superset reported by the algorithm. In frequent itemset mining, we would like to improve upon the above result, and provide a bound on the frequency of the itemsets that are reported by the algorithm.

3.3 Key Ideas

We now outline how we can address the three challenges we listed above.

Dealing with k-itemsets in a Stream of Transactions: Compared with the problem of finding frequent items, the challenges in finding frequent itemsets from a transaction sequence mainly arise due to the large number of potential frequent itemsets. This also results in high memory costs. As we stated previously, a direct application of the idea from Karp et al. will require Ω((1/θ) × (l choose i)) space to find the potential frequent i-itemsets, where l is the length of each transaction. This approach is prohibitively expensive when l and i are large, but can be feasible when i is small, such as 2 or 3.

Recall that most of the existing work on frequent itemset mining uses the Apriori property [1], i.e., an i-itemset can be frequent only if all subsets of this itemset are frequent. One of the drawbacks of this approach has been the large


number of 2-itemsets, especially when the number of distinct items is large and θ is small.

Our idea is to use a hybrid approach to mine frequent itemsets from a transaction stream. We use the idea from Karp et al. to determine the potential frequent 2-itemsets. Then, we use the set of potential frequent 2-itemsets and the Apriori property to generate the potential i-itemsets, for i > 2. This approach finds a set of potential frequent itemsets, which is guaranteed to contain all the true frequent itemsets, in a single pass over the stream.

Also, if a second pass over the data stream is allowed, we can eliminate all the false frequent itemsets from our result set. The second pass is very easy to implement, and in the rest of our discussion, we will only focus on the first pass of our algorithm.

Bounding False Positives: In order to have an accuracy bound, we propose the following criterion for the potential frequent itemsets reported after the first pass. Besides reporting all items or itemsets that occur with frequency more than Nθ, we want to report only the items or itemsets which appear with frequency at least Nθ(1 − ǫ), where 0 < ǫ ≤ 1. This criterion is similar to the one proposed by Manku and Motwani [28].

We can achieve this goal by modifying the algorithm as shown in Figure 4.2. In the first step, we invoke the algorithm from Karp et al. with the frequency level θǫ. This will report a superset of the items occurring with frequency more than Nθǫ. We also record the number of eliminations, c, that occur in this step. Clearly, c is bounded by Nθǫ. In the second step, we remove all items whose reported frequency is less than Nθ − c ≥ Nθ(1 − ǫ).

We have two claims about the above process: 1) it reports all items that occur with frequency more than Nθ, and 2) it only reports items which appear with frequency more than Nθ(1 − ǫ). The reason for this is as follows. Consider any element that appears with frequency at least Nθ. Each elimination reduces its count by at most one, so after the first step it will be present in the superset with a reported frequency of at least Nθ − c, where c ≤ Nθǫ. Therefore, it will remain in the set after the second step as well. Similarly, consider any item that appears with frequency less than Nθ(1 − ǫ). If this item is present in the superset reported after the first step, its reported frequency is less than Nθ(1 − ǫ) ≤ Nθ − c, so it will be removed during the second step. This idea can be used for frequent itemset mining also.
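To make the two-step modification concrete, here is a minimal Python sketch for the frequent-items case; the function and variable names are illustrative choices, and the procedure simply counts at the reduced level θǫ and then filters by Nθ − c.

  from math import ceil

  def find_frequent_items_bounded(stream, theta, eps):
      """Report a set that contains every item with frequency > N*theta and
      nothing with frequency below N*theta*(1 - eps)."""
      counts = {}
      bound = ceil(1.0 / (theta * eps))      # counting is done at the lower level theta*eps
      c = 0                                  # number of eliminations
      n = 0
      for s in stream:
          n += 1
          counts[s] = counts.get(s, 0) + 1
          if len(counts) >= bound:
              c += 1
              for item in list(counts):
                  counts[item] -= 1
                  if counts[item] == 0:
                      del counts[item]
      # second step: drop anything whose retained count cannot certify the (1 - eps) bound
      return {item: cnt for item, cnt in counts.items() if cnt > n * theta - c}

As a small worked example with made-up numbers: for N = 1000, θ = 0.02 and ǫ = 0.5, the counting runs at level θǫ = 0.01, so c ≤ Nθǫ = 10; the filter then keeps only items whose retained count exceeds Nθ − c ≥ 10 = Nθ(1 − ǫ). An item with true frequency at least 20 loses at most c ≤ 10 counts and survives, while an item with true frequency below 10 can never reach the retention threshold.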

In the next subsection, we introduce our algorithm for mining frequent itemsets from streaming data based on the above two ideas.

3.4 Algorithm Overview

We now introduce our new algorithm in three steps. In Subsection 3.5, we describe an algorithm for mining frequent itemsets from a data stream, which assumes that each transaction has the same length. In Subsection 3.6, we extend


FindingFrequentItemsBounded(Sequence S, θ, ǫ)
  global Set P;
  P ← ∅;
  c ← 0;                        // number of eliminations
  foreach (s ∈ S)
    if s ∈ P
      s.count++;
    else
      P ← s ∪ P;
      s.count = 1;
    if |P| ≥ ⌈1/(θǫ)⌉
      c++;                      // count eliminations
      foreach (p ∈ P)
        p.count−−;
        if p.count = 0
          P ← P − p;
  foreach (p ∈ P)
    if p.count ≤ (Nθ − c)
      P ← P − p;
  Output(P);

Figure 4.2. Improving the Algorithm with an Accuracy Bound


this algorithm to provide an accuracy bound on the potential frequent itemsets computed after one pass. In Subsection 3.7, we further extend the algorithm to deal with transactions of variable length.

Before detailing each algorithm, we first introduce some terminology. We are mining a stream of transactions D. Each transaction t in this stream comprises a set of items, and has length |t|. Let the number of transactions in D be |D|. Each algorithm takes the support level θ as one parameter. An itemset in D, to be considered frequent, should occur more than θ|D| times.

To store and manipulate the candidate frequent itemsets during any stage of every algorithm, a lattice L is maintained:

L = L1 ∪ L2 ∪ . . . ∪ Lk

where k is the size of the largest frequent itemset, and Li, 1 ≤ i ≤ k, comprises the potential frequent i-itemsets. Note that in mining frequent itemsets, the size of the set L1, which is bounded by the number of distinct items in the dataset, is typically not very large. Therefore, in order to simplify our discussion, we will not consider L1 in the following algorithms, and assume we can find the exact frequent 1-itemsets in the stream D. Also, we will directly extend the idea from Karp et al. to find the potential frequent 2-itemsets.

As we stated in the previous subsection, we deal with all k-itemsets, k > 2, using the Apriori property. To facilitate this, we keep a buffer T in each algorithm to store the recently received transactions. The buffer will be accessed several times to find the potential frequent k-itemsets, k > 2.

3.5 Mining Frequent Itemsets from Fixed Length Transactions

The algorithm we present here mines frequent itemsets from a stream, under the assumption that each transaction has the same length |t|. The algorithm has two interleaved phases. The first phase deals with 2-itemsets, and the second phase deals with k-itemsets, k > 2. The main algorithm and the associated subroutines are shown in Figures 4.3 and 4.4, respectively. Note that the two subroutines, Update and ReducFreq, are used by all the algorithms discussed in this section.

The first phase extends Karp et al.'s algorithm to deal with 2-itemsets. As we stated previously, the algorithm maintains a buffer T which stores the recently received transactions. Initially, the buffer is empty. When a new transaction t arrives, we put it in T. Next, we call the Update routine to increment counts in L2. This routine simply updates the counts of the 2-itemsets that are already in L2. The other 2-itemsets in the transaction t are inserted into the set L2.


StreamMining-Fixed(Stream D, θ)
  global Lattice L;
  local Buffer T;
  local Transaction t;
  L ← ∅; T ← ∅;
  f ← |t| ∗ (|t| − 1)/2;
  foreach (t ∈ D)
    T ← T ∪ t;
    Update(t, L, 2);
    if |L2| ≥ ⌈1/θ⌉ · f
      ReducFreq(L, 2);
      /∗ Deal with k-itemsets, k > 2 ∗/
      i ← 2;
      while Li ≠ ∅
        i++;
        foreach (t ∈ T)
          Update(t, L, i);
        ReducFreq(L, i);
      T ← ∅;
  Output(L);

Figure 4.3. StreamMining-Fixed: Algorithm Assuming Fixed Length Transactions

Update(Transaction t, Lattice L, i)
  for all i-subsets s of t
    if s ∈ Li
      s.count++;
    else if i ≤ 2
      Li.insert(s);
    else if all (i − 1)-subsets of s ∈ Li−1
      Li.insert(s);

ReducFreq(Lattice L, i)
  foreach i-itemset s ∈ Li
    s.count−−;
    if s.count = 0
      Li.delete(s);

Figure 4.4. Subroutines Description


When the size of L2 reaches the threshold ⌈1/θ⌉ · f, where f is the number of 2-itemsets per transaction, we call the procedure ReducFreq to reduce the count of each 2-itemset in L2, and the itemsets whose count becomes zero are deleted. Invoking ReducFreq on L2 triggers the second phase.

The second phase of the algorithm deals with all k-itemsets, k > 2. This process is carried out level-wise, i.e., it proceeds from 3-itemsets to the largest potential frequent itemsets. For each transaction in the buffer T, we enumerate all i-subsets. For any i-subset that is already in L, the processing is the same as for a 2-itemset, i.e., we simply increment its count. However, an i-subset that is not in L will be inserted into L only if all of its (i − 1)-subsets are in L as well. Thus, we use the Apriori property.

After updating the i-itemsets in L, we invoke the ReducFreq routine. Thus, the itemsets whose count is only 1 will be deleted from the lattice. This procedure continues until there are no potential frequent k-itemsets left at the current level of L. At the end of this, we clear the buffer, and start processing new transactions in the stream. This restarts the first phase of our algorithm to deal with 2-itemsets.
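A compact Python sketch of this two-phase processing (in the spirit of Figures 4.3 and 4.4) is shown below; the lattice is modelled as a dictionary of per-level dictionaries, transactions are plain Python sets, and all names are illustrative rather than taken from the original implementation.

  from itertools import combinations
  from math import ceil

  def update(t, lattice, i):
      """Count i-subsets of t, inserting new ones subject to the Apriori check for i > 2 (cf. Figure 4.4)."""
      level = lattice.setdefault(i, {})
      for s in combinations(sorted(t), i):
          s = frozenset(s)
          if s in level:
              level[s] += 1
          elif i <= 2:
              level[s] = 1
          elif all(frozenset(sub) in lattice.get(i - 1, {})
                   for sub in combinations(sorted(s), i - 1)):
              level[s] = 1            # all (i-1)-subsets survived, so s may still be frequent

  def reduc_freq(lattice, i):
      """Decrement every count at level i and drop zero-count itemsets (cf. Figure 4.4)."""
      level = lattice.get(i, {})
      for s in list(level):
          level[s] -= 1
          if level[s] == 0:
              del level[s]

  def stream_mining_fixed(stream, theta, t_len):
      """One-pass superset of the frequent itemsets, assuming every transaction has t_len items."""
      lattice, buffer = {2: {}}, []
      f = t_len * (t_len - 1) // 2                    # number of 2-itemsets per transaction
      for t in stream:
          t = set(t)
          buffer.append(t)
          update(t, lattice, 2)
          if len(lattice[2]) >= ceil(1.0 / theta) * f:
              reduc_freq(lattice, 2)                  # first phase: eliminate over the 2-itemsets
              i = 2
              while lattice.get(i):                   # second phase: level-wise over the buffered transactions
                  i += 1
                  for bt in buffer:
                      update(bt, lattice, i)
                  reduc_freq(lattice, i)
              buffer = []
      return lattice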

We next discuss the correctness and the memory costs of our algorithm. Let L^θ_i be the set of frequent i-itemsets with support level θ in D, and Li be the set of potential frequent i-itemsets provided by this algorithm.

Theorem 1 In using the algorithm StreamMining-Fixed on a set of transactions with a fixed length, for any k ≥ 2, L^θ_k ⊆ Lk.

Lemma 4.1 In using the algorithm StreamMining-Fixed on a set of transactions with a fixed length, the size of L2 is bounded by (⌈1/θ⌉ + 1) · (|t| choose 2).

The proofs of Theorem 1 and Lemma 4.1 are available in a technical report [23]. Theorem 1 implies that any frequent k-itemset is guaranteed to be in the output of our algorithm. Lemma 4.1 provides an estimate of the memory cost of L2.

3.6 Providing an Accuracy Bound

We now extend the algorithm from the previous subsection to provide a bound on the accuracy of the reported results. As described in Subsection 3.3, the bound is described by a user-defined parameter, ǫ, where 0 < ǫ ≤ 1. Based on this parameter, the algorithm ensures that the frequent itemsets reported do occur more than (1 − ǫ)θ|D| times in the dataset.

The basic idea for achieving such a bound for frequent items computation was illustrated in Figure 4.2. We can extend this idea to finding frequent itemsets. Our new algorithm is described in Figure 4.5. Note that we still assume that each transaction has the same length.

This algorithm provides the new bound on accuracy in two steps. In the first step, we invoke the algorithm in Figure 4.3 with the frequency level θǫ. This


StreamMining-Bounded(Stream D, θ, ǫ)
  global Lattice L;
  local Buffer T;
  local Transaction t;
  L ← ∅; T ← ∅;
  f ← |t| ∗ (|t| − 1)/2;
  c ← 0;                        // number of ReducFreq invocations
  foreach (t ∈ D)
    T ← T ∪ t;
    Update(t, L, 1);
    Update(t, L, 2);
    if |L2| ≥ ⌈1/θǫ⌉ · f
      ReducFreq(L, 2);
      c++;
      i ← 2;
      while Li ≠ ∅
        i++;
        foreach (t ∈ T)
          Update(t, L, i);
        ReducFreq(L, i);
      T ← ∅;
  foreach s ∈ L
    if s.count ≤ θ|D| − c
      Li.delete(s);
  Output(L);

Figure 4.5. StreamMining-Bounded: Algorithm with a Bound on Accuracy


will report a superset of the itemsets occurring with frequency more than Nθǫ. We record the number of invocations of ReducFreq, c, in the first step. Clearly, c is bounded by Nθǫ. In the second step, we remove all itemsets whose reported frequency is less than Nθ − c ≥ Nθ(1 − ǫ). This is achieved by the last foreach loop.

The new algorithm has the following property: 1) if an itemset has frequency more than θ|D|, it will be reported, and 2) if an itemset is reported as a potential frequent itemset, it must have frequency more than (1 − ǫ)θ|D|. Theorem 2 formally states this property, and its proof is available in a technical report [23].

Theorem 2 In using the algorithm StreamMining-Bounded on a set of transactions with a fixed length, for any k ≥ 2, L^θ_k ⊆ Lk ⊆ L^{(1−ǫ)θ}_k.

Note that the number of invocations of ReducFreq, c, is usually much smaller than Nθǫ after processing a data stream. Therefore, an interesting property of this approach is that it produces a very small number of false frequent itemsets, even with a relatively large ǫ. The experiments in [22] also support this observation.

The following lemma claims that the memory cost of L2 is increased by a factor proportional to 1/ǫ.

Lemma 4.2 In using the algorithm StreamMining-Bounded on a set of transactions with a fixed length, the size of L2 is bounded by (⌈1/(θǫ)⌉ + 1) · (|t| choose 2).

3.7 Dealing with Variable Length Transactions

In this subsection, we present our final algorithm, which improves upon the algorithm from the previous subsection by dealing with variable length transactions. The algorithm is referred to as StreamMining and is illustrated in Figure 4.6.

When each transaction has a different length, the number of 2-itemsets in each transaction also becomes different. Therefore, we cannot simply maintain f, the number of 2-itemsets per transaction, as a constant. Instead, we maintain f as a weighted average of the number of 2-itemsets in the transactions processed so far. This weighted average is computed by giving higher weight to the recent transactions. The details are shown in the pseudo-code for the routine TwoItemsetPerTransaction.

To motivate the need for taking such a weighted average, consider the natural alternative, which is to maintain f as the plain average number of 2-itemsets over all transactions seen so far. This will not work correctly. For example, suppose there are 3 transactions, which have lengths 2, 2, and 4, respectively, and θ is 0.5. The first two transactions have a total of two 2-itemsets, and the third one has six 2-itemsets. We perform an elimination when the number of distinct 2-itemsets is larger than or equal to (1/θ) × f. When the first two


StreamMining(Stream D, θ, ǫ)
  global Lattice L;
  local Buffer T;
  local Transaction t;
  L ← ∅; T ← ∅;
  f ← 0;                        // average number of 2-itemsets per transaction
  c ← 0;
  foreach (t ∈ D)
    T ← T ∪ t;
    Update(t, L, 1);
    Update(t, L, 2);
    f ← TwoItemsetPerTransaction(t);
    if |L2| ≥ ⌈1/θǫ⌉ · f
      ReducFreq(L, 2);
      c++;
      i ← 2;
      while Li ≠ ∅
        i++;
        foreach (t ∈ T)
          Update(t, L, i);
        ReducFreq(L, i);
      T ← ∅;
  foreach s ∈ L
    if s.count ≤ θ|D| − c
      Li.delete(s);
  Output(L);

TwoItemsetPerTransaction(Transaction t)
  global X;                     // number of 2-itemsets
  global N;                     // number of transactions
  local f;
  N++;
  X ← X + |t| · (|t| − 1)/2;
  f ← ⌈X/N⌉;
  if |L2| ≥ ⌈1/θǫ⌉ · f
    N ← N − ⌈1/θǫ⌉;
    X ← X − ⌈1/θǫ⌉ · f;
  return f;

Figure 4.6. StreamMining: Final Algorithm


transactions arrive, an elimination will happen (assuming that the two 2-itemsets are different). When the third one arrives, the average number of 2-itemsets is less than 3, so another elimination will be performed. Unfortunately, a frequent 2-itemset that appears in both transactions 1 and 3 will be deleted in this way.
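A small Python sketch of the weighted-average maintenance (mirroring the TwoItemsetPerTransaction routine of Figure 4.6) is given below; discounting X and N by ⌈1/(θǫ)⌉ whenever an elimination is triggered is what gives recent transactions a higher weight. The class name is hypothetical, and a small guard against a non-positive transaction count has been added to keep the sketch robust.

  from math import ceil

  class TwoItemsetAverager:
      """Weighted average number of 2-itemsets per transaction, in the spirit of TwoItemsetPerTransaction."""
      def __init__(self, theta, eps):
          self.bound = ceil(1.0 / (theta * eps))
          self.X = 0          # discounted total number of 2-itemsets seen
          self.N = 0          # discounted number of transactions seen

      def update(self, t_len, level2_size):
          self.N += 1
          self.X += t_len * (t_len - 1) // 2
          f = ceil(self.X / self.N)
          if level2_size >= self.bound * f:
              # an elimination will be triggered; discount the history so that older
              # transactions contribute less to future averages
              self.N = max(self.N - self.bound, 1)   # guard added for this sketch
              self.X = max(self.X - self.bound * f, 0)
          return f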

In our approach, the number of invocations of ReducFreq, c, is less than |D|(θǫ), where |D| is the number of transactions processed so far by the algorithm. Lemma 4.3 formalizes this, and its proof is available in a technical report [23].

Lemma 4.3 c < |D|(θǫ) is an invariant in the algorithm StreamMining.

Note that by using Lemma 4.3, we can deduce that the property of Theorem 2 still holds for mining a stream of transactions with variable transaction lengths. Formally,

Theorem 3 In using the algorithm StreamMining on a stream of transactions with variable lengths, for any k ≥ 2, L^θ_k ⊆ Lk ⊆ L^{(1−ǫ)θ}_k.

An interesting property of our method is that in the situation where each transaction has the same length, our final algorithm, StreamMining, will work in the same fashion as the algorithm previously shown in Figure 4.5.

Note, however, that unlike the case with fixed length transactions, the size of L2 cannot be bounded by a closed formula. Also, in all the algorithms discussed in this section, the sizes of the sets Lk, k > 2, also cannot be bounded in any way. Our algorithms use the Apriori property to reduce their sizes.

Finally, we point out that the new algorithm is very memory efficient. For example, Lossy Counting utilizes an out-of-core (disk-resident) data structure to maintain the potentially frequent itemsets. In comparison, we do not need any such structure. On the T10.I4.N10K dataset used in their paper, we see that with 1 million transactions and a support level of 1%, Lossy Counting requires an out-of-core data structure on top of even a 44 MB buffer. For datasets ranging from 4 million to 20 million transactions, our algorithm only requires a 2.5 MB main-memory-based summary. In addition, we believe that there are a number of advantages to an algorithm that does not require an out-of-core summary structure. Mining on streaming data may often be performed on mobile, hand-held, or sensor devices, where processors do not have attached disks. It is also well known that additional disk activity increases the power requirements, and battery life is an important issue in mobile, hand-held, or sensor devices. Also, while their algorithm is shown to be currently computation-bound, the disparity between processor speeds and disk speeds continues to grow rapidly. Thus, we can expect a clear advantage from an algorithm that does not require frequent disk accesses.


4. Work on Other Related Problems

In this section, we look at work on problems that are closely related to the frequent pattern mining problem defined in Section 2.

Scalable Frequent Pattern Mining Algorithms: A lot of research effort has been dedicated to making frequent pattern mining scalable on very large disk-resident datasets. These techniques usually focus on reducing the number of passes over the datasets. They typically try to find a superset or an approximate set of the frequent itemsets in a first pass, and then find all the frequent itemsets as well as their counts in one or a few more passes [29], [31], [19]. However, the first-pass algorithms either do not have appropriate guarantees on the accuracy of the frequent itemsets [31], or produce a large number of false positives [29, 19]. Therefore, they are not very suitable for the streaming environment.

Mining Frequent Items in Data Streams: Given a potentially infinite sequence of items, this work tries to identify the items which have higher frequencies than a given support level. Clearly, this problem can be viewed as a simple version of frequent pattern mining over the entire data stream, and indeed most of the mining algorithms discussed in this chapter are derived from this work. Algorithms for mining frequent items in data streams use different techniques, such as random sketches [8, 13] or sampling [31, 28], and achieve different space requirements. They also have either false-positive or false-negative properties. Interested readers can refer to [34] for a more detailed comparison.

Finding Top-k Items in Distributed Data Streams: Assuming we have several distributed data streams and each item might carry a different weight at each of its arrivals, the problem is to find the k items which have the highest global weight. Olston and his colleagues have studied this problem, which they call top-k monitoring queries [5, 27]. Clearly, in order to maintain the global top-k items in a distributed streaming environment, frequent communication and synchronization is needed. Therefore, the focus of their research is on reducing the communication cost. They have proposed a method to achieve this goal by constraining each individual data stream with an arithmetic condition. Communication is only necessary when the arithmetic condition is violated.

Finding Heavy Hitters in Data Streams: Cormode et al. studied the problem of efficiently identifying heavy hitters in data streams [14]. It can be viewed as an interesting variation of the frequent-items mining problem. In this problem, there is a hierarchy among the different items. Given a frequency level φ, the count of an item i in the hierarchy includes all the items which are descendants of i and whose counts are less than φ. An item whose count exceeds φ is called a Hierarchical Heavy Hitter (HHH), and we want to find all HHHs in a data stream. They have presented both deterministic and randomized algorithms to find HHHs.


Frequent Temporal Patterns over Data Streams: Considering a sliding window that moves along a data stream, we can monitor the count of an itemset at each time point. Clearly, this sequence of counting information for a given itemset can be formulated as a time series. Inspired by this observation, Teng et al. have developed an algorithm to find frequent patterns in the sliding window model and collect sufficient statistics for a regression-based analysis of such time series [30]. They have shown that such a framework is applicable to answering itemset queries with flexible time-intervals, trend identification, and change detection.

Mining Semi-Structured Data Streams: Asai et al. developed an efficient algorithm for mining frequent rooted ordered trees from a semi-structured data stream [3]. In this problem setting, they model a semi-structured dataset as a tree with infinite width but finite height. Traversing the tree in left-most order generates the data stream. In other words, each item in the data stream is a node in the tree, and its arrival order is decided by the left-most traversal of the tree. They utilize the method in Carma [19] for candidate subtree generation.

5. Conclusions and Future Directions

In this chapter, we gave an overview of the state of the art in algorithms for frequent pattern mining over data streams. We also introduced a new approach for frequent itemset mining. We have developed a new one-pass algorithm for the streaming environment, which has deterministic bounds on the accuracy. In particular, it does not require any out-of-core memory structure and is very memory efficient in practice.

Though the existing one-pass mining algorithms have been shown to be very accurate and faster than traditional multi-pass algorithms, the experimental results show that they are still computationally expensive, meaning that if the data arrives too rapidly, the mining algorithms will not be able to handle the data. Unfortunately, this can be the case for some high-velocity streams, such as network flow data. Therefore, new techniques are needed to increase the speed of stream mining tasks. We conclude this chapter with a list of future research problems to address this challenge.

Mining Maximal and Other Condensed Frequent Itemsets in Data Streams: Maximal frequent itemsets (MFI), and other condensed frequent itemsets, such as the δ-cover proposed in [32], provide good compression of the frequent itemsets. Mining them is very likely to reduce the mining costs, in terms of both computation and memory, over data streams. However, mining such kinds of compressed pattern sets poses new challenges. The existing techniques logically partition the data stream into segments, and mine the potentially frequent itemsets in each segment. For many compressed pattern sets, for instance MFI, if we just mine the MFI for each segment, it will be very hard to find the global


MFI. This is because the MFI can be different in each segment, and when we combine them together, we need the counts for the itemsets which are frequent but not maximal. However, estimating the counts for these itemsets can be very difficult. A similar problem occurs for mining other condensed frequent itemsets. Clearly, new techniques are necessary to mine condensed frequent itemsets in data streams.

Online Sampling for Frequent Pattern Mining: The current approaches involve a high computational cost for mining the data streams. One of the main reasons is that all of them try to maintain and deliver the potentially frequent patterns at any time. If the data stream arrives very rapidly, this could be unrealistic. Therefore, one possible approach is to maintain a sample set which best represents the data stream and provides a good estimate of the frequent itemsets.

Compared with existing sampling techniques [31, 9, 6] on disk-resident datasets for frequent itemset mining, sampling data streams brings some new issues. For example, the underlying distribution of the data stream can change from time to time. Therefore, sampling needs to adapt to the data stream. However, it will be quite difficult to monitor such changes if we do not mine the set of frequent itemsets directly. In addition, the space requirement of the sample set can be an issue as well. As pointed out by Manku and Motwani [28], methods similar to concise sampling [16] might be helpful to reduce the space and achieve better mining results.

References

[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In U. Fayyad et al., editors, Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, Menlo Park, CA, 1996.

[2] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD Conference, pages 207–216, May 1993.

[3] Tatsuya Asai, Hiroki Arimura, Kenji Abe, Shinji Kawasoe, and Setsuo Arikawa. Online algorithms for mining semi-structured data streams. In ICDM, pages 27–34, 2002.

[4] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the 2002 ACM Symposium on Principles of Database Systems (PODS 2002) (invited paper). ACM Press, June 2002.

[5] B. Babcock, S. Chaudhuri, and G. Das. Dynamic sampling for approximate query processing. In Proceedings of the 2003 ACM SIGMOD Conference. ACM Press, June 2003.


[6] Hervé Brönnimann, Bin Chen, Manoranjan Dash, Peter Haas, and Peter Scheuermann. Efficient data reduction with EASE. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 59–68, 2003.

[7] Joong Hyuk Chang and Won Suk Lee. Finding recent frequent itemsets adaptively over online data streams. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[8] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP '02: Proceedings of the 29th International Colloquium on Automata, Languages and Programming, 2002.

[9] Bin Chen, Peter Haas, and Peter Scheuermann. A new two-phase sampling based algorithm for discovering association rules. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 462–468, 2002.

[10] D. Cheung, J. Han, V. Ng, and C. Wong. Maintenance of discovered association rules in large databases: an incremental updating technique. In ICDE, 1996.

[11] Yun Chi, Haixun Wang, Philip S. Yu, and Richard R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, pages 59–66, 2004.

[12] Yun Chi, Yirong Yang, and Richard R. Muntz. HybridTreeMiner: An efficient algorithm for mining frequent rooted trees and free trees using canonical forms. In The 16th International Conference on Scientific and Statistical Database Management (SSDBM '04), 2004.

[13] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using Hamming norms. In Proceedings of the Conference on Very Large Data Bases (VLDB), pages 335–345, 2002.

[14] Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in data streams. In VLDB, pages 464–475, 2003.

[15] C. Giannella, Jiawei Han, Jian Pei, Xifeng Yan, and P. S. Yu. Mining frequent patterns in data streams at multiple time granularities. In Proceedings of the NSF Workshop on Next Generation Data Mining, November 2002.

[16] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In ACM SIGMOD, pages 331–342, 1998.

[17] Bart Goethals and Mohammed J. Zaki. Workshop report on the Workshop on Frequent Itemset Mining Implementations (FIMI), 2003.


[18] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2000.

[19] C. Hidber. Online association rule mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 145–156. ACM Press, 1999.

[20] Jun Huan, Wei Wang, Deepak Bandyopadhyay, Jack Snoeyink, Jan Prins, and Alexander Tropsha. Mining protein family-specific residue packing patterns from protein structure graphs. In Eighth International Conference on Research in Computational Molecular Biology (RECOMB), pages 308–315, 2004.

[21] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Principles of Knowledge Discovery and Data Mining (PKDD 2000), pages 13–23, 2000.

[22] R. Jin and G. Agrawal. An algorithm for in-core frequent itemset mining on streaming data. In ICDM, November 2005.

[23] Ruoming Jin and Gagan Agrawal. An algorithm for in-core frequent itemset mining on streaming data. Technical Report OSU-CISRC-2/04-TR14, Ohio State University, 2004.

[24] Ruoming Jin and Gagan Agrawal. A systematic approach for optimizing complex mining tasks on multiple datasets. In Proceedings of ICDE, 2005.

[25] Richard M. Karp, Christos H. Papadimitriou, and Scott Shenker. A simple algorithm for finding frequent elements in streams and bags. Available from http://www.cs.berkeley.edu/~christos/iceberg.ps, 2002.

[26] Michihiro Kuramochi and George Karypis. Frequent subgraph discovery. In ICDM '01: Proceedings of the 2001 IEEE International Conference on Data Mining, pages 313–320, 2001.

[27] Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, and Christopher Olston. Finding (recently) frequent items in distributed data streams. In ICDE '05: Proceedings of the 21st International Conference on Data Engineering, pages 767–778, 2005.

[28] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the Conference on Very Large Data Bases (VLDB), pages 346–357, 2002.

[29] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In 21st VLDB Conference, 1995.

[30] Wei-Guang Teng, Ming-Syan Chen, and Philip S. Yu. A regression-based temporal pattern mining scheme for data streams. In VLDB, pages 93–104, 2003.


[31] H. Toivonen. Sampling large databases for association rules. In 22nd VLDB Conference, 1996.

[32] Dong Xin, Jiawei Han, Xifeng Yan, and Hong Cheng. Mining compressed frequent-pattern sets. In VLDB, pages 709–720, 2005.

[33] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In ICDM '02: Proceedings of the 2002 IEEE International Conference on Data Mining, page 721, 2002.

[34] Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, and Aoying Zhou. False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Toronto, Canada, August 2004.

[35] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery: An International Journal, 1(4):343–373, December 1997.

[36] Mohammed J. Zaki. Efficiently mining frequent trees in a forest. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80, 2002.

[37] Mohammed J. Zaki and Charu C. Aggarwal. XRules: An effective structural classifier for XML data. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 316–325, 2003.


Chapter 5

A SURVEY OF CHANGE DIAGNOSIS ALGORITHMS IN EVOLVING DATA STREAMS

Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY 10532

[email protected]

Abstract

An important problem in the field of data stream analysis is change detection and monitoring. In many cases, the data stream can show changes over time which can be used for understanding the nature of several applications. We discuss the concept of velocity density estimation, a technique used to understand, visualize and determine trends in the evolution of fast data streams. We show how to use velocity density estimation in order to create both temporal velocity profiles and spatial velocity profiles at periodic instants in time. These profiles are then used in order to predict three kinds of data evolution. Methods are proposed to visualize the changing data trends in a single online scan of the data stream, with a computational requirement which is linear in the number of data points. In addition, batch processing techniques are proposed in order to identify combinations of dimensions which show the greatest amount of global evolution. We also discuss the problem of change detection in the context of graph data, and illustrate that it may often be useful to determine communities of evolution in graph environments.

The presence of evolution in data streams may also change the underlying data to the extent that the underlying data mining models may need to be modified to account for the change in data distribution. We discuss a number of methods for micro-clustering which are used to study the effect of evolution on problems such as clustering and classification.


1. Introduction

In recent years, advances in hardware technology have resulted in automated storage of data from a variety of processes. This results in storage which creates millions of records on a daily basis. Often, the data may show important changes in trends over time because of changes in the underlying phenomena. This process is referred to as data evolution. By understanding the nature of such changes, a user may be able to glean valuable insights into emerging trends in the underlying transactional or spatial activity.

The problem of data evolution is interesting from two perspectives:

For a given data stream, we would like to find the significant changes which have occurred in the data stream. This includes methods of visualizing the changes in the data and finding the significant regions of data dissolution, coagulation, and shift. The aim of this approach is to provide a direct understanding of the underlying changes in the stream. Methods such as those discussed in [3, 11, 15, 18] fall into this category. Such methods may be useful in a number of applications such as network traffic monitoring [21]. In [3], the velocity density estimation method has been proposed, which can be used in order to visualize different kinds of trends in the data stream. In [11], the difference between two distributions is characterized using the KL-distance. Other methods for trend and change detection in massive data sets may be found in [15]. Methods have also been proposed recently for change detection in graph data streams [2].

The second class of problems relevant to data evolution is that of updating data mining models when a change has occurred. There is a considerable amount of work in the literature with a focus on incremental maintenance of models in the context of evolving data [10, 12, 24]. However, in the context of fast data streams, it is more important to use the evolution of the data stream in order to measure the nature of the change. Recent work [13, 14] has discussed a general framework for quantifying the changes in evolving data characteristics in the context of several data mining problems and algorithms. The focus of this chapter is different from and orthogonal to the work in [13, 14]. Specifically, the work in [13, 14] is focussed on the effects of evolution on data mining models and algorithms. While these results show some interesting results in terms of generalizing existing data mining algorithms, our view is that data streams have special mining requirements which cannot be satisfied by using existing data mining models and algorithms. Rather, it is necessary to tailor the algorithms appropriately to each task. The algorithms discussed in [5, 7] discuss methods for clustering and classification in the presence of evolution of data streams.


This chapter will discuss the issue of data stream change in both these contexts. Specifically, we will discuss the following aspects:

We discuss methods for quantifying the change at a given point of the data stream. This is done using the concept of velocity density estimation [3].

We show how to use the velocity density in order to construct visual spatial and temporal profiles of the changes in the underlying data stream. These profiles provide the user with a visual overview of the changes in the underlying data stream.

We discuss methods for utilizing the velocity density in order to characterize the changes in the underlying data stream. These changes correspond to regions of dissolution, coagulation, and shift in the data stream.

We show how to use the velocity density to determine the overall level of change in the data stream. This overall level of change is defined in terms of the evolution coefficient of the data stream. The evolution coefficient can be used to find interesting combinations of dimensions with a high level of global evolution. This can be useful in many applications in which we wish to find subsets of dimensions which show a global level of change.

We discuss how clustering methods can be used to analyze the change in different kinds of data mining applications. We discuss the problem of community evolution in interaction graphs and show how the methods for analyzing interaction graphs can be quite similar to those for other kinds of multi-dimensional data.

We discuss the issue of effective application of data mining algorithms such as clustering and classification in the presence of change in data streams. We discuss general desiderata for designing change-sensitive data mining algorithms for streams.

A closely related problem is that of mining spatio-temporal or mobile data [19, 20, 22], for which it is useful to have the ability to diagnose aggregate changes in spatial characteristics over time. The results in this chapter can be easily generalized to these cases. In such cases, the change trends may also be useful from the perspective of providing physical interpretability to the underlying change patterns.

This chapter is organized as follows. In the next section, we will introduce the velocity density method and show how it can be used to provide different kinds of visual profiles. These visual representations may consist of spatial


or temporal velocity profiles. The velocity density method also provides measures which are helpful in measuring evolution in the high dimensional case. In section 3, we will discuss how the process of evolution affects data mining algorithms. We will specifically consider the problems of clustering and classification. We will provide general guidelines as to how evolution can be leveraged in order to improve the quality of the results. In section 4, we discuss the conclusions and summary.

2. The Velocity Density Method

The idea in velocity density is to construct a density based velocity profile of the data. This is analogous to the concept of kernel density estimation in static data sets. In kernel density estimation [23], we provide a continuous estimate of the density of the data at a given point. The value of the density at a given point is estimated as the sum of the smoothed values of kernel functions K′_h(·) associated with each point in the data set. Each kernel function is associated with a kernel width h which determines the level of smoothing created by the function. The kernel estimation f(x) based on n data points and kernel function K′_h(·) is defined as follows:

f(x) = (1/n) · Σ_{i=1}^{n} K′_h(x − X_i)    (5.1)

Thus, each discrete point X_i in the data set is replaced by a continuous function K′_h(·) which peaks at X_i and has a variance which is determined by the smoothing parameter h. An example of such a distribution would be a gaussian kernel with width h:

K′_h(x − X_i) = (1/(√(2π) · h)) · e^{−(x − X_i)² / (2h²)}    (5.2)

The estimation error is defined by the kernel width h, which is chosen in a data driven manner. It has been shown [23] that for most smooth functions K′_h(·), when the number of data points goes to infinity, the estimator f(x) asymptotically converges to the true density function, provided that the width h is chosen appropriately. For the d-dimensional case, the kernel function is chosen to be the product of d identical kernels K_i(·), each with its own smoothing parameter h_i.
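As a brief illustration of equations 5.1 and 5.2, the following Python fragment evaluates a one-dimensional gaussian kernel density estimate at a query point; the sample values and the bandwidth h = 0.3 are arbitrary choices made only for the example.

  import math

  def kernel_density(x, data, h):
      """Gaussian kernel density estimate f(x) = (1/n) * sum_i K'_h(x - X_i), cf. equations 5.1 and 5.2."""
      n = len(data)
      norm = math.sqrt(2.0 * math.pi) * h
      return sum(math.exp(-(x - xi) ** 2 / (2.0 * h * h)) / norm for xi in data) / n

  # density estimate at x = 0.5 from a small sample (illustrative values only)
  print(kernel_density(0.5, [0.1, 0.4, 0.45, 0.9, 1.2], 0.3))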

In order to compute the velocity density, we use a temporal window h_t in order to perform the calculations. Intuitively, the temporal window h_t is associated with the time horizon over which the rate of change is measured. Thus, if h_t is chosen to be large, then the velocity density estimation technique provides long term trends, whereas if h_t is chosen to be small then the trends are relatively short term. This provides the user flexibility in analyzing the changes in the data over different kinds of time horizons. In addition, we have


Figure 5.1. The Forward Time Slice Density Estimate


Figure 5.2. The Reverse Time Slice Density Estimate

a spatial smoothing vector h_s whose function is quite similar to the standard spatial smoothing vector which is used in kernel density estimation.

Let t be the current instant and S be the set of data points which have arrived in the time window (t − h_t, t). We intend to estimate the rate of increase in density at spatial location X and time t by using two sets of estimates: the forward time slice density estimate and the reverse time slice density estimate. Intuitively, the forward time slice estimate measures the density function for


Figure 5.3. The Temporal Velocity Profile

(The plot marks two regions of data coagulation, a region of data dissolution, and a shift line over Dimensions 1 and 2.)

Figure 5.4. The Spatial Velocity Profile


all spatial locations at a given time t based on the set of data points which have arrived in the past time window (t − h_t, t). Similarly, the reverse time slice estimate measures the density function at a given time t based on the set of data points which will arrive in the future time window (t, t + h_t). Let us assume that the i-th data point in S is denoted by (X_i, t_i), where i varies from 1 to |S|. Then, the forward time slice estimate F_(h_s,h_t)(X, t) of the set S at the spatial location X and time t is given by:

F_(h_s,h_t)(X, t) = C_f · Σ_{i=1}^{|S|} K_(h_s,h_t)(X − X_i, t − t_i)    (5.3)

Here K_(h_s,h_t)(·, ·) is a spatio-temporal kernel smoothing function, h_s is the spatial kernel vector, and h_t is the temporal kernel width. The kernel function K_(h_s,h_t)(X − X_i, t − t_i) is a smooth distribution which decreases with increasing value of t − t_i. The value of C_f is a suitably chosen normalization constant, so that the entire density over the spatial plane is one unit. This is done because our purpose in calculating the densities at the time slices is to compute the relative variations in the density over the different spatial locations. Thus, C_f is chosen such that we have:

∫_{All X} F_(h_s,h_t)(X, t) δX = 1    (5.4)

The reverse time slice density estimate is calculated in a somewhat different way from the forward time slice density estimate. We assume that the set of points which have arrived in the time interval (t, t + h_t) is given by U. As before, the value of C_r is chosen as a normalization constant. Correspondingly, we define the value of the reverse time slice density estimate R_(h_s,h_t)(X, t) as follows:

R_(h_s,h_t)(X, t) = C_r · Σ_{i=1}^{|U|} K_(h_s,h_t)(X − X_i, t_i − t)    (5.5)

Note that in this case, we are using t_i − t in the argument instead of t − t_i. Thus, the reverse time-slice density in the interval (t, t + h_t) would be exactly the same as the forward time slice density if we assumed that time was reversed and the data stream arrived in reverse order, starting at t + h_t and ending at t. Examples of the forward and reverse density profiles are illustrated in Figures 5.1 and 5.2 respectively.

For a given spatial location X and time T, let us examine the nature of the functions F_(h_s,h_t)(X, T) and R_(h_s,h_t)(X, T − h_t). Note that both functions are almost exactly the same, and use the same data points from the interval (T − h_t, T), except that one has been calculated assuming time runs forward, whereas the other has been calculated assuming that time runs in reverse.


Furthermore, the volumes under each of these curves, when measured over all spatial locations X, are equal to one unit because of the normalization. Correspondingly, the density profiles at a given spatial location X would differ between the two depending upon how the relative trends have changed in the interval (T − h_t, T). We define the velocity density V_(h_s,h_t)(X, T) at spatial location X and time T as follows:

V_(h_s,h_t)(X, T) = [F_(h_s,h_t)(X, T) − R_(h_s,h_t)(X, T − h_t)] / h_t    (5.6)

We note that a positive value of the velocity density corresponds to an increase in the data density at a given point. A negative value of the velocity density corresponds to a reduction in the data density at a given point. In general, it has been shown in [3] that when the spatio-temporal kernel function is defined as below, then the velocity density is directly proportional to the rate of change of the data density at a given point.

K_(h_s,h_t)(X, t) = (1 − t/h_t) · K′_{h_s}(X)    (5.7)

This kernel function is only defined for values of t in the range (0, h_t). The gaussian spatial kernel function K′_{h_s}(·) was used because of its well known effectiveness [23]. Specifically, K′_{h_s}(·) is the product of d identical gaussian kernel functions, and h_s = (h¹_s, . . . , h^d_s), where h^i_s is the smoothing parameter for dimension i. Furthermore, for the special case of static snapshots, it is possible to show [3] that the velocity density is proportional to the difference in the spatial kernel densities of the two sets. Thus, the velocity density approach retains its intuitive appeal under a variety of special circumstances.

In general, we utilize a grid partitioning of the data in order to perform the velocity density calculation. We pick a total of β coordinates along each dimension. For a 2-dimensional system, this corresponds to β² spatial coordinates. The temporal velocity profile can be calculated with O(β²) simple additive operations per data point. For each coordinate X_g in the grid, we maintain two sets of counters (corresponding to forward and reverse density counters) which are updated as each point in the data stream is received. When a data point X_i is received at time t_i, we add the value K_(h_s,h_t)(X_g − X_i, t − t_i) to the forward density counter, and the value K_(h_s,h_t)(X_g − X_i, t_i − (t − h_t)) to the reverse density counter for X_g. At the end of time t, the values computed for each coordinate of the grid need to be normalized. The process of normalization is the same for either the forward or the reverse density profiles. In each case, we sum up the total value in all the β² counters, and divide each counter by this total. Thus, for the normalized coordinates, the sum of the values over all the β² coordinates will be equal to 1. Then the reverse density counters are subtracted from the forward counters in order to complete the computation.


Successive sets of temporal profiles are generated at user-defined time-intervals of h_t. In order to ensure online computation, the smoothing parameter vector h_s for the time-interval (T − h_t, T) must be available at time T − h_t, as soon as the first data point of that interval is scheduled to arrive. Therefore, we need a way of estimating this vector using the data from past intervals. In order to generate the velocity density for the interval (T − h_t, T), the spatial kernel smoothing vector h_s is determined using Silverman's approximation rule [23] for gaussian kernels on the set of data points which arrived in the interval (T − 2h_t, T − h_t).

2.1 Spatial Velocity Profiles

Even better insight can be obtained by examining the nature of the spatial velocity profiles, which provide an insight into how the data is shifting. For each spatial point, we would like to compute the directions of movement of the data at a given instant. The motivation in developing a spatial velocity profile is to give a user a spatial overview of the re-organizations in relative data density at different points. In order to do so, we define an ǫ-perturbation along the i-th dimension by ǫ_i = ǫ · e_i, where e_i is the unit vector along the i-th dimension. For a given spatial location X, we first compute the velocity gradient along each of the d dimensions. We denote the velocity gradient along the i-th dimension by ∆v_i(X, t) for spatial location X and time t. This value is computed by subtracting the density at spatial location X from the density at X + ǫ_i (the ǫ-perturbation along the i-th dimension), and dividing the result by ǫ. The smaller the value of ǫ, the better the approximation. Therefore, we have:

∆v_i(X, t) = lim_{ǫ→0} [V_(h_s,h_t)(X + ǫ_i, t) − V_(h_s,h_t)(X, t)] / ǫ    (5.8)

The value of ∆v_i(X, t) is negative when the velocity density decreases with increasing value of the i-th coordinate of spatial location X. The gradient ∆v(X, t) is given by (∆v_1(X, t), . . . , ∆v_d(X, t)). This vector gives the spatial gradient at a given grid point both in terms of direction and magnitude. The spatial velocity profile is illustrated by creating a spatial plot which shows the directions of the data shifts at different grid points by directed markers which mirror these gradients both in terms of direction and magnitude. An example of a spatial velocity profile is illustrated in Figure 5.4. If desired, the spatial profile can be generated continuously for a fast data stream. This continuous generation of the profile creates spatio-temporal animations which provide a continuous idea of the trend changes in the underlying data. Such animations can also provide real time diagnosis ability for a variety of applications.
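When the velocity density has been evaluated on a regular grid, the gradient of equation 5.8 can be approximated by finite differences; the fragment below (assuming a nested-list grid and unit grid spacing, both assumptions of this sketch) returns the two gradient components at every grid point, which can then be drawn as the directed markers of the spatial velocity profile.

  def velocity_gradient(v, eps=1.0):
      """Finite-difference approximation of (Delta v_1, Delta v_2) over a 2-d velocity density grid v."""
      rows, cols = len(v), len(v[0])
      grad = [[(0.0, 0.0)] * cols for _ in range(rows)]
      for i in range(rows):
          for j in range(cols):
              d1 = (v[min(i + 1, rows - 1)][j] - v[i][j]) / eps   # perturbation along dimension 1
              d2 = (v[i][min(j + 1, cols - 1)] - v[i][j]) / eps   # perturbation along dimension 2
              grad[i][j] = (d1, d2)
      return grad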

An additional useful ability is to be able to concisely diagnose specific trends in given spatial locations. For example, a user may wish to know particular spatial locations in the data at which the data is being reduced, those at which the data is increasing, and those from where the data is shifting to other locations:

Definition 5.1 A data coagulation for time slice t and user-defined threshold min-coag is defined to be a connected region R in the data space, so that for each point X ∈ R, we have V_(hs,ht)(X, t) > min-coag > 0.

Thus, a data coagulation is a connected region in the data which has velocity density larger than a user-defined noise threshold of min-coag. In terms of the temporal velocity profile, these are the connected regions in the data with elevations larger than min-coag. Note that there may be multiple such elevated regions in the data, each of which may be disconnected from one another. Each such region is a separate area of data coagulation, since they cannot be connected by a continuous path above the noise threshold. For each such elevated region, we would also have a local peak, which represents the highest density in that locality.

Definition 5.2 The epicenter of a data coagulation R at time slice t is defined to be a spatial location X* such that X* ∈ R and for any X ∈ R, we have V_(hs,ht)(X, t) ≤ V_(hs,ht)(X*, t).

Similarly, regions of data dissolution and the corresponding epicenters can be determined.

Definition 5.3 A data dissolution for time slice t and user-defined threshold min-dissol is defined to be a connected region R in the data space, so that for each point X ∈ R, we have V_(hs,ht)(X, t) < −min-dissol < 0.

We define the epicenter of a data dissolution as follows:

Definition 5.4 The epicenter of a data dissolution R at time slice t is defined to be a spatial location X* such that X* ∈ R and for any X ∈ R, we have V_(hs,ht)(X, t) ≥ V_(hs,ht)(X*, t).
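On the grid discretization of Section 2, coagulation regions and their epicenters can be extracted with a simple connected-component search over the grid cells. The sketch below does this with a 4-neighbor flood fill over a 2-d array of velocity density values; the function names and the flood-fill choice are illustrative assumptions, and dissolution regions are handled analogously by negating the density.

```python
import numpy as np
from collections import deque

def coagulation_regions(vd, min_coag):
    """vd: 2-d array of velocity density values on the grid.  Returns a list of
    (cells, epicenter) pairs, one per connected region with vd > min_coag; the
    epicenter is the cell with the largest density in the region."""
    visited = np.zeros(vd.shape, dtype=bool)
    regions = []
    for start in zip(*np.where(vd > min_coag)):
        if visited[start]:
            continue
        cells, queue = [], deque([start])
        visited[start] = True
        while queue:                      # 4-neighbor flood fill above the threshold
            r, c = queue.popleft()
            cells.append((r, c))
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < vd.shape[0] and 0 <= nc < vd.shape[1]
                        and not visited[nr, nc] and vd[nr, nc] > min_coag):
                    visited[nr, nc] = True
                    queue.append((nr, nc))
        epicenter = max(cells, key=lambda rc: vd[rc])
        regions.append((cells, epicenter))
    return regions

def dissolution_regions(vd, min_dissol):
    # analogous: connected regions with vd < -min_dissol; the epicenter is the minimum
    return coagulation_regions(-np.asarray(vd), min_dissol)
```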

A region of data dissolution and its epicenter are calculated in an exactly analogous way to the epicenter of a data coagulation. It now remains to discuss how significant shifts in the data can be detected. Many of the epicenters of coagulation and dissolution are connected in a way which results in a funneling of the data from the epicenters of dissolution to the epicenters of coagulation. When this happens, it is clear that the two phenomena of dissolution and coagulation are connected to one another. We refer to such a phenomenon as a global data shift. The detection of such shifts can be useful in many problems involving mobile objects. How can we find whether a pair of epicenters is connected in this way?

In order to detect such a phenomenon, we use the intuition derived from the use of the spatial velocity profiles. Let us consider a directed line drawn from an epicenter of data dissolution to an epicenter of data coagulation. In order for this directed line to be indicative of a global data shift, the spatial velocity profile should be such that the direction of the localized shift at each of the points on this directed line is roughly the same as the direction of the line itself. If at any point on this directed line the direction of the localized shift is in the opposite direction, then it is clear that these two epicenters are disconnected from one another. In order to facilitate further discussion, we will refer to the line connecting two epicenters as a potential shift line.

Recall that the spatial velocity profiles provide an idea of the spatial movements of the data over time. In order to calculate the nature of the data shift, we would need to calculate the projection of the spatial velocity profiles along this potential shift line. In order to do so without scanning the data again, we use the grid points which are closest to this shift line in order to obtain an approximation of the shift velocities at various points along this line. The first step is to find all the elementary rectangles which are intersected by the shift line. Once these rectangles have been found, we determine the grid points corresponding to the corners of these rectangles. These are the grid points at which the spatial velocity profiles are examined.

Let the set of n grid points thus discovered be denoted by Y1 . . . Yn. Then the corresponding spatial velocities at these grid points at time slice t are ∆v(Y1, t) . . . ∆v(Yn, t). Let L be the unit vector in the direction of the shift line. We assume that this vector is directed from the region of dissolution to the area of coagulation. Then the projections of the spatial velocities in the direction of the shift line are given by L · ∆v(Y1, t) . . . L · ∆v(Yn, t). We shall refer to these values as p1 . . . pn respectively. For a shift line to expose an actual movement of the data, the values of p1 . . . pn must all be substantially positive. In order to quantify this notion, we introduce a user-defined parameter called min-vel. A potential shift line is said to be a valid shift when each of the values p1 . . . pn is larger than min-vel.

Thus, in order to determine all the possible data shifts, we first find all coagulation and dissolution epicenters for the user-defined parameters min-coag and min-dissol respectively. Then we find all the potential shift lines by connecting each dissolution epicenter to a coagulation epicenter. For each such shift line, we find the grid points which are closest to it using the criteria discussed above. Finally, for each of these grid points, we determine the projection of the corresponding shift velocities along this line and check whether each of them is at least min-vel. If so, then this direction is reported as a valid shift line.
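The projection test itself is a few lines of linear algebra. The sketch below assumes that the gradient vectors ∆v(Y1, t) . . . ∆v(Yn, t) at the grid points nearest the potential shift line have already been collected (for example, with the gradient routine above); the function name is_valid_shift is an illustrative choice.

```python
import numpy as np

def is_valid_shift(dissol_epicenter, coag_epicenter, nearby_gradients, min_vel):
    """nearby_gradients: spatial velocity vectors Delta v(Y_1,t) ... Delta v(Y_n,t)
    at the grid points closest to the potential shift line.  The line is reported
    as a valid shift when every projection p_j = L . Delta v(Y_j,t) exceeds min_vel,
    where L is the unit vector from the dissolution to the coagulation epicenter."""
    L = np.asarray(coag_epicenter, dtype=float) - np.asarray(dissol_epicenter, dtype=float)
    L = L / np.linalg.norm(L)
    projections = [float(np.dot(L, g)) for g in nearby_gradients]
    return all(p > min_vel for p in projections), projections
```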

2.2 Evolution Computations in High Dimensional Case

In this section, we will discuss how to determine interesting combinations of dimensions with a high level of global evolution. In order to do so, we need to have a measure for the overall level of evolution in a given combination of dimensions. By integrating the absolute value of the velocity density over the entire spatial area, we can obtain the total rate of change over that area. In other words, if E_(hs,ht)(t) is the total evolution in the period (t − ht, t), then we have:

E_(hs,ht)(t) = ht · ∫_{all X} |V_(hs,ht)(X, t)| δX

Intuitively, the evolution coefficient measures the total volume of the evolution in the time horizon (t − ht, t). It is possible to calculate the evolution coefficients of particular projections of the data by using only the corresponding sets of dimensions in the density calculations. In [3] it has been shown how the computation of the evolution coefficient can be combined with an a-priori-like rollup approach in order to find the set of minimal evolving projections. In practice, the number of minimal evolving projections is relatively small, and therefore a large part of the search space can be pruned. This results in an effective algorithm for finding projections of the data which show a significant amount of evolution. In many applications, the individual attributes may not evolve much, but the projections may evolve considerably because of the changes in relationships among the underlying attributes. This can be useful in a number of applications such as target marketing or multi-dimensional trend analysis.
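On the grid, the integral can be approximated by summing the absolute velocity density over all cells and multiplying by the cell volume. The following one-function sketch makes this explicit; the parameter names are illustrative, and restricting the density array to a subset of the dimensions yields the coefficient of that projection.

```python
import numpy as np

def evolution_coefficient(vd, cell_volume, ht):
    """Approximate E_(hs,ht)(t) = ht * integral of |V_(hs,ht)(X, t)| dX by summing
    the absolute velocity density over all grid cells, weighted by the cell volume."""
    return ht * float(np.sum(np.abs(vd))) * cell_volume
```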

2.3 On the use of clustering for characterizing stream evolution

We note that methods such as clustering can be used to characterize the stream evolution. For this purpose, we utilize the micro-clustering methodology which is discussed² in [5]. We note that clustering is a natural choice to study broad changes in trends, since it summarizes the behavior of the data.

In this technique, micro-clusters are utilized in order to determine sudden changes in the data stream. Specifically, new trends in the data show up as new micro-clusters, whereas declining trends correspond to disappearing micro-clusters. In [5], we have illustrated the effectiveness of this kind of technique on an intrusion detection application. In general, the micro-clustering method is useful for change detection in a number of unsupervised applications where training data is not readily available, and anomalies can only be detected as sudden changes in the underlying trends. In the same paper, we have also shown some examples of how the method may be used for intrusion detection.

Such an approach has also been extended to the case of graph and structural data sets. In [2], we use a clustering technique in order to determine community evolution in graph data streams. Such a clustering technique is useful in many cases in which we need to determine changes in interaction among different entities. In such cases, the entities may represent nodes of a graph and the interactions may correspond to edges. A typical example of an interaction may be a phone call between two entities, or the co-authorship of a paper between two entities. In many cases, these trends of interaction may change over time. Such trends include the gradual formation and dissolution of different communities of interaction. In such cases, a user may wish to perform repeated exploratory querying of the data for different kinds of user-defined parameters. For example, a user may wish to determine rapidly expanding or contracting communities of interest over different time frames. This is difficult to perform in a fast data stream because of the one-pass constraints on the computations. Some examples of queries which may be performed by a user are as follows:
(1) Find the communities with a substantial increase in interaction level in the interval (t − h, t). We refer to such communities as expanding communities.
(2) Find the communities with a substantial decrease in interaction level in the interval (t − h, t). We refer to such communities as contracting communities.
(3) Find the communities with the most stable interaction level in the interval (t − h, t).

In order to resolve such queries, the method in [2] proposes an online analytical processing framework which separates online data summarization from offline exploratory querying. The process of data summarization stores portions of the graph on disk at specific periods of time. This summarized data is then used in order to resolve different kinds of queries. The result is a method which provides the ability to perform exploratory querying without compromising on the quality of the results. In this context, the clustering of the graph of interactions is a key component. The first step is to create a differential graph which represents the significant changes in the data interactions over the user-specified horizon. This is done using the summary information stored on the disk. Significant communities of change show up as clusters in this graph. The clustering process is able to find sub-graphs which represent a sudden formation of a cluster of interactions corresponding to the underlying change in the data. It has been shown in [2] that this process can be performed in an efficient and effective way, and can identify both expanding and contracting communities.

3. On the Effect of Evolution in Data Mining Algorithms

The discussion in this chapter has so far concentrated only on the problem of analyzing and visualizing the change in a data stream directly. In many cases, it is also desirable to analyze the evolution in a more indirect way, when such streams are used in conjunction with data mining algorithms. In this section, we will discuss the effects of evolution on data mining algorithms. The problem of mining incremental data dynamically has often been studied in many data mining scenarios [7, 10, 12, 24]. However, many of these methods are often not designed to work well with data streams since the distribution of the data evolves over time.

Some recent results [13] discuss methods for mining data streams under block evolution. We note that these methods are useful for incrementally updating the model when evolution has taken place. While the method has a number of useful characteristics, it does not attempt to determine the optimal segment of the data to be used for modeling purposes or provide an application-specific method to weight the relative importance of more recent or past data points. In many cases, the user may also desire to have the flexibility to analyze the data mining results over different time horizons. For such cases, it is desirable to use an online analytical processing framework which can store the underlying data in a summarized format over different time horizons. In this respect, it is desirable to store summarized snapshots [5, 7] of the data over different periods of time.

In order to store the data in a summarized format, we need methods with the following two characteristics:

We need a method for condensing the large number of data points in the stream into condensed summary statistics. In this respect, the use of clustering is a natural choice for data condensation.

We need a method for storing the condensed statistics over different periods of time. This is necessary in order to analyze the characteristics of the data over different time horizons. We note that the storage of the condensed data at each and every time unit can be expensive both in terms of computational resources and storage space. Therefore, a method needs to be used so that a small amount of data storage can retain a high level of accuracy in horizon recall. This technique is known as the pyramidal or geometric time frame. In this technique, a constant number of snapshots of different orders are stored. The snapshots of the ith order occur at intervals which are divisible by α^i for some α > 1. It can be shown that this storage pattern provides constant guarantees on the accuracy of horizon estimation.
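A minimal sketch of this snapshot-retention rule is shown below; the class and parameter names are illustrative, and the stored summary objects stand for whatever condensed statistics (e.g., micro-cluster statistics) the application maintains.

```python
class PyramidalSnapshots:
    """Snapshots of order i are taken at clock times divisible by alpha**i; at most
    max_per_order snapshots are retained for each order (oldest dropped first)."""

    def __init__(self, alpha=2, max_per_order=3):
        self.alpha = alpha
        self.max_per_order = max_per_order
        self.frames = {}                       # order -> list of (time, summary)

    def insert(self, t, summary):
        assert t > 0, "clock times are assumed to start at 1"
        order = 0
        while t % (self.alpha ** (order + 1)) == 0:
            order += 1                         # largest i with alpha^i dividing t
        frame = self.frames.setdefault(order, [])
        frame.append((t, summary))
        if len(frame) > self.max_per_order:    # keep only the newest snapshots
            frame.pop(0)

    def closest_snapshot(self, target_time):
        # used to answer a horizon query: find the stored snapshot nearest to t - h
        snaps = [s for frame in self.frames.values() for s in frame]
        return min(snaps, key=lambda s: abs(s[0] - target_time))
```

Combined with the additivity property described next, the snapshot returned by closest_snapshot can be subtracted from the current statistics to approximate the behavior over the horizon (t − h, t).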

Another property of the stored snapshots in [5] is that the corresponding statistics show the additivity property. The additivity property ensures that it is possible to obtain the statistics over a pre-defined time window by subtracting out the statistics of the previous window from those of the current window. Thus, it is possible to examine the evolving behavior of the data over different time horizons.

Once the summarized snapshots are stored in this pattern, they can be leveraged for a variety of data mining algorithms. For example, for the case of the classification problem [7], the underlying data may show significant change trends which result in different optimal time horizons. For this purpose, one can use the statistics over different time horizons. One can use a portion of the training stream to determine the horizon which provides the optimal classification accuracy. This value of the horizon is used in order to perform the final classification. The results in [7] show that there is a significant improvement in accuracy from the use of horizon-specific classification.

This technique is useful not just for the classification problem but also for a variety of problems in the evolving scenario. For example, in many cases, one may desire to forecast the future behavior of an evolving data stream. In such cases, the summary statistics can be used to make broad predictions about the future behavior of the stream. In general, for the evolving scenario, it is desirable to have the following characteristics for data stream mining algorithms:

It is desirable to leverage temporal locality in order to improve the mining effectiveness. The concept of temporal locality refers to the fact that the data points in the stream are not randomly distributed. Rather, the points at a given period in time are closely correlated, and may show specific levels of evolution in different regions. In many problems such as classification and forecasting, this property can be leveraged in order to improve the quality of the mining process.

It is desirable to have the flexibility of performing the mining over different time horizons. In many cases, the optimal results can be obtained only after applying the results of the algorithm over a variety of time horizons. An example of this case is illustrated in [7], in which the classification problem is solved by finding the optimal accuracy over different horizons.

In many problems, it is possible to perform incremental maintenance by using decay-specific algorithms. In such cases, recent points are weighted more heavily than older points during the mining process. The weight of the data points decays according to a pre-defined function which is application-specific. This function is typically chosen as an exponential decay function whose decay is defined in terms of the exponential decay parameter. An example of this situation is the high dimensional projected stream clustering algorithm discussed in [6]. (A small sketch of decay-based maintenance appears after this list.)

In many cases, synopsis construction algorithms such as sampling may not work very well in the context of an evolving data stream. Traditional reservoir sampling methods [25] may end up summarizing the stale history of the entire data stream. In such cases, it may be desirable to use a biased sampling approach which maintains the temporal stability of the stream sample. The broad idea is to construct a stream sample which maintains the points in proportion to their decay behavior. This is a challenging task for a reservoir construction algorithm, and is not necessarily possible for all decay functions. The method in [8] proposes a new method for reservoir sampling in the case of certain kinds of decay functions.
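As a concrete illustration of the decay-based maintenance mentioned above, the sketch below keeps an exponentially decayed aggregate that ages the existing statistic before each new point is folded in. It is a minimal example under an assumed decay rate λ, not the clustering algorithm of [6] or the biased reservoir scheme of [8].

```python
import math

class DecayedSum:
    """Maintains a decayed aggregate in which a point observed at time t_i carries
    weight exp(-lam * (t_now - t_i)); lam is an application-specific decay rate."""

    def __init__(self, lam):
        self.lam = lam
        self.value = 0.0
        self.last_time = None

    def add(self, x, t):
        if self.last_time is not None:
            # age the existing statistic before folding in the new point
            self.value *= math.exp(-self.lam * (t - self.last_time))
        self.value += x
        self.last_time = t
```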

While the work in [13] proposes methods for monitoring evolving data streams, this framework does not account for the fact that different methodologies may provide the most effective stream analysis in different cases. For some problems, it may be desirable to use a decay-based model, and for others it may be desirable to use only a subset of the data for the mining process. In general, the methodology used for a particular algorithm depends upon the details of that particular problem and the data. For example, for some problems such as high dimensional clustering [6], it may be desirable to use a decay-based approach, whereas for other problems such as classification, it may be desirable to use the statistics over different time horizons in order to optimize the algorithmic effectiveness. This is because problems such as high dimensional clustering require a large amount of data in order to provide effective results, and historical clusters do provide good insights about the future clusters in the data. Therefore, it makes more sense to use all the data, but with an application-specific decay-based approach which gives the new data greater weight than the older data. On the other hand, in problems such as classification, the advantage of using more data is much less relevant to the quality of the result than using the data which is representative of the current trends. The discussion of this section provides clues to the kinds of approaches that are useful for re-designing data mining algorithms in the presence of evolution.

4. Conclusions

In this chapter, we discussed the issue of change detection in data streams. We discussed different methods for characterizing change in data streams. For this purpose, we discussed the method of velocity density estimation and its application to different kinds of visual representations of changes in the underlying data. We also discussed the problem of online community evolution in fast data streams. In many of these methods, clustering is a key component since it allows us to summarize the data effectively. We also studied the reverse problem of how data mining models are maintained when the underlying data changes. In this context, we studied the problems of clustering and classification of fast evolving data streams. The key in many of these methods is to use an online analytical processing methodology which preprocesses and summarizes segments of the data stream. These summarized segments can be used for a variety of data mining purposes such as clustering and classification.

Notes

1. According to Silverman's approximation rule, the smoothing parameter for a data set with n points and standard deviation σ is given by 1.06 · σ · n^(−1/5). For the d-dimensional case, the smoothing parameter along each dimension is determined independently using the corresponding dimension-specific standard deviation.

2. The methodology is also discussed in an earlier chapter of this book.

References

[1] Aggarwal C., Procopiuc C., Wolf J., Yu P., Park J.-S. (1999). Fast Algorithms for Projected Clustering. ACM SIGMOD Conference.

[2] Aggarwal C., Yu P. S. (2005). Online Analysis of Community Evolution in Data Streams. SIAM Conference on Data Mining.

[3] Aggarwal C. (2003). A Framework for Diagnosing Changes in Evolving Data Streams. ACM SIGMOD Conference.

[4] Aggarwal C. (2002). An Intuitive Framework for Understanding Changes in Evolving Data Streams. IEEE ICDE Conference.

[5] Aggarwal C., Han J., Wang J., Yu P. (2003). A Framework for Clustering Evolving Data Streams. VLDB Conference.

[6] Aggarwal C., Han J., Wang J., Yu P. (2004). A Framework for High Dimensional Projected Clustering of Data Streams. VLDB Conference.

[7] Aggarwal C., Han J., Wang J., Yu P. (2004). On-Demand Classification of Data Streams. ACM KDD Conference.

[8] Aggarwal C. (2006). On Biased Reservoir Sampling in the Presence of Stream Evolution. VLDB Conference.

[9] Chawathe S., Garcia-Molina H. (1997). Meaningful Change Detection in Structured Data. ACM SIGMOD Conference Proceedings.

[10] Cheung D., Han J., Ng V., Wong C. Y. (1996). Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. IEEE ICDE Conference Proceedings.

[11] Dasu T., Krishnan S., Venkatasubramaniam S., Yi K. (2005). An Information-Theoretic Approach to Detecting Changes in Multi-dimensional Data Streams. Duke University Technical Report CS-2005-06.

[12] Donjerkovic D., Ioannidis Y. E., Ramakrishnan R. (2000). Dynamic Histograms: Capturing Evolving Data Sets. IEEE ICDE Conference Proceedings.

[13] Ganti V., Gehrke J., Ramakrishnan R. (2002). Mining Data Streams under Block Evolution. ACM SIGKDD Explorations, 3(2), 2002.

[14] Ganti V., Gehrke J., Ramakrishnan R., Loh W.-Y. (1999). A Framework for Measuring Differences in Data Characteristics. ACM PODS Conference Proceedings.

[15] Gollapudi S., Sivakumar D. (2004). Framework and Algorithms for Trend Analysis in Massive Temporal Data. ACM CIKM Conference Proceedings.

[16] Hulten G., Spencer L., Domingos P. (2001). Mining Time Changing Data Streams. ACM KDD Conference.

[17] Jain A., Dubes R. (1998). Algorithms for Clustering Data. Prentice Hall, New Jersey.

[18] Kifer D., Ben-David S., Gehrke J. (2004). Detecting Change in Data Streams. VLDB Conference, 2004.

[19] Roddick J. F. et al. (2000). Evolution and Change in Data Management: Issues and Directions. ACM SIGMOD Record, 29(1): pp. 21–25.

[20] Roddick J. F., Spiliopoulou M. (1999). A Bibliography of Temporal, Spatial, and Spatio-Temporal Data Mining Research. ACM SIGKDD Explorations, 1(1).

[21] Schweller R., Gupta A., Parsons E., Chen Y. (2004). Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams. Internet Measurement Conference Proceedings.

[22] Sellis T. (1999). Research Issues in Spatio-temporal Database Systems. Symposium on Spatial Databases Proceedings.

[23] Silverman B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.

[24] Thomas S., Bodagala S., Alsabti K., Ranka S. (1997). An Efficient Algorithm for the Incremental Updating of Association Rules in Large Databases. ACM KDD Conference Proceedings.

[25] Vitter J. S. (1985). Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11(1), pp. 37–57.


Chapter 6

MULTI-DIMENSIONAL ANALYSIS OF DATA STREAMS USING STREAM CUBES

Jiawei Han,1 Y. Dora Cai,1 Yixin Chen,2 Guozhu Dong,3 Jian Pei,4 Benjamin W. Wah,1 and Jianyong Wang5

1University of Illinois, Urbana, Illinois

hanj, ycai, [email protected]

2Washington University, St. Louis, Missouri

[email protected]

3Wright State University, Dayton, Ohio

[email protected]

4Simon Fraser University, British Columbia, Canada

[email protected]

5Tsinghua University, Beijing, China

[email protected]

Abstract   Large volumes of dynamic stream data pose great challenges to their analysis. Besides its dynamic and transient behavior, stream data has another important characteristic: multi-dimensionality. Much stream data resides in a multi-dimensional space and at a rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic changes in some combination of dimensions. To discover high-level dynamic and evolving characteristics, one may need to perform multi-level, multi-dimensional on-line analytical processing (OLAP) of stream data. This necessity calls for the investigation of new architectures that may facilitate on-line analytical processing of multi-dimensional stream data.

In this chapter, we introduce an interesting stream cube architecture that effectively performs on-line partial aggregation of multi-dimensional stream data, captures the essential dynamic and evolving characteristics of data streams, and facilitates fast OLAP on stream data. Three important techniques are proposed for the design and implementation of stream cubes. First, a tilted time frame model is proposed to register time-related data in a multi-resolution model: the more recent data are registered at finer resolution, whereas the more distant data are registered at coarser resolution. This design reduces the overall storage requirements of time-related data and adapts nicely to the data analysis tasks commonly encountered in practice. Second, instead of materializing cuboids at all levels, two critical layers, the observation layer and the minimal interesting layer, are maintained to support routine as well as flexible analysis with minimal computation cost. Third, an efficient stream data cubing algorithm is developed that computes only the layers (cuboids) along a popular path and leaves the other cuboids for on-line, query-driven computation. Based on this design methodology, the stream data cube can be constructed and maintained incrementally with reasonable memory space, computation cost, and query response time. This is verified by our substantial performance study.

The stream cube architecture facilitates online analytical processing of stream data. It also forms a preliminary structure for online stream mining. The impact of the design and implementation of the stream cube in the context of stream mining is also discussed in the chapter.

Keywords:   Data streams, multidimensional analysis, OLAP, data cube, stream cube, tilted time frame, partial materialization.

1. Introduction

A fundamental difference in the analysis of stream data from that of non-stream data is that the stream data is generated in huge volumes, flowing in and out dynamically, and changing rapidly. Due to the limited resources available and the usual requirements of fast response, most data streams may not be fully stored and may only be examined in a single pass. These characteristics of stream data have been emphasized and explored in the investigations of many researchers, such as ([6, 8, 17, 18, 16]), and efficient stream data querying, counting, clustering and classification algorithms have been proposed, such as ([2, 3, 22, 17, 18, 16, 25]). However, there is another important characteristic of stream data that has not drawn enough attention: most stream data sets are multidimensional in nature and reside at a rather low level of abstraction, whereas an analyst is often more interested in higher levels of abstraction in a small subset of dimension combinations. Similar to OLAP analysis of static data, multi-level, multi-dimensional on-line analysis should be performed on stream data as well. This can be seen from the following example.

Example 6.1 One may observe infinite streams of power usage data in a power supply system. The lowest granularity of such data can be the individual household and second. Although there is a tremendous variety of ways to analyze such data, the most useful online stream data analysis could be the analysis of the fluctuation of power usage at certain dimension combinations and at certain high levels, such as by region and by quarter (of an hour), making timely power supply adjustments and handling unusual situations. ⋄

One may easily link such multi-dimensional analysis with the online analytical processing of multi-dimensional nonstream data sets. For analyzing the characteristics of nonstream data, the most influential methodology is to use data warehouse and OLAP technology ([14, 11]). With this technology, data from different sources are integrated, and then aggregated in multi-dimensional space, either completely or partially, generating data cubes. The computed cubes can be stored in the form of relations or multi-dimensional arrays ([1, 31]) to facilitate fast on-line data analysis. In recent years, a large number of data warehouses have been successfully constructed and deployed in applications, and the data cube has become an essential component in most data warehouse systems and in some extended relational database systems for multidimensional data analysis and intelligent decision support.

Can we extend the data cube and OLAP technology from the analysis of static, pre-integrated data to that of dynamically changing stream data, including time-series data, scientific and engineering data, and data produced in other dynamic environments, such as power supply, network traffic, stock exchange, telecommunication data flow, Web click streams, weather or environment monitoring? The answer to this question may not be so easy since, as everyone knows, it takes great efforts and substantial storage space to compute and maintain static data cubes. A dynamic stream cube may demand an even greater computing power and storage space. How can we have sufficient resources to compute and store a dynamic stream cube?

In this chapter, we examine this issue and propose an interesting architecture, called stream cube, for on-line analytical processing of voluminous, infinite, and dynamic stream data, with the following design considerations.

1 For analysis of stream data, it is unrealistic to store and analyze data with an infinitely long and fine scale on time. We propose a tilted time frame as the general model of the time dimension. In the tilted time frame, time is registered at different levels of granularity. The most recent time is registered at the finest granularity; the more distant time is registered at coarser granularity; and the level of coarseness depends on the application requirements and on how distant the time point is from the current one. This model is sufficient for most analysis tasks, and at the same time it also ensures that the total amount of data to retain in memory or to be stored on disk is quite limited.

2 With limited memory space in stream data analysis, it is often still too costly to store a precomputed cube, even with the tilted time frame. We propose to compute and store only two critical layers (which are essentially cuboids) in the cube: (1) an observation layer, called o-layer, which is the layer that an analyst would like to check and make decisions at, either signaling the exceptions or drilling on the exception cells down to lower layers to find their corresponding lower-level exceptions; and (2) the minimal interesting layer, called m-layer, which is the minimal layer that an analyst would like to examine, since it is often neither cost-effective nor practically interesting to examine the minute detail of stream data. For example, in Example 6.1, we assume that the o-layer is (user-category, region, quarter), while the m-layer is (user, city-block, minute).

3 Storing a cube at only two critical layers leaves much room for deciding what and how to compute for the cuboids between the two layers. We propose one method, called popular-path cubing, which rolls up the cuboids from the m-layer to the o-layer by following the most popular drilling path, materializes only the layers along the path, and leaves the other layers to be computed at OLAP query time. An H-tree data structure is used here to facilitate efficient pre- and on-line computation. Our performance study shows that this method achieves a good trade-off between space, computation time, and flexibility, and has both quick aggregation time and query answering time.

The remainder of this chapter is organized as follows. In Section 2, we define the basic concepts and introduce the problem. In Section 3, we present an architectural design for on-line analysis of stream data by introducing the concepts of tilted time frame and critical layers. In Section 4, we present the popular-path cubing method, an efficient algorithm for stream data cube computation that supports on-line analytical processing of stream data. Our experiments and performance study of the proposed methods are presented in Section 5. The related work and possible extensions of the model are discussed in Section 6, and our study is concluded in Section 7.

2. Problem Definition

Let DB be a relational table, called the base table, of a given cube. The set of all attributes A in DB is partitioned into two subsets, the dimensional attributes DIM and the measure attributes M (so DIM ∪ M = A and DIM ∩ M = ∅). The measure attributes functionally depend on the dimensional attributes in DB and are defined in the context of the data cube using some typical aggregate functions, such as COUNT, SUM, AVG, or more sophisticated computational functions, such as standard deviation and regression.

A tuple with schema A in a multi-dimensional space (i.e., in the context of a data cube) is called a cell. Given three distinct cells c1, c2 and c3, c1 is an ancestor of c2, and c2 a descendant of c1, iff on every dimensional attribute either c1 and c2 share the same value, or c1's value is a generalized value of c2's in the dimension's concept hierarchy. c2 is a sibling of c3 iff c2 and c3 have identical values in all dimensions except one dimension A where c2[A] and c3[A] have the same parent in the dimension's domain hierarchy. A cell which has k non-* values is called a k-d cell. (We use “∗” to indicate “all”, i.e., the highest level on any dimension.)

A tuple c ∈ DB is called a base cell. A base cell does not have any descendant. A cell c is an aggregated cell iff it is an ancestor of some base cell. For each aggregated cell c, its values on the measure attributes are derived from the complete set of descendant base cells of c. An aggregated cell c is an iceberg cell iff its measure value satisfies a specified iceberg condition, such as measure ≥ val. The data cube that consists of all and only the iceberg cells satisfying a specified iceberg condition I is called the iceberg cube of a database DB under condition I.

Notice that in stream data analysis, besides the popularly used SQL aggregate-based measures, such as COUNT, SUM, MAX, MIN, and AVG, regression is a useful measure. A stream data cell compression technique, LCR (linearly compressed representation), is developed in ([12]) to support efficient on-line regression analysis of stream data in data cubes. The study in ([12]) shows that for linear and multiple linear regression analysis, only a small number of regression measures rather than the complete stream of data need to be registered. This holds for regression on both the time dimension and the other dimensions. Since it takes a much smaller amount of space and time to handle regression measures in a multi-dimensional space than handling the stream data itself, it is preferable to construct regression(-measured) cubes by computing such regression measures.

A data stream is considered as a voluminous, infinite flow of data records, such as power supply streams, Web click streams, and telephone calling streams. The data is collected at the most detailed level in a multi-dimensional space, which may represent time, location, user, and other semantic information. Due to the huge amount of data and the transient behavior of data streams, most of the computations will scan a data stream only once. Moreover, the direct computation of measures at the most detailed level may generate a huge number of results but may not be able to disclose the general characteristics and trends of data streams. Thus data stream analysis requires considering aggregations and analysis in a multi-dimensional and multi-level space.

Our task is to support efficient, high-level, on-line, multi-dimensional analysis of such data streams in order to find unusual (exceptional) changes of trends, according to users' interest, based on multi-dimensional numerical measures. This may involve the construction of a data cube, if feasible, to facilitate on-line, flexible analysis.


3. Architecture for On-line Analysis of Data Streams

To facilitate on-line, multi-dimensional analysis of data streams, we propose a stream cube architecture with the following features: (1) tilted time frame, (2) two critical layers: a minimal interesting layer and an observation layer, and (3) partial computation of data cubes by popular-path cubing. The stream data cubes so constructed are much smaller than those constructed from the raw stream data but will still be effective for multi-dimensional stream data analysis tasks.

3.1 Tilted time frame

In stream data analysis, people are usually interested in recent changes at a fine scale, but in long-term changes at a coarse scale. Naturally, one can register time at different levels of granularity. The most recent time is registered at the finest granularity; the more distant time is registered at coarser granularity; and the level of coarseness depends on the application requirements and on how distant the time point is from the current one.

There are many possible ways to design a tilted time frame. We adopt three kinds of models: (1) the natural tilted time frame model (Fig. 6.1), (2) the logarithmic scale tilted time frame model (Fig. 6.2), and (3) the progressive logarithmic tilted time frame model (Fig. 6.3).

Figure 6.1. A tilted time frame with natural time partition

Figure 6.2. A tilted time frame with logarithmic time partition

A natural tilted time frame model is shown in Fig. 6.1, where the time frame is structured in multiple granularities based on the natural time scale: the most recent 4 quarters (of 15 minutes), then the last 24 hours, 31 days, and 12 months (the concrete scale will be determined by applications). Based on this model, one can compute frequent itemsets in the last hour with the precision of a quarter of an hour, the last day with the precision of an hour, and so on, until the whole year, with the precision of a month (we align the time axis with the natural calendar time; thus, for each granularity level of the tilted time frame, there might be a partial interval which is less than a full unit at that level). This model registers only 4 + 24 + 31 + 12 = 71 units of time for a year instead of 366 × 24 × 4 = 35,136 units, a saving of about 495 times, with an acceptable trade-off in the grain of granularity at a distant time.

Frame no.   Snapshots (by clock time)
0           69 67 65
1           70 66 62
2           68 60 52
3           56 40 24
4           48 16
5           64 32

Figure 6.3. A tilted time frame with progressive logarithmic time partition

The second choice is the logarithmic tilted time model, as shown in Fig. 6.2, where the time frame is structured in multiple granularities according to a logarithmic scale. Suppose the current frame holds the transactions in the current quarter. Then the remaining slots are for the last quarter, the next two quarters, 4 quarters, 8 quarters, 16 quarters, etc., growing at an exponential rate. According to this model, with one year of data and the finest precision at the quarter, we will need log2(365 × 24 × 4) + 1 ≈ 16.1 units of time instead of 366 × 24 × 4 = 35,136 units. That is, we will just need 17 time frames to store the compressed information.

The third choice is a progressive logarithmic tilted time frame, where snapshots are stored at different levels of granularity depending on the recency. Snapshots are put into different frame numbers, varying from 1 to max_frame, where log2(T) − max_capacity ≤ max_frame ≤ log2(T), max_capacity is the maximal number of snapshots held in each frame, and T is the clock time elapsed since the beginning of the stream.

Each snapshot is represented by its timestamp. The rules for the insertion of a snapshot t (at time t) into the snapshot frame table are defined as follows: (1) if (t mod 2^i) = 0 but (t mod 2^(i+1)) ≠ 0, t is inserted into frame_number i if i ≤ max_frame; otherwise (i.e., i > max_frame), t is inserted into max_frame; and (2) each slot has a max_capacity (which is 3 in our example of Fig. 6.3). At the insertion of t into frame_number i, if the slot already reaches its max_capacity, the oldest snapshot in this frame is removed and the new snapshot inserted. For example, at time 70, since (70 mod 2^1) = 0 but (70 mod 2^2) ≠ 0, 70 is inserted into frame_number 1, which knocks out the oldest snapshot 58 if the slot capacity is 3. Also, at time 64, since (64 mod 2^6) = 0 but max_frame = 5, 64 has to be inserted into frame 5. Following this rule, when the slot capacity is 3, the following snapshots are stored in the tilted time frame table: 16, 24, 32, 40, 48, 52, 56, 60, 62, 64, 65, 66, 67, 68, 69, 70, as shown in Fig. 6.3. From the table, one can see that the closer to the current time, the denser are the snapshots stored.
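The insertion rule above is easy to simulate. The short sketch below replays times 1 through 70 with base 2, max_frame = 5, and max_capacity = 3, and reproduces exactly the set of snapshots listed in Fig. 6.3 (including 58 being knocked out of frame 1 when 70 arrives); the function name insert_snapshot is an illustrative choice.

```python
def insert_snapshot(frames, t, max_frame=5, max_capacity=3):
    """Insert timestamp t following the progressive logarithmic rule: pick the
    largest i with t mod 2^i == 0 and t mod 2^(i+1) != 0, capped at max_frame;
    each frame keeps at most max_capacity snapshots, dropping the oldest."""
    i = 0
    while t % (2 ** (i + 1)) == 0 and i < max_frame:
        i += 1
    frame = frames.setdefault(i, [])
    frame.append(t)
    if len(frame) > max_capacity:
        frame.pop(0)

frames = {}
for t in range(1, 71):            # replay the example of Fig. 6.3 up to time 70
    insert_snapshot(frames, t)

stored = sorted(x for f in frames.values() for x in f)
print(stored)  # [16, 24, 32, 40, 48, 52, 56, 60, 62, 64, 65, 66, 67, 68, 69, 70]
```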

In the logarithmic and progressive logarithmic models discussed above, we have assumed that the base is 2. Similar rules can be applied to any base α, where α is an integer and α > 1. The tilted time models shown above are sufficient for the usual time-related queries, and at the same time they ensure that the total amount of data to retain in memory and/or to be computed is small.

Both the natural tilted time frame model and the progressive logarithmic tilted time frame model provide a natural and systematic way for the incremental insertion of data in new frames and the gradual fading out of the old ones. When fading out the old ones, their measures are properly propagated to their corresponding retained time frame (e.g., from a quarter to its corresponding hour) so that these values are retained in aggregated form. To simplify our discussion, we will only use the natural tilted time frame model in the following discussions. The methods derived from this time frame can be extended either directly or with minor modifications to other time frames.

In our data cube design, we assume that each cell in the base cuboid and in an aggregate cuboid contains a tilted time frame, for storing and propagating measures in the computation. This tilted time frame model is sufficient to handle the usual time-related queries and mining, and at the same time it ensures that the total amount of data to retain in memory and/or to be computed is small.

3.2 Critical layers

Even with the tilted time frame model, it could still be too costly to dynamically compute and store a full cube since such a cube may have quite a few dimensions, each containing multiple levels with many distinct values. Since stream data analysis has only limited memory space but requires fast response time, a realistic arrangement is to compute and store only some mission-critical cuboids in the cube.

In our design, two critical cuboids are identified due to their conceptual and computational importance in stream data analysis. We call these cuboids layers and suggest computing and storing them dynamically. The first layer, called the m-layer, is the minimally interesting layer that an analyst would like to study. It is necessary to have such a layer since it is often neither cost-effective nor practically interesting to examine the minute detail of stream data. The second layer, called the o-layer, is the observation layer at which an analyst (or an automated system) would like to check and make decisions, either signaling the exceptions or drilling on the exception cells down to lower layers to find their lower-level exceptional descendants.


o-layer (observation): (*, city, quarter)
m-layer (minimal interest): (user-group, street-block, minute)
(primitive) stream data layer: (individual-user, street-address, second)

Figure 6.4. Two critical layers in the stream cube

Example 3. Assume that “(individual_user, street_address, second)” forms the primitive layer of the input stream data in Example 6.1. With the natural tilted time frame shown in Figure 6.1, the two critical layers for power supply analysis are: (1) the m-layer: (user_group, street_block, minute), and (2) the o-layer: (∗, city, quarter), as shown in Figure 6.4.

Based on this design, the cuboids lower than the m-layer will not need to be computed since they are beyond the minimal interest of users. Thus the minimal interesting cells that our base cuboid needs to compute and store are the aggregate cells computed by grouping by (user_group, street_block, minute). This can be done by aggregations (1) on two dimensions, user and location, by rolling up from individual_user to user_group and from street_address to street_block, respectively, and (2) on the time dimension by rolling up from second to minute.

Similarly, the cuboids at the o-layer should be computed dynamically according to the tilted time frame model as well. This is the layer that an analyst takes as an observation deck, watching the changes of the current stream data by examining the slope of changes at this layer to make decisions. The layer can be obtained by rolling up the cube (1) along two dimensions to ∗ (which means all user categories) and city, respectively, and (2) along the time dimension to quarter. If something unusual is observed, the analyst can drill down to examine the details and the exceptional cells at low levels. ⋄

3.3 Partial materialization of stream cube

Materializing a cube at only two critical layers leaves much room for how to compute the cuboids in between. These cuboids can be precomputed fully, partially, or not at all (i.e., everything is computed on-the-fly). Let us first examine the feasibility of each possible choice in the environment of stream data. Since there may be a large number of cuboids between these two layers and each may contain many cells, it is often too costly in both space and time to fully materialize these cuboids, especially for stream data. On the other hand, materializing nothing forces all the aggregate cells to be computed on-the-fly, which may slow down the response time substantially. Thus, it is clear that partial materialization of a stream cube is a viable choice.

Partial materialization of data cubes has been studied extensively in previous works, such as ([21, 11]). With the concern of both space and on-line computation time, partial computation of dynamic stream cubes poses more challenging issues than its static counterpart: one has to ensure not only the limited precomputation time and the limited size of a precomputed cube, but also efficient online incremental updating upon the arrival of new stream data, as well as fast online drilling to find interesting aggregates and patterns. Obviously, only careful design may lead to computing a rather small partial stream cube, fast updating of such a cube, and fast online drilling. We will examine how to design such a stream cube in the next section.

4. Stream Data Cube Computation

We first examine whether the iceberg cube can be an interesting model for a partially materialized stream cube. In data cube computation, the iceberg cube ([7]), which stores only the aggregate cells that satisfy an iceberg condition, has been used popularly as a data cube architecture since it may substantially reduce the size of a data cube when data is sparse. For example, for a sales data cube, one may want to only retain the (cube) cells (i.e., aggregates) containing more than 2 items. Such a condition is called an iceberg condition, and the cube containing only the cells satisfying the iceberg condition is called an iceberg cube. In stream data analysis, people may often be interested in only the substantially important or exceptional cube cells, and such important or exceptional conditions can be formulated as typical iceberg conditions. Thus it seems that the iceberg cube could be an interesting model for the stream cube architecture. Unfortunately, the iceberg cube cannot accommodate incremental update with the constant arrival of new data and thus cannot be used as the architecture of the stream cube. We have the following observation.

Observation (No iceberg cubing for stream data) The iceberg cube model does not fit the stream cube architecture. Nor does the exceptional cube model.

Rationale. With the incremental and gradual arrival of new stream data, as well as the incremental fading of the obsolete data from the time scope of a data cube, it is required that incremental update be performed on such a stream data cube. It is unrealistic to constantly recompute the data cube from scratch upon incremental updates due to the tremendous cost of recomputing the cube on the fly. Unfortunately, such an incremental model does not fit the iceberg cube computation model due to the following observation: let a cell “〈di, . . . , dk〉 : mik” represent a (k − i + 1)-dimension cell with di, . . . , dk as its corresponding dimension values and mik as its measure value. If SAT(mik, iceberg_cond) is false, i.e., mik does not satisfy the iceberg condition, the cell is dropped from the iceberg cube. However, at a later time slot t′, the corresponding cube cell may get a new measure m′_ik related to t′. Since mik has been dropped at a previous instance of time due to its inability to satisfy the iceberg condition, the new measure for this cell cannot be calculated correctly without such information. Thus one cannot use the iceberg architecture to model a stream cube unless one recomputes the measure from the base cuboid upon each update. Similar reasoning can be applied to the case of exceptional cell cubes, since the exceptional condition can be viewed as a special iceberg condition. ⋄

Since the iceberg cube cannot be used as a stream cube model, and materializing the full cube is too costly in both computation time and storage space, we propose to compute only a popular path of the cube as our partial computation of the stream data cube, as described below.

Based on the notions of the minimal interesting layer (the m-layer) and the tilted time frame, stream data can be directly aggregated to this layer according to the tilted time scale. Then the data can be further aggregated following one popular drilling path to reach the observation layer. That is, the popular path approach computes and maintains a single popular aggregation path from the m-layer to the o-layer so that queries directly on those layers along the popular path can be answered without further computation, whereas those deviating from the path can be answered with minimal online computation from those reachable from the computed layers. Such cost reduction makes possible the OLAP-styled exploration of cubes in stream data analysis.

To facilitate efficient computation and storage of the popular path of the stream cube, a compact data structure needs to be introduced so that the space taken in the computation of aggregations is minimized. A data structure called the H-tree, a hyper-linked tree structure introduced in ([20]), is revised and adopted here to ensure that a compact structure is maintained in memory for efficient computation of multi-dimensional and multi-level aggregations.

We present these ideas using an example.

Example 4. Suppose the stream data to be analyzed contains 3 dimensions, A, B and C, each with 3 levels of abstraction (excluding the highest level of abstraction “∗”), as (A1, A2, A3), (B1, B2, B3), (C1, C2, C3), where the ordering “∗ > A1 > A2 > A3” forms a high-to-low hierarchy, and so on. The minimal interesting layer (the m-layer) is (A2, B2, C2), and the o-layer is (A1, ∗, C1). From the m-layer (the bottom cuboid) to the o-layer (the top cuboid to be computed), there are in total 2 × 3 × 2 = 12 cuboids, as shown in Figure 6.5.

Figure 6.5. Cube structure from the m-layer to the o-layer

Suppose that the popular drilling path is given (it can usually be derived based on domain expert knowledge, query history, and statistical analysis of the sizes of intermediate cuboids). Assume that the given popular path is 〈(A1, ∗, C1) → (A1, ∗, C2) → (A2, ∗, C2) → (A2, B1, C2) → (A2, B2, C2)〉, shown as the dark-line path in Figure 6.5. Then each path of an H-tree from root to leaf is ordered the same as the popular path.

This ordering generates a compact tree because the set of low-level nodes that share the same set of high-level ancestors will share the same prefix path in the tree structure. Each tuple, which represents the currently in-flowing stream data, after being generalized to the m-layer, is inserted into the corresponding path of the H-tree. An example H-tree is shown in Fig. 6.6. In the leaf node of each path, we store the relevant measure information of the cells of the m-layer. The measures of the cells at the upper layers are computed using the H-tree and its associated links.

An obvious advantage of the popular path approach is that the nonleaf nodes represent the cells of those layers (cuboids) along the popular path. Thus these nonleaf nodes naturally serve as the cells of the cuboids along the path. That is, the H-tree serves as a data structure for intermediate computation as well as the storage area for the computed measures of the layers (i.e., cuboids) along the path.

Furthermore, the H-tree structure facilitates the computation of other cuboids or of cells in those cuboids. When a query or a drill-down click requests the computation of cells outside the popular path, one can find the closest lower-level computed cells and use such intermediate computation results to compute the measures requested, because the corresponding cells can be found via a linked list of all the corresponding nodes contributing to the cells. ⋄

Figure 6.6. H-tree structure for cube computation

4.1 Algorithms for cube computation

Algorithms related to the stream cube in general handle the following three cases: (1) the initial computation of the (partially materialized) stream cube by the popular-path approach, (2) incremental update of the stream cube, and (3) online query answering with the popular-path-based stream cube.

First, we present an algorithm for computing the (initial) partially materialized stream cube by the popular-path approach.

Algorithm 1 (Popular-path-based stream cube computation) Computing the initial stream cube, i.e., the cuboids along the popular path between the m-layer and the o-layer, based on the currently collected set of input stream data.

Input. (1) Multi-dimensional multi-level stream data (which consists of a set of tuples, each carrying the corresponding time stamps), (2) the m- and o-layer specifications, and (3) a given popular drilling path.

Output. All the aggregated cells of the cuboids along the popular path between the m- and o-layers.

Method.

1 Each tuple, which represents a minimal addressing unit of multi-dimensional multi-level stream data, is scanned once and generalized to the m-layer. The generalized tuple is then inserted into the corresponding path of the H-tree, increasing the count and aggregating the measure values of the corresponding leaf node in the corresponding slot of the tilted time frame.


2 Since each branch of the H-tree is organized in the same order as the specified popular path, aggregation for each corresponding slot in the tilted time frame is performed from the m-layer all the way up to the o-layer by aggregating along the popular path. The step-by-step aggregation is performed while inserting the new generalized tuples into the corresponding time slot.

3 The aggregated cells are stored in the nonleaf nodes of the H-tree, forming the computed cuboids along the popular path.

Analysis. The H-tree ordering is based on the popular drilling path given by users or experts. This ordering facilitates the computation and storage of the cuboids along the path. The aggregations along the drilling path from the m-layer to the o-layer are performed while generalizing the stream data to the m-layer, which takes only one scan of the stream data. Since the only cells to be computed are those of the cuboids along the popular path, and these cuboids are materialized in the nonleaf nodes of the H-tree, both space and computation overheads are minimized. ⋄
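As a sketch of steps 1 and 2, the following function (building on the illustrative HTree sketch above, not the authors' code) inserts a generalized tuple and, because the path order matches the popular path, lets every nonleaf ancestor accumulate the same measure, so the nonleaf nodes hold the cuboids along the popular path:

def insert_and_aggregate(tree, m_layer_tuple, time_slot, value):
    # Walk the popular-path-ordered path, collecting the nodes visited.
    node = tree.root
    path_nodes = [node]
    for dim in tree.order:
        label = m_layer_tuple[dim]
        node = node.children.setdefault(label, HTreeNode(label))
        path_nodes.append(node)
    # Aggregate the measure into every node on the path, from the o-layer
    # ancestor down to the m-layer leaf, in the corresponding tilted-time slot.
    for n in path_nodes:
        count, total = n.measures.get(time_slot, (0, 0.0))
        n.measures[time_slot] = (count + 1, total + value)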

Second, we discuss how to perform incremental update of the stream data cube in the popular-path cubing approach. Here we deal with the "always-grow" nature of time-series stream data in an online, continuously growing manner.

The process is essentially an incremental computation method, illustrated below using the tilted time frame of Figure 6.1. Assume that the memory contains the previously computed m- and o-layers plus the cuboids along the popular path, and that stream data arrives every second. The new stream data is accumulated in the corresponding H-tree leaf nodes. Suppose the time granularity of the m-layer is the minute. At the end of every minute, the accumulated data is propagated from the leaves to the corresponding higher-level cuboids. When reaching a cuboid whose time granularity is the quarter, the rolled-up measure information remains in the corresponding minute slot until the full quarter (i.e., 15 minutes) is reached; it then rolls up to even higher levels, and so on.

Notice that in this process, the measure in the time interval of each cuboid is accumulated and promoted to the corresponding coarser time granularity when the accumulated data reaches the corresponding time boundary. For example, the measure information of every four quarters is aggregated into one hour and promoted to the hour slot, and in the meantime the quarter slots still retain sufficient information for quarter-based analysis. This design ensures that although the stream data flows in and out, the measure always keeps up to the most recent granularity time unit at each layer.
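The boundary-triggered promotion can be sketched as follows (a simplified illustration: the granularity chain and slot capacities are assumed, and unlike a real tilted time frame, which retains a bounded number of recent fine-grained slots, this sketch simply consumes the promoted slots):

# slots[0] holds minute entries, slots[1] quarter entries, slots[2] hour entries;
# each entry is a (count, sum) pair. 15 minutes roll into a quarter, 4 quarters into an hour.
CAPACITY = [15, 4, None]

def promote(slots, level=0):
    cap = CAPACITY[level]
    if cap is None or len(slots[level]) < cap:
        return
    batch = slots[level][:cap]
    merged = (sum(c for c, _ in batch), sum(s for _, s in batch))
    del slots[level][:cap]           # consume the promoted fine slots (simplification)
    slots[level + 1].append(merged)  # promote to the coarser granularity
    promote(slots, level + 1)        # the coarser level may in turn become full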

Third, we examine how an online query can be answered with such a partially materialized popular-path data cube. If a query inquires about information that is completely contained in the popular-path cuboids, it can be answered by directly retrieving the information stored in the popular-path cuboids. Thus our discussion will focus on the kind of queries that involve aggregate cells not contained in the popular-path cuboids.

A multi-dimensional multi-level stream query usually provides a few instantiated constants and inquires about information related to one or a small number of dimensions. Thus one can consider a query involving a set of instantiated dimensions, Dci, . . . , Dcj, and a set of inquired dimensions, Dql, . . . , Dqk. The set of relevant dimensions, Dr, is the union of the instantiated and the inquired dimensions. For maximal use of the precomputed information available in the popular-path cuboids, one needs to find the highest-level popular-path cuboid that contains Dr. If no such cuboid exists on the path, one has to use the base cuboid at the m-layer to compute it. In either case, the remaining computation is performed by fetching the relevant data set from the cuboid so found and then computing the cuboid consisting of the inquired dimensions.
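A rough sketch of this cuboid selection, with cuboids represented simply as the sets of dimensions they retain (an illustrative simplification that ignores abstraction levels), might look like:

def choose_cuboid(popular_path, relevant_dims, m_layer_dims):
    # popular_path lists cuboids from the o-layer (most aggregated) down to the m-layer.
    needed = set(relevant_dims)
    for cuboid in popular_path:            # scan from the most aggregated cuboid downward
        if needed.issubset(cuboid):
            return frozenset(cuboid)       # highest-level popular-path cuboid covering Dr
    return frozenset(m_layer_dims)         # fall back to the base cuboid at the m-layer

# Example: choose_cuboid([{"A"}, {"A", "C"}, {"A", "B", "C"}], {"C"}, {"A", "B", "C"})
# returns {"A", "C"}; the answer is then computed by aggregating that cuboid's cells.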

5. Performance Study

To evaluate the effectiveness and efficiency of the proposed stream cube and OLAP computation methods, we performed an extensive performance study on synthetic datasets. Our results show that the total memory and computation time taken by the proposed algorithms are small in comparison with several other alternatives, and that it is realistic to compute such a partially aggregated cube, incrementally update it, and perform fast OLAP analysis of stream data using such a precomputed cube.

Besides our experiments on the synthetic datasets, the methods have also been tested on the real datasets of the MAIDS (Mining Alarming Incidents in Data Streams) project at NCSA ([10]). The multidimensional analysis engine of the MAIDS system is constructed based on the algorithms presented in this paper. The experiments demonstrate performance results similar to those reported in this study.

Here we report our performance studies with synthetic data streams of various characteristics. The data streams are generated by a data generator similar in spirit to the IBM data generator ([5]) designed for testing data mining algorithms. The naming convention for the data sets is as follows: D3L3C10T400K means there are 3 dimensions, each dimension contains 3 levels (from the m-layer to the o-layer, inclusive), the node fan-out factor (cardinality) is 10 (i.e., 10 children per node), and there are in total 400K merged m-layer tuples.

Notice that all the experiments are conducted in a static environment as a simulation of online stream processing. This is because the cube computation, especially for the full cube and the top-k cube, may take much more time than the stream flow allows. If it were performed in an online streaming environment, a substantial amount of stream data could be lost due to the slow computation of such data cubes. The simulation serves our purpose, since it clearly demonstrates the cost and the possible delays of stream cubing and indicates what the realistic choice would be in a dynamic streaming environment.

[Figure 6.7: two plots comparing full-cubing, top-k cubing, and popular-path; (a) runtime (in seconds) vs. size (in K tuples), (b) memory space (in MBytes) vs. size.]

Figure 6.7. Cube computation: time and memory usage vs. # tuples at the m-layer for the data set D5L3C10

All experiments were conducted on a 2GHz Pentium PC with 1 GB main memory, running Microsoft Windows-XP Server. All the methods were implemented using Sun Microsystems' Java 1.3.1.

Our design framework has obvious performance advantages over some alternatives in a few aspects, including (1) the tilted time frame vs. a full, non-tilted time frame, (2) using the minimal interesting layer vs. examining stream data at the raw data layer, and (3) computing the cube only up to the observation layer vs. computing it up to the apex layer. Consequently, our feasibility study does not compare against designs that lack these advantages, since they would be obvious losers.

Since a data analyst needs fast online response, and both space and time are critical in processing, we examine both time and space consumption. In our study, besides presenting the total time and memory taken to compute and store such a stream cube, we compare the two measures (time and space) of the popular path approach against two alternatives: (1) the full-cubing approach, i.e., materializing all the cuboids between the m- and o-layers, and (2) the top-k cubing approach, i.e., materializing only the top-k measured cells of the cuboids between the m- and o-layers; we set the top-k threshold to 10%, i.e., only the top 10% (in measure) of the cells are stored at each layer (cuboid). Notice that top-k cubing cannot be used for incremental stream cubing. However, since people may wish to pay attention only to top-k cubes, we still include it in our performance study (as initial cube computation). From the performance results, one can see that if top-k cubing cannot compete with the popular path approach, then, given its difficulty in handling incremental updating, it is unlikely to be a choice for a stream cubing architecture.

[Figure 6.8: two plots comparing full-cubing, top-k cubing, and popular-path; (a) time (in seconds) vs. number of dimensions, (b) memory (in MBytes) vs. number of dimensions.]

Figure 6.8. Cube computation: time and space vs. # of dimensions for the data set L3C10T100K

The performance results of stream data cubing (cube computation) are reported in Figures 6.7 through 6.9.

Figure 6.7 shows the processing time and memory usage of the three approaches with increasing size of the data set, where the size is measured as the number of tuples at the m-layer for the data set D5L3C10. Since full-cubing and top-k cubing compute all the cells from the m-layer all the way up to the o-layer, their total processing time is much higher than that of popular-path. Also, since full-cubing saves all the cube cells, its space consumption is much higher than that of popular-path. The memory usage of top-k cubing falls between the two approaches, and the concrete amount depends on the k value.

Figure 6.8 shows the processing time and memory usage of the three approaches with an increasing number of dimensions, for the data set L3C10T100K. Figure 6.9 shows the processing time and memory usage of the three approaches with an increasing number of levels, for the data set D5C10T50K. The performance results show that popular-path is more efficient than both full-cubing and top-k cubing in computation time and memory usage. Moreover, one can see that increasing the number of dimensions has a much stronger impact on the computation cost (both time and space) than increasing the number of levels.


[Figure 6.9: two plots comparing full-cubing, top-k cubing, and popular-path; (a) time (in seconds) vs. number of levels, (b) memory (in MBytes) vs. number of levels.]

Figure 6.9. Cube computation: time and space vs. # of levels for the data set D5C10T50K

Since incremental update of the stream data cube carries similar comparative costs for the popular-path and full-cubing approaches, and since top-k cubing is inappropriate for incremental updating, we do not present this part of the performance comparison. Notice that when incrementally computing the newly generated stream data, the computation time should be shorter than that shown here because fewer cells are involved in the computation, although the total memory usage may not decrease, due to the need to keep the layers along the popular path between the two critical layers in main memory.

A performance study has also been conducted on online query processing, which likewise shows the superior efficiency of the popular-path approach in comparison with the alternatives. Thus we conclude that popular-path is an efficient and feasible method for computing multi-dimensional, multi-level stream cubes.

6. Related Work

Our work is related to on-line analytical processing and mining in data cubes, and to management and mining of stream data. We briefly review previous research in these areas and point out the differences from our work.

In data warehousing and OLAP, much progress has been made on the efficient support of standard and advanced OLAP queries in data cubes, including selective materialization ([21]), cube computation ([7, 20, 30, 27]), cube gradient analysis ([23, 13]), exception analysis ([26]), intelligent roll-up ([28]), and high-dimensional OLAP analysis ([24]). However, previous studies do not consider the support for stream data, which requires handling huge amounts of fast-changing stream data under the restriction that a data stream can be scanned only once. In contrast, our work considers complex measures in the form of stream data and studies OLAP and mining over partially materialized stream data cubes. Our data structure, to a certain extent, extends the previous work on H-tree and H-cubing ([20]). However, instead of computing a fully materialized data cube as in H-cubing, we use the H-tree structure to store only a small number of cuboids along the popular path. This saves a substantial amount of computation time and storage space and leads to high performance in both cube computation and query processing. We have also studied whether it would be appropriate to use other cube structures, such as star-trees in StarCubing ([30]), dense-sparse partitioning in MM-Cubing ([27]), and shell fragments in high-dimensional OLAP ([24]). Our conclusion is that the H-tree is still the most appropriate structure, since most other structures need either to scan the data set more than once or to know the sparse or dense parts beforehand, which does not fit the single-scan and dynamic nature of data streams.

Recently, there have been intensive studies on the management and querying of stream data ([8, 17, 18, 16]) and on data mining (classification and clustering) of stream data ([22, 19, 25, 29, 2, 15, 3, 4]). Although such studies lead to deep insight and interesting results on stream query processing and stream data mining, they do not address the issues of multidimensional, online analytical processing of stream data. Multidimensional stream data analysis is an essential step for understanding the general statistics, trends, outliers, and other characteristics of online stream data, and it will play an essential role in stream data analysis. This study sets a framework and outlines an interesting approach to stream cubing and stream OLAP, and it distinguishes itself from previous work on stream query processing and stream data mining.

7. Possible Extensions

There are many potential extensions of this work towards comprehensive, high-performance analysis of data streams. Here we outline a few.

Disk-based stream cube. Although a stream cube usually resides in main memory for fast computation, updating, and access, it is important to have important or substantial portions of it stored or mirrored on disk, which may enhance data reliability and system performance. There are several ways to do this. First, based on the design of the tilted time frame, the distant-time portion of the data cube can be stored on disk. This may help reduce the total main memory requirement and the update overhead. The incremental propagation of data in such a distant portion can be done by other processors using other memory space. Second, to ensure that data is not lost in case of system error or power failure, it is important to keep a mirror copy of the stream data cube on disk. Such a mirroring process can be carried out in parallel by other processors. Also, it is possible that a stream cube may miss a period of data due to software error, equipment malfunction, system failure, or other unexpected reasons. Thus a robust stream data cube should have the functionality to run despite missing a short period of data in the tilted time frame. The missing data can be treated by special routines, such as data smoothing, data cleaning, or other special handling, so that the overall stream data can be interpreted correctly without interruption.

Computing complex measures in stream cubes. Although we did not discuss the computation of complex measures in the data cube environment, it is clear that complex measures such as sum, avg, min, max, last, standard deviation, and many others can be handled for the stream data cube in the same manner as discussed in this study. Regression stream cubes can be computed efficiently as indicated in the study of [12]. The distributive and algebraic measures of prediction cubes, as defined in [9], can in principle be computed efficiently in the data stream environment. However, it is not clear how to efficiently handle holistic measures ([14]) in the stream data cubing environment. For example, it is still not clear how some holistic measures, such as quantiles, rank, and median, can be computed efficiently in this framework. This issue is left for future research.

Toward multidimensional online stream mining. This study is on multidimensional OLAP stream data analysis. Many data mining tasks require deeper analysis than simple OLAP analysis, such as classification, clustering, and frequent pattern analysis. In principle, the general framework worked out in this study, including the tilted time frame, the minimal generalized layer and observation layers, and partial precomputation for powerful online analysis, will be useful for in-depth data mining methods. How to extend this framework towards online stream data mining is an interesting research theme.

8. Conclusions

In this paper, we have promoted on-line analytical processing of stream data and proposed a feasible framework for on-line computation of multi-dimensional, multi-level stream cubes.

We have proposed a general stream cube architecture and a stream data cubing method for on-line analysis of stream data. Our method uses a tilted time frame, explores minimal interesting and observation layers, and adopts a popular path approach for efficient computation and storage of the stream cube, to facilitate OLAP analysis of stream data. Our performance study shows that the method is cost-efficient and is a realistic approach given current computer technology. Recently, this stream data cubing methodology has been successfully implemented in the MAIDS project at NCSA (National Center for Supercomputing Applications) at the University of Illinois, and its effectiveness has been tested using online stream data sets ([10]).

Our proposed stream cube architecture shows a promising direction for the realization of on-line, multi-dimensional analysis of data streams. There are many issues to be explored further. In particular, it is important to further develop data mining methods that take advantage of stream cubes for on-line mining of multi-dimensional knowledge in stream data.

References

[1] S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pages 506–521, Bombay, India, Sept. 1996.

[2] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proc. 2003 Int. Conf. Very Large Data Bases (VLDB'03), pages 81–92, Berlin, Germany, Sept. 2003.

[3] C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for projected clustering of high dimensional data streams. In Proc. 2004 Int. Conf. Very Large Data Bases (VLDB'04), pages 852–863, Toronto, Canada, Aug. 2004.

[4] C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On demand classification of data streams. In Proc. 2004 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD'04), pages 503–508, Seattle, WA, Aug. 2004.

[5] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE'95), pages 3–14, Taipei, Taiwan, Mar. 1995.

[6] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. 2002 ACM Symp. Principles of Database Systems (PODS'02), pages 1–16, Madison, WI, June 2002.

[7] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 359–370, Philadelphia, PA, June 1999.

[8] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, 30:109–120, 2001.

[9] B.-C. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. In Proc. 2005 Int. Conf. Very Large Data Bases (VLDB'05), pages 982–993, Trondheim, Norway, Aug. 2005.

[10] Y. D. Cai, D. Clutter, G. Pape, J. Han, M. Welge, and L. Auvil. MAIDS: Mining alarming incidents from data streams. In Proc. 2004 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'04), pages 919–920, Paris, France, June 2004.

[11] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26:65–74, 1997.

[12] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In Proc. 2002 Int. Conf. Very Large Data Bases (VLDB'02), pages 323–334, Hong Kong, China, Aug. 2002.

[13] G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining multi-dimensional constrained gradients in data cubes. In Proc. 2001 Int. Conf. Very Large Data Bases (VLDB'01), pages 321–330, Rome, Italy, Sept. 2001.

[14] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29–54, 1997.

[15] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining frequent patterns in data streams at multiple time granularities. In H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha, editors, Data Mining: Next Generation Challenges and Future Directions. AAAI/MIT Press, 2004.

[16] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. 2001 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'01), pages 58–66, Santa Barbara, CA, May 2001.

[17] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proc. 2001 Int. Conf. Very Large Data Bases (VLDB'01), pages 79–88, Rome, Italy, Sept. 2001.

[18] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continuous data streams. In Proc. 2001 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'01), pages 13–24, Santa Barbara, CA, May 2001.

[19] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proc. 2000 Symp. Foundations of Computer Science (FOCS'00), pages 359–366, Redondo Beach, CA, 2000.

[20] J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. In Proc. 2001 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'01), pages 1–12, Santa Barbara, CA, May 2001.

[21] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 205–216, Montreal, Canada, June 1996.

[22] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proc. 2001 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD'01), San Francisco, CA, Aug. 2001.

[23] T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Data Mining and Knowledge Discovery, 6:219–258, 2002.

[24] X. Li, J. Han, and H. Gonzalez. High-dimensional OLAP: A minimal cubing approach. In Proc. 2004 Int. Conf. Very Large Data Bases (VLDB'04), pages 528–539, Toronto, Canada, Aug. 2004.

[25] G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. 2002 Int. Conf. Very Large Data Bases (VLDB'02), pages 346–357, Hong Kong, China, Aug. 2002.

[26] S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. In Proc. Int. Conf. Extending Database Technology (EDBT'98), pages 168–182, Valencia, Spain, Mar. 1998.

[27] Z. Shao, J. Han, and D. Xin. MM-Cubing: Computing iceberg cubes by factorizing the lattice space. In Proc. 2004 Int. Conf. Scientific and Statistical Database Management (SSDBM'04), pages 213–222, Santorini Island, Greece, June 2004.

[28] G. Sathe and S. Sarawagi. Intelligent rollups in multidimensional OLAP data. In Proc. 2001 Int. Conf. Very Large Data Bases (VLDB'01), pages 531–540, Rome, Italy, Sept. 2001.

[29] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. 2003 ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'03), pages 226–235, Washington, DC, Aug. 2003.

[30] D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing iceberg cubes by top-down and bottom-up integration. In Proc. 2003 Int. Conf. Very Large Data Bases (VLDB'03), pages 476–487, Berlin, Germany, Sept. 2003.

[31] Y. Zhao, P. Deshpande, and J. Naughton. An array-based algorithm for simultaneous multi-dimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'97), pages 159–170, 1997.


Chapter 7

LOAD SHEDDING IN DATA STREAM SYSTEMS

Brian Babcock
Department of Computer Science
Stanford University

[email protected]

Mayur Datar
Google, Inc.

[email protected]

Rajeev Motwani
Department of Computer Science
Stanford University

[email protected]

Abstract   Systems for processing continuous monitoring queries over data streams must be adaptive because data streams are often bursty and data characteristics may vary over time. In this chapter, we focus on one particular type of adaptivity: the ability to gracefully degrade performance via "load shedding" (dropping unprocessed tuples to reduce system load) when the demands placed on the system cannot be met in full given available resources. Focusing on aggregation queries, we present algorithms that determine at what points in a query plan load shedding should be performed and what amount of load should be shed at each point in order to minimize the degree of inaccuracy introduced into query answers. We also discuss strategies for load shedding for other types of queries (set-valued queries, join queries, and classification queries).

Keywords: data streams, load shedding, adaptive query processing, sliding windows, autonomic computing

One of the main attractions of a streaming mode of data processing, as opposed to the more conventional approach for dealing with massive data sets in which data are periodically collected and analyzed in batch mode, is the timeliness of the insights that are provided. In many cases, the ability to issue continuous queries and receive real-time updates to the query answers as new data arrives can be the primary motivation for preferring data stream processing technology to alternatives such as data warehousing. For this reason, it is important that a data stream processing system be able to continue to provide timely answers even under difficult conditions, such as when temporary bursts in data arrival rates threaten to overwhelm the capabilities of the system.

Many data stream sources (for example, web site access patterns, transactions in financial markets, and communication network traffic) are prone to dramatic spikes in volume (e.g., spikes in traffic at a corporate web site following the announcement of a new product, or the spikes in traffic experienced by news web sites and telephone networks on September 11, 2001). Because peak load during a spike can be orders of magnitude higher than typical loads, fully provisioning a data stream monitoring system to handle the peak load is generally impractical. However, in many monitoring scenarios, it is precisely during bursts of high load that the function performed by the monitoring application is most critical. Therefore, it is particularly important for systems processing continuous monitoring queries over data streams to be able to automatically adapt to unanticipated spikes in input data rates that exceed the capacity of the system. An overloaded system will be unable to process all of its input data and keep up with the rate of data arrival, so load shedding, i.e., discarding some fraction of the unprocessed data, becomes necessary in order for the system to continue to provide up-to-date query responses. In this chapter, we consider the question of how best to perform load shedding: How many tuples should be dropped, and where in the query plan should they be dropped, so that the system is able to keep up with the rate of data arrival while minimizing the degree of inaccuracy introduced into the query answers?

The answer to this question often differs depending on the type of queries being answered, since different classes of queries have different loss metrics for measuring the degradation in answer quality caused by load shedding. In the first part of the chapter, we perform a detailed study of the load shedding problem for one particular class of queries, sliding window aggregate queries. Afterwards, we consider several other classes of queries (set-valued queries with tuple-level utility functions, sliding-window join queries, and classification queries), and we briefly discuss load shedding techniques appropriate for each query class.

1. Load Shedding for Aggregation Queries

The continuous monitoring queries that we consider in this section are sliding window aggregate queries, possibly including filters and foreign-key joins with stored relations, over continuous data streams. We restrict our attention to this query class, omitting monitoring queries involving joins between multiple streams, or non-foreign-key joins between streams and stored relations.

Overview of Approach In this section, we describe a technique involving the introduction of load shedding operators, or load shedders, at various points in the query plan. Each load shedder is parameterized by a sampling rate p. The load shedder flips a coin for each tuple that passes through it. With probability p, the tuple is passed on to the next operator, and with probability 1 − p, the tuple is discarded. To compensate for the lost tuples caused by the introduction of load shedders, the aggregate values calculated by the system are scaled appropriately to produce unbiased approximate query answers.
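A minimal Python sketch of such a load shedder and of the unbiased scale-up for a SUM aggregate (illustrative names, not the STREAM system's implementation) is:

import random

class LoadShedder:
    def __init__(self, p):
        self.p = p                           # sampling rate, 0 < p <= 1

    def process(self, tuples):
        for t in tuples:
            if random.random() < self.p:     # pass the tuple on with probability p
                yield t                      # otherwise it is silently discarded

def estimated_sum(sampled_values, effective_sampling_rate):
    # Scaling by 1/P (the product of sampling rates along the query path)
    # makes the sample sum an unbiased estimate of the true SUM.
    return sum(sampled_values) / effective_sampling_rate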

The decisions about where to introduce load shedders and how to set the sampling rate for each load shedder are based on statistics about the data streams, including observed stream arrival rates and operator selectivities. We use statistical techniques similar to those used in approximate query processing systems to make these decisions in such a way as to achieve the best attainable accuracy given the data input rates.

1.1 Problem Formulation

Preliminaries. For our purposes, a continuous data stream S will be defined as a potentially unbounded sequence of tuples s1, s2, s3, . . . that arrive over time and must be processed online as they arrive. A sliding window aggregate is an aggregation function applied over a sliding window of the most recently arrived data stream tuples (for example, a moving average). The aggregation functions that we consider are SUM and COUNT, though the techniques described can be generalized to other functions such as AVG and MEDIAN. Sliding windows may be either time-based, meaning that the window consists of all tuples that have arrived within some time interval w of the present (e.g., the last 4 hours), or tuple-based, meaning that the window consists of the N most recently arrived tuples (e.g., the last 10,000 tuples). A filter is a local selection condition on tuples from a data stream.

This class of queries (sliding window aggregate queries) is important and useful for many data stream monitoring applications, including network traffic engineering, which we will use as an example application domain. Network analysts often monitor sliding window aggregates covering multiple timescales over packet traces from routers, typically filtering based on the internet protocol used, source and destination port numbers, and similar considerations. Foreign-key joins or semijoins with stored relations may be used in monitoring queries to perform filtering based on some auxiliary information that is not stored in the data stream itself (e.g., the industry grouping for a security in a financial monitoring application). For our purposes, such joins have the same structure and effects as an expensive selection predicate or a user-defined function.

[Figure 7.1: data streams S1 and S2 flow through shared segments of query operators and load shedders to aggregate operators (Σ) that answer queries Q1 through Q6, together with a Statistics Manager and a Load Manager.]

Figure 7.1. Data Flow Diagram

Most data stream monitoring scenarios involve multiple concurrent continuous queries. Sharing of common sub-expressions among queries is desirable to improve the scalability of the monitoring system. For this reason, it is important that a load shedding policy take into account the structure of operator sharing among query plans rather than attempting to treat each query as an isolated unit.

The input to the load shedding problem consists of a set of queries q1, . . . , qn over data streams S1, . . . , Sm, a set of query operators O1, . . . , Ok, and some associated statistics that are described below. The operators are arranged into a data flow diagram (similar to [3]) consisting of a directed acyclic graph with m source nodes representing the data streams, n sink nodes representing the queries, and k internal nodes representing the query operators. (Please refer to Figure 7.1.) The edges in the graph represent data flow between query operators. For each query qi, there is a corresponding path in the data flow diagram from some data stream Sj through a set of query operators Oi1, Oi2, . . . , Oip to node qi. This path represents the processing necessary to compute the answer to query qi, and it is called the query path for query qi. Because we do not consider joins between data streams, the data flow diagram can be thought of as being composed of a set of trees. The root node of each tree is a data stream Sj, and the leaf nodes of the tree are the queries that monitor stream Sj. Let T(Sj) denote the tree of operators rooted at stream source Sj.


Every operator Oi in the data flow diagram is associated with two parameters: its selectivity si and its processing time per tuple ti. The selectivity of an operator is defined as the ratio between the number of output tuples produced by the operator and the number of input tuples consumed by the operator. The processing time per tuple for an operator is defined as the average amount of time required by the operator to process each input tuple. The last operator along any query path is a windowed aggregate operator. The output of this operator is the final answer to the query and is therefore not consumed by any other operator, so the selectivity of such an operator can be considered to be zero. Each SUM aggregate operator Oi is associated with two additional parameters, the mean µi and standard deviation σi of the values in the input tuples that are being aggregated. The final parameters to the load shedding problem are the rate parameters rj, one for each data stream Sj. Rate parameter rj represents the average rate of tuple arrival on stream Sj, measured in tuples per unit time.

Estimation of Input Parameters. Although we have described these input parameters (selectivity, stream rate, etc.) as known, fixed quantities, in reality their exact values vary over time and cannot be known precisely in advance. In the data stream management system STREAM [8], there is a Statistics Manager module that estimates the values of these parameters. The query operators in STREAM are instrumented to report statistics on the number of tuples processed and output by the operator and on the total processor time devoted to the operator. Based on these statistics, the Statistics Manager can estimate the selectivities and processing times of operators as well as the data stream arrival rates. During times when statistics gathering is enabled, the SUM aggregation operator additionally maintains statistics on the sum and sum-of-squares of the aggregate values of the tuples that it processes, allowing the estimation of the mean and standard deviation of the values of the attribute being summed. As stream arrival rates and data characteristics change, the appropriate amount of load to shed and the right places to shed it may change as well. Therefore, in the STREAM system, estimates for the load shedding input parameters are periodically refreshed by the Statistics Manager, and load shedding decisions are periodically revisited.

Accuracy Metric. Let A1, A2, . . . , An be the answers to queries q1, q2, . . . , qn at some point in time, and let Â1, Â2, . . . , Ân be the answers produced by the data stream monitoring system. If the input rates are high enough that load shedding becomes necessary, the data stream monitoring system may not be able to produce the correct query answers, i.e., Âi ≠ Ai for some or all queries qi. The quality of a load shedding policy can be measured in terms of the deviation of the estimated answers produced by the system from the actual answers. Since the relative error in a query answer is generally more important than the absolute magnitude of the error, the goal of our load shedding policy is to minimize the relative error for each query, defined as ǫi = |Ai − Âi|/|Ai|. Moreover, as there are multiple queries, we aim to minimize the maximum error across all queries, ǫmax = max1≤i≤n ǫi.

Load Constraint. The purpose of load shedding is to increase the throughput of the monitoring system so that the rate at which tuples are processed is at least as high as the rate at which new input tuples are arriving on the data streams. If this relationship does not hold, then the system will be unable to keep up with the arriving data streams, and input buffers and the latency of responses will grow without bound. We capture this requirement in an equation, which we call the load equation, that acts as a constraint on load shedding decisions.

Before presenting the load equation, we first introduce some additional notation. As mentioned earlier, each operator Oi is part of some tree of operators T(Sj). Let Ui denote the set of operators "upstream" of Oi, that is, the operators that fall on the path from Sj to Oi in the data flow diagram. If some of the operators upstream of Oi are selective, the data input rate seen by operator Oi will be less than the data stream rate rj at the stream source, since some tuples are filtered out before reaching Oi. Furthermore, if load shedders are introduced upstream of Oi, they also reduce the effective input rate seen by Oi. Let us define pi as the sampling rate of the load shedder introduced immediately before operator Oi, and let pi = 1 when no such load shedder exists. Thus, to measure the time spent in processing operator Oi, we are interested in the effective input rate for Oi, which we denote

r(O_i) = r_{src(i)} \, p_i \prod_{O_x \in U_i} s_x p_x .

(Here src(i) denotes the index of the data stream source for operator Oi, i.e., src(i) = j for Oi ∈ T(Sj).) This leads to the load equation:

Equation 1.1 (Load Equation) Any load shedding policy must select sampling rates pi to ensure:

\sum_{1 \le i \le k} t_i \, r_{src(i)} \, p_i \prod_{O_x \in U_i} s_x p_x \;\le\; 1        (7.1)

The left-hand side of Equation 1.1 gives the total amount of time required for the system to process the tuples that arrive during one time unit, assuming that the overhead introduced by load shedding is negligible. Clearly, this processing time can be at most one time unit, or else the system will be unable to keep up with the arriving data streams. The assumption that the cost of load shedding is small relative to the cost of query operators, and can therefore be safely ignored, is borne out by experimental evidence [2].
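The constraint is easy to evaluate for a candidate assignment of sampling rates; the sketch below assumes each operator is described by an illustrative record holding its per-tuple time t, selectivity s, the rate p of the shedder placed immediately before it (1 if none), its source stream rate, and the indices of its upstream operators:

def load(operators):
    total = 0.0
    for op in operators:
        effective_rate = op["rate"] * op["p"]
        for u in op["upstream"]:                     # apply upstream selectivities and shedders
            effective_rate *= operators[u]["s"] * operators[u]["p"]
        total += op["t"] * effective_rate            # time this operator consumes per time unit
    return total

def satisfies_load_equation(operators):
    return load(operators) <= 1.0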

Problem Statement. The formal statement of the load shedding problem is as follows: Given a data flow diagram, the parameters si, ti, µi, σi for each operator Oi, and the rate parameters rj for each data stream Sj, select load shedding sampling rates pi to minimize the maximum relative error ǫmax = max1≤i≤n ǫi, subject to the constraint that the load equation, Equation 1.1, must be satisfied.

1.2 Load Shedding Algorithm

In this section, we describe our algorithm for determining the locations at which load shedding should be performed and for setting the sampling rate parameters pi. The algorithm has two steps:

1 Determine the effective sampling rates for each query that will distribute error evenly among all queries.

2 Determine where in the data flow diagram load shedding should be performed to achieve the appropriate rates and satisfy the load equation.

These two steps are described in detail below.

Allocation of Work Among Queries. Recall that the error metric we use to measure the accuracy of a query response is the relative error. It is impossible to precisely predict the relative error in query answers that will arise from a particular choice of load shedding policy, because the data values in the discarded tuples are unknown. However, if we assume some knowledge about the distribution of data values, for example based on previously seen tuples from the data streams, then we can use probabilistic techniques to get good estimates of what the relative error will be. There is some variability in the relative error, even if the data distribution is known exactly, because the approximate answers produced by the system depend on the outcomes of the random coin flips made by the load shedders. Therefore, to compare alternative load shedding policies, we do the following: for a fixed small constant δ (we use 0.01), we say that a load shedding policy achieves error ǫ if, for each query qi, the relative error resulting from using the policy to estimate the answer to qi exceeds ǫ with probability at most δ.

Relating Sampling Rate and Error Suppose the query path for a SUM query qi consists of the sequence of operators Oi1, Oi2, . . . , Oiz. Consider a load shedding policy that introduces load shedders along the query path with sampling rates pi1, pi2, . . . , piz. Let τ be a tuple that would pass through all the query operators and contribute to the query answer in the absence of any load shedders. When load shedders are present, τ will contribute to the answer if and only if it passes through all the load shedders, which occurs with probability Pi = pi1 pi2 · · · piz. We will refer to Pi as the effective sampling rate for query qi.


Let Qi denote the set of tuples from the current sliding window that would pass all selection conditions and contribute to the query answer in the absence of load shedders. Let Ni be the number of tuples in the set Qi. From the above discussion, it is clear that in the presence of load shedders, this aggregate query will be answered based on a sample of Qi where each element is included independently with probability Pi. For the tuples in the set Qi, let v1, v2, . . . , vNi denote the values of the attribute being summed, and let Ai be their sum. The approximate answer Âi produced by the system will be the sum of the values for the tuples that are included in the sample, scaled by the inverse of the effective sampling rate (1/Pi). The following proposition, which follows directly from a result due to Hoeffding (Theorem 2 in [6]), gives an upper bound on the probability that the relative error exceeds a certain threshold ǫi.

Proposition 1.1 Let X1, X2, . . . , XN be N random variables, such that each random variable Xj takes the value vj/P with probability P and the value zero otherwise. Let Âi be the sum of these random variables and let A_i = \sum_{j=1}^{N} v_j. If we denote by SSi the sum \sum_{j=1}^{N} v_j^2, then

\Pr\{ |\hat{A}_i - A_i| \ge \epsilon |A_i| \} \le 2 \exp\!\left( -2 P^2 \epsilon^2 A_i^2 / SS_i \right)

Thus, for a query qi, to ensure that the probability that the relative error exceeds ǫi is at most δ, we must guarantee that 2 \exp(-2 P_i^2 \epsilon_i^2 A_i^2 / SS_i) \le \delta, which occurs when Pi ǫi ≥ Ci, where we define

C_i = \sqrt{ \frac{SS_i}{2 A_i^2} \log \frac{2}{\delta} } .

Letting the mean and variance of the values v1, v2, . . . , vNi be denoted by \mu_i = \sum_{j=1}^{N_i} v_j / N_i and \sigma_i^2 = \sum_{j=1}^{N_i} (v_j - \mu_i)^2 / N_i, respectively, the ratio SS_i/A_i^2 is equal to (\sigma_i^2 + \mu_i^2)/(N_i \mu_i^2). Thus the right-hand side of the preceding inequality reduces to

C_i = \sqrt{ \frac{\sigma_i^2 + \mu_i^2}{2 N_i \mu_i^2} \log \frac{2}{\delta} } .

If we want a load shedding policy to achieve relative error ǫi, we must guarantee that Pi ≥ Ci/ǫi. Thus, to set Pi correctly, we need to estimate Ci. Recall that we are given estimates for µi and σi (provided by the Statistics Manager) as inputs to the load shedding problem. The value of Ni can be calculated from the size of the sliding window, the estimated selectivities of the operators in the query path for qi, and (in the case of time-based sliding windows) the estimated data stream rate rj.

The larger the value of Ci, the larger the effective sampling rate Pi needs to be to achieve a fixed error ǫi with a fixed confidence bound δ. Clearly, Ci is larger for queries that are more selective, for queries over smaller sliding windows, and for queries where the distribution of values for the attribute being summed is more skewed. For a COUNT aggregate, µi = 1 and σi = 0, so only the window size and predicate selectivity affect the effective sampling rate.
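The calculation of Ci and of the minimum effective sampling rate for a target error transcribes directly into code; the following sketch (function names are illustrative) follows the formula derived above:

import math

def c_value(mu, sigma, n, delta=0.01):
    # C_i = sqrt( (sigma^2 + mu^2) / (2 * N_i * mu^2) * log(2 / delta) )
    return math.sqrt((sigma ** 2 + mu ** 2) / (2.0 * n * mu ** 2) * math.log(2.0 / delta))

def required_sampling_rate(mu, sigma, n, target_error, delta=0.01):
    # P_i must be at least C_i / epsilon_i; it is capped at 1 since it is a probability.
    return min(1.0, c_value(mu, sigma, n, delta) / target_error)

# For a COUNT aggregate (mu = 1, sigma = 0) over roughly 10,000 qualifying tuples,
# required_sampling_rate(1.0, 0.0, 10000, 0.05) is about 0.33.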


On the Effects of Parameter Estimation Errors Since the values of the parameters that affect the effective sampling rate are known only approximately, and since they are subject to change over time, using the estimated parameter values directly to calculate effective sampling rates may result in under-sampling for a particular query, causing higher relative error. For example, if the data characteristics change so that an attribute that previously exhibited little skew suddenly becomes highly skewed, the relative error for a query that aggregates the attribute is likely to be higher than predicted.

The impact of a mistake in estimating parameters will be more pronounced for a query whose Pi is low than for a query with higher Pi. Therefore, for applications where rapid changes in data characteristics are of concern, a more conservative policy for setting effective sampling rates could be implemented by adding a constant "fudge factor" to the estimate of Ci for each query. In effect, this would result in resources being distributed among the queries somewhat more evenly than would be optimal based on the estimated parameter values. Such a modification would misallocate resources somewhat if the estimated parameters turn out to be correct, but it would be more forgiving in the case of significant errors in the estimates.

Choosing Target Errors for Queries The objective that we seek to minimize is the maximum relative error ǫi across all queries qi. It is easy to see that the optimal solution will achieve the same relative error ǫ for all queries.

Observation 1.2 In the optimal solution, the relative error (ǫi) is equal for all queries for which load shedding is performed.

Proof: The proof is by contradiction. Suppose that ǫi < ǫj for two queries qi, qj. Since ǫi = Ci/Pi < ǫj, we could reduce Pi to P′i by introducing a load shedder before the final aggregation operator for qi with sampling rate P′i/Pi, so that ǫ′i = Ci/P′i = ǫj. By doing so, we keep the maximum relative error unchanged but reduce the processing time, gaining some slack in the load equation. This slack can be distributed evenly across all queries by increasing all load shedder sampling rates slightly, reducing the relative error for all queries.

For an optimal solution, since the relative errors for all queries are the same, the effective sampling rate Pi for each query qi will be proportional to the Ci value for that query, since Pi = Ci/ǫi = Ci/ǫmax. Therefore, the problem of selecting the best load shedding policy reduces to determining the best achievable ǫmax and inserting load shedders such that, for each query qi, the effective sampling rate Pi is equal to Ci/ǫmax. In doing so we must guarantee that the modified query plan, after inserting load shedders, satisfies the load equation (Equation 1.1).


Placement of Load Shedders. For now, assume that we have guessed the right value of ǫmax, so that we know the exact effective sampling rate Pi for each query. (In fact, this assumption is unnecessary, as we explain below.) Then our task is reduced to solving the following problem: Given a data flow diagram along with a set of target effective sampling rates Pi for each query qi, modify the diagram by inserting load shedding operators and set their sampling rates so that the effective sampling rate for each query qi is equal to Pi and the total processing time is minimized.

If there is no sharing of operators among queries, it is straightforward to see that the optimal solution is to introduce a load shedder with sampling rate pi = Pi before the first operator in the query path for each query qi. Introducing a load shedder as early in the query path as possible reduces the effective input rate for all "downstream" operators and conforms to the general query optimization principle of pushing selection conditions down.

Introducing load shedders and setting their sampling rates is more complicated when there is sharing among query plans. Suppose that two queries q1 and q2 share the first portion of their query paths but have different effective sampling rate targets P1 and P2. Since a load shedder placed at the shared beginning of the query path will affect the effective sampling rates of both queries, it is not immediately clear how to simultaneously achieve both effective sampling rate targets in the most efficient manner, though clearly any solution will necessarily involve the introduction of load shedding at intermediate points in the query paths.

We will define a shared segment in the data flow diagram as follows: Suppose we label each operator with the set of all queries that contain the operator in their query paths. Then the set of all operators having the same label is a shared segment.

Observation 1.3 In the optimal solution, load shedding is only performed at the start of shared segments.

This observation is true for the same reason that load shedding should always be performed at the beginning of the query plan when no sharing is present: the effective sampling rates for all queries will be the same regardless of the position of the load shedder on the shared segment, but the total execution time will be smallest when the load shedding is performed as early as possible.

The preceding observation rules out some types of load shedding configurations, but it is not enough to determine exactly where load shedding should be performed. The following simple example leads us to a further observation about the structure of the optimal solution:

Example 7.1 Consider a simple data flow diagram with 3 operators as shown in Figure 7.2. Suppose the query nodes q1 and q2 must have effective sampling rates equal to 0.5 and 0.8 respectively. Each operator (A, B, and C) is in its own shared segment, so load shedding could potentially be performed before any operator. Imagine a solution that places load shedders before all three operators A, B, and C with sampling rates p1, p2, and p3 respectively. Since p1p2 = 0.5 and p1p3 = 0.8, we know that the ratio p2/p3 = 0.5/0.8 = 0.625 in any solution. Consider the following modification to the solution: eliminate the load shedder before operator C and change the sampling rates of the other two load shedders to p′1 = p1p3 = 0.8 and p′2 = p2/p3 = 0.625. This change does not affect the effective sampling rates, because p′1p′2 = p1p2 = 0.5 and p′1 = p1p3 = 0.8, but the resulting plan has lower processing time per tuple. Effectively, we have pushed down the savings from load shedder p3 to before operator A, thereby reducing the effective input rate to operator A while leaving all other effective input rates unchanged.

[Figure 7.2: stream S feeds operator A, whose output feeds operators B (query q1, P1 = 0.5) and C (query q2, P2 = 0.8); on the left, load shedders p1, p2, and p3 precede A, B, and C; on the right, the equivalent plan has shedders with rates 0.8 before A and 0.625 before B.]

Figure 7.2. Illustration of Example 7.1

Let us define a branch point in a data flow diagram as a point where one shared segment ends by splitting into k > 1 new shared segments. We will call the shared segment terminating at a branch point the parent segment and the k shared segments originating at the branch point the child segments. We can generalize the preceding example as follows:

Observation 1.4 Let qmax be the query that has the highest effective sampling rate among all queries sharing the parent segment of a branch point B. In the optimal solution, the child segment of B that lies on the query path for qmax will not contain a load shedder. All other child segments of B will contain a load shedder with sampling rate Pchild/Pmax, where qchild is defined for each child segment as the query with the highest effective sampling rate among the queries sharing that child segment.


[Figure 7.3: a branch point B below stream S, with queries of effective sampling rates P1 = 0.1, P2 = 0.3, P3 = 0.25, P4 = 0.5, and P5 = 0.4, so Pmax = 0.5; the child segment on the query path of qmax carries no load shedder, while the other child segments carry load shedders with rates Pchild/Pmax (labels such as 0.6 and 0.5 in the figure).]

Figure 7.3. Illustration of Observation 1.4

Observation 1.4 is illustrated in Figure 7.3. The intuition underlying this observation is that, since all queries sharing the parent segment must shed at least a (1 − Pmax)-fraction of tuples, that portion of the load shedding should be performed as early as possible, and no later than the beginning of the shared segment. The same intuition leads us to a final observation that completes our characterization of the optimal load shedding solution. Let us refer to a shared segment that originates at a data stream as an initial segment.

Observation 1.5 Let qmax be the query that has the highest effective sampling rate among all queries sharing an initial segment S. In the optimal solution, S will contain a load shedder with sampling rate Pmax.

The combination of Observations 1.3, 1.4, and 1.5 completely specifies the optimal load shedding policy. This policy can be implemented using a simple top-down algorithm. If we collapse shared segments in the data flow diagram into single edges, the result is a set of trees where the root node of each tree is a data stream Sj, the internal nodes are branch points, and the leaf nodes are queries. We will refer to the resulting set of trees as the collapsed tree representation of the data flow diagram. For any internal node x in the collapsed tree representation, let Px denote the maximum over all the effective sampling rates Pi corresponding to the leaves of the subtree rooted at this node.

The following definition will be useful in the proof of Theorem 1.7.

Definition 1.6 The prefix path probability of a node x in the collapsed tree representation is defined as the product of the sampling rates of all the load shedders on the path from node x to the root of its tree. If there are no load shedders between the root and node x, then the prefix path probability of x is 1.


Algorithm 1 Procedure SetSamplingRate(x, Rx)

if x is a leaf node then
    return
end if
Let x1, x2, . . . , xk be the children of x
for i = 1 to k do
    if Pxi < Rx then
        Shed load with p = Pxi/Rx on edge (x, xi)
    end if
    SetSamplingRate(xi, Pxi)
end for

Figure 7.4. Procedure SetSamplingRate(x, Rx)


The pseudocode in Algorithm 7.4 operates over the collapsed tree representation to introduce load shedders and assign sampling rates, starting with the call SetSamplingRate(Sj, 1) for each data stream Sj.
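A compact Python rendering of this procedure may make the recursion concrete. The node representation (a children list plus the precomputed Px value) and the example tree are assumptions made here for illustration; they are not taken from the figures or the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    """A node of the collapsed tree: the stream root, a branch point, or a query leaf."""
    name: str
    p_max: float                           # Px: max effective sampling rate below this node
    children: List["Node"] = field(default_factory=list)

def set_sampling_rate(x: Node, r_x: float,
                      shedders: List[Tuple[str, str, float]]) -> None:
    """Place load shedders on the edges below x, given prefix path probability r_x."""
    if not x.children:                     # leaf node (a query): nothing to place
        return
    for child in x.children:
        if child.p_max < r_x:              # child needs a lower rate than what reaches it
            shedders.append((x.name, child.name, child.p_max / r_x))
        set_sampling_rate(child, child.p_max, shedders)

# A purely hypothetical collapsed tree: stream S -> branch A -> {query q1, branch B},
# and B -> {q2, q3}, with target effective sampling rates 0.8, 0.5 and 0.2.
q1, q2, q3 = Node("q1", 0.8), Node("q2", 0.5), Node("q3", 0.2)
tree = Node("S", 0.8, [Node("A", 0.8, [q1, Node("B", 0.5, [q2, q3])])])
placed: List[Tuple[str, str, float]] = []
set_sampling_rate(tree, 1.0, placed)
print(placed)   # [('S', 'A', 0.8), ('A', 'B', 0.625), ('B', 'q3', 0.4)]
```

Multiplying the shedder rates along each root-to-leaf path recovers the target effective sampling rates (0.8, 0.5 and 0.2), which is exactly the claim proved in Theorem 1.7 below.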

Theorem 1.7 Among all possible choices for the placement of load shedders and their sampling rates which result in a given set of effective sampling rates for the queries, the solution generated by the SetSamplingRate procedure has the lowest processing time per tuple.

Proof: Note that in each recursive invocation of SetSamplingRate(x, Rx), the second parameter Rx is equal to the prefix path probability of node x. To prove the theorem, we first prove the claim that for each node x other than the root, the prefix path probability of x is equal to Px.

The proof of the claim is by induction on the height of the tree. The base case consists of the root node and its children. The claim is trivially true for the root node. For a node n that is the child of the root, the top-level invocation of SetSamplingRate, with Rroot = 1, places a load shedder with sampling rate Pn/Rroot = Pn at the beginning of edge (root, n), so the prefix path probability of n is equal to Pn.

For the inductive case, consider any node b in the tree which is the child of some non-root node a. Assume that the claim holds for node a. When SetSamplingRate is called with a as an argument, it places a load shedder with sampling rate Pb/Pa at the beginning of edge (a, b). Thus, by the inductive hypothesis, the product of sampling rates of load shedders from the root to node b equals Pa × Pb/Pa = Pb, proving the claim.


Thus we guarantee that the prefix path probability of any node is equal to the highest effective sampling rate of any query which includes that node in its query path. No solution could set a prefix path probability less than this value, since it would otherwise violate the effective sampling rates for that query. Thus the effective input rate of each operator is the minimum that can be achieved subject to the constraint that prefix path probabilities at the leaf nodes should equal the specified effective sampling rates. This proves the optimality of the algorithm.

Determining the Value of ǫmax. An important point to note about the algorithm is that, except for the first load shedder that is introduced just after the root node, the sampling rates for all others depend only on the ratios between effective sampling rates (each sampling rate is equal to Pi/Pj = Ci/Cj for some i, j) and not on the actual Pi values themselves. As a consequence, it is not actually necessary for us to know the value of ǫmax in advance. Instead, we can express each effective sampling rate Pi as Ciλ, where λ = 1/ǫmax is an unknown multiplier. On each query path, there is at most one load shedder whose sampling rate depends on λ, and therefore the load equation becomes a linear function of λ. After running Algorithm 7.4, we can easily solve Equation 1.1 for the resulting configuration to obtain the correct value of λ that makes the inequality in Equation 1.1 tight.

Another consequence of the fact that only load shedders on initial segments depend on the actual Pi values is that the load shedding structure remains stable as the data stream arrival rates rj change. The effective sampling rate Pi for each query qi over a given data stream Sj depends on the rate rj in the same way. Therefore, changing rj does not affect the ratio between the Pi values for these queries. The only impact that a small change to rj will have is to modify the sampling rates for the load shedders on the initial segments.

When determining ǫmax in situations where the system load is only slightly above system capacity, an additional consideration sometimes needs to be taken into account: when no load shedding is performed along the query path for a given query, the error on that query drops to zero. By contrast, for each query there is a minimum error threshold (Ci) below which no error guarantees based on Proposition 1.1 can be given as long as any load shedding is performed along the query path. As the effective sampling rate Pi increases, the relative error ǫi decreases continuously while Pi < 1 and then makes a discontinuous jump (from ǫi = Ci to ǫi = 0) at Pi = 1. Our algorithm can be easily modified to incorporate this discontinuity, as described in the next paragraph.

In some cases, the value of λ that makes the inequality in Equation 1.1 tight may be greater than 1/Cmax, where Cmax is the proportionality constant (derived using Proposition 1.1) of the query qmax with maximum target effective sampling rate. Such a value of λ corresponds to an infeasible target effective sampling rate for query qmax, since Pmax = Cmaxλ > 1. It is not meaningful to have a load shedder with sampling rate greater than one, so the maximum possible effective sampling rate for any query is 1, which is attained when no load shedding is performed for that query. To handle this case, we set Pmax = 1 and re-compute the placement of load shedders using the SetSamplingRate procedure (Algorithm 7.4). This re-computation may need to be performed several times, each time forcing an additional query's target sampling rate to 1, until eventually Pi ≤ 1 for all queries qi.

1.3 Extensions

We briefly discuss how to extend our techniques to incorporate quality of service guarantees and a more general class of queries.

Quality of Service. By taking as our objective the minimization of the maximum relative error across all queries, we have made the implicit assumption that all queries are equally important. In reality, in many monitoring applications some queries can be identified as being more critical than others. Our techniques can easily be adapted to incorporate varying quality of service requirements for different queries, either through the introduction of query weights, or query priorities, or both.

One modification would be to allow users to associate a weight or importance wi with each query qi. With weighted queries, the goal of the system is to minimize the maximum weighted relative error. When computing the effective sampling rate target for the queries, instead of ensuring that Ci/ǫmax is equal for all queries qi, we ensure that Ci/(wiǫmax) is equal. In other words, instead of Pi ∝ Ci we have Pi ∝ Ciwi.

An alternative way of specifying query importance is to assign a discrete priority level to each query. Then the goal of the system is to minimize the maximum relative error across all queries of the highest priority level. If all these queries can be answered exactly, then the system attempts to minimize the maximum relative error across queries with the second-highest priority level, and so on.

More General Query Classes. We have discussed the load shedding problem in the context of a particular class of data stream monitoring queries, aggregation queries over sliding windows. However, the same techniques that we have developed can be applied to other classes of queries as well. One example is monitoring queries that have the same structure as the ones we have studied, except that they have set-valued answers instead of ending with an aggregation operator. In the case of set-valued queries, an approximate answer consists of a random sample of the tuples in the output set. The metric of relative error is not applicable to set-valued queries. Instead, we can measure error as the percentage of tuples from the query answer that are missing in the approximate answer. The goal of the system is to minimize the maximum value of this quantity across all queries, optionally with query weights or priorities. Our algorithm can be made to optimize for this objective by simply setting Ci for each query equal to 1.

Another class of queries that arises in data stream monitoring applications is aggregation queries with “group-bys”. One can view a group-by query as multiple queries, one query for each group. However, all these queries share the entire query path and thus will have the same effective sampling rate. Consequently, the group with maximum relative error will be the one with the maximum Ci value. Since our error metric is the maximum relative error among all groups across queries, within each group-by query, the group with maximum Ci value will be the only group that counts in the design of our solution. Thus, we can treat the group with maximum Ci value as the representative group for that query.

Incorporating Load Shedding Overhead. The results we have presented are based on the assumption that the cost (in terms of processing time) to perform load shedding is small relative to the cost of query operators. In an actual system implementation, even simple query operators like basic selections generally have considerable overhead associated with them. A load shedder, on the other hand, involves little more than a single call to a random number generator and thus can be implemented very efficiently. In empirical tests using the STREAM system, we found that the processing time per tuple for a load shedding operator was only a small fraction of the total processing time per tuple, even for a very simple query.

In some applications, however, the relative cost of load shedding may be larger, to the point where ignoring the overhead of load shedding when deciding on the placement of load shedders leads to inefficiencies. The same basic approach that we have described can be applied in such a context by associating a processing cost per tuple with load shedding operators. In this case, the best placement of load shedders can be found using dynamic programming [1].

2. Load Shedding in Aurora

Similar to STREAM [8], Aurora [3] is a prototype of a data stream management system that has been designed to deal with very large numbers of data streams. The query network in Aurora is a directed acyclic graph (DAG), with sources as data streams and sinks as query output nodes. Internal nodes represent one of seven primitive operators that process tuples, and edges represent queues that feed into these operators. The Aurora query-specification model differs from the one we have described earlier in two important aspects:


The query network allows for binary operators that take input from two queues, e.g., a (windowed) join of streams. Thus, the query network is not necessarily a collection of trees.

Aurora allows users to specify three types of quality of service (QoS) functions that capture the utility of the output to the user: utility as a function either of output latency, or of the percentage loss in tuples, or of the output value of tuples.

A paper by Tatbul et al. [9] discusses load shedding techniques used in the Aurora system. We highlight the similarities and differences between their approach and the one that we have described earlier. The query network structure in both systems is very similar, except for the provision for binary operators in Aurora. This leads to very similar equations for computing the load on the system, taking into account the rates of the input streams, the selectivity of operators, and the time required to process each tuple by different operators. Both approaches use statistics gathered in the recent past to estimate these quantities. In the case of Aurora, the input rate into a binary operator is simply the sum of the input rates of the individual input queues. The load equation is periodically computed to determine whether the system is overloaded and whether we need to shed additional load or reverse any previously-introduced load shedding. The load shedding solutions of both approaches follow the “push load shedding upstream” mantra, by virtue of which load shedders are always placed at the beginning of a shared segment.

The technique that we have described earlier focuses on the class of sliding-window aggregation queries, where the output at any instant is a single numeric value. The aim was to minimize the maximum (weighted) relative error over all queries. In contrast, the Aurora load-shedding paper focuses on set-valued (non-aggregate) queries. One could define different metrics for load shedding in the context of set-valued queries. We have already described one such simple metric, namely the fraction of tuples lost for each query. The provision to specify QoS functions leads to an interesting metric in the context of the Aurora system: minimize the loss in utility due to load shedding. The QoS functions that relate output value and utility let users specify the relative importance of tuples as identified by their attribute values. This leads to a new type of load shedding operator, one that filters and drops tuples based on their value, as opposed to randomly dropping a fixed fraction of tuples. These are referred to as semantic load shedders. The load shedding algorithms in Aurora follow a greedy approach of introducing load shedders into the query plan so as to maximize the gain (amount of load reduced) and minimize the loss in utility as measured by the QoS functions. For every potential location for a load shedder, a loss/gain ratio is computed, which is the ratio of the computing cycles that will be saved for all downstream operators to the loss in utility of all downstream queries, if we drop a fixed fraction of tuples at this location. In the case of semantic load shedders, filters are introduced that first shed the tuples with the least useful values. A plan that introduces drops at different locations, along with the amount of tuples dropped at each, is called a Load Shedding Road Map (LSRM). A set of LSRMs is precomputed based on current statistics, and at run time the system picks the appropriate LSRM based on the current load on the system.

3. Load Shedding for Sliding Window Joins

Queries that involve joins between two or more data streams present an interesting challenge for load shedding because of the complex interactions between load shedding decisions on the streams being joined. Joins between data streams are typically sliding window joins. A sliding window join with window size w introduces an implicit join predicate that restricts the difference between the timestamps of two joining tuples to be at most w. This implicit time-based predicate is in addition to the ordinary join predicate.

Kang, Naughton, and Viglas [7] study load shedding for sliding window join queries with the objective of maximizing the number of output tuples that are produced. They restrict their attention to queries consisting of a single sliding-window join operator and consider the question of how best to allocate resources between the two streams that are involved in a join. Their conclusion is that the maximum rate of output tuple production is achieved when the input rates of the two data streams being joined, adjusted for the effects of load shedding, are equal. In other words, if stream S1 arrives at rate r1 and stream S2 arrives at rate r2, and load shedders are placed on each stream upstream of the join, then the sampling rate of the load shedder on stream Si should be proportional to 1/ri, with the constant of proportionality chosen such that the system is exactly able to keep up with the data arrival rates.
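A back-of-the-envelope sketch of this allocation rule follows, under an assumed cost model in which every tuple admitted to the join costs a constant c seconds of processing; the constant and the arrival rates below are hypothetical illustration values, not figures from the papers being surveyed.

```python
def join_shedder_rates(r1: float, r2: float, c: float) -> tuple:
    """Sampling rates (p1, p2) with p_i proportional to 1/r_i, scaled so that
    c * (p1*r1 + p2*r2) = 1, i.e. the join exactly keeps up with the input.
    The min() cap at 1.0 is a practical guard for lightly loaded streams."""
    p1 = min(1.0, 1.0 / (2.0 * c * r1))
    p2 = min(1.0, 1.0 / (2.0 * c * r2))
    return p1, p2

print(join_shedder_rates(r1=1000.0, r2=4000.0, c=0.0005))   # -> (1.0, 0.25)
```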

The paper by Das, Gehrke and Riedwald [5] also addresses the same problem, namely maximizing the join size in the context of load shedding for queries containing a single sliding window join. Additionally, they introduce a metric called the Archive-metric (ArM), which assumes that any tuples that are load-shed by the system can be archived to allow the exact answer to be computed at a later time, when the load on the system is lower. The ArM metric measures the amount of work that will need to be done at that later time to compute the exact answer. They also introduce new models, inspired by different application scenarios such as sensor networks, where they distinguish between the cases when the system is bounded in terms of its CPU speed and when it is bounded by memory. In the latter case, the goal is to bound the size of the join state, measured in terms of the number of tuples stored for join processing.

The Das et al. paper mainly differs from the Kang et al. paper in that it allows for semantic load shedding as opposed to just random load shedding. The ability to drop tuples based on their join attribute value leads to interesting problems. The one that is the focus of the paper arises from the bounded memory model. In this case, the problem translates to keeping M tuples at all times so as to maximize the join size, assuming that all incoming tuples are processed and joined with the partner tuples from the other stream that are stored at that time as part of the M tuples. In the static case, when the streams are not really streams but relations, they provide an optimal dynamic programming solution for binary joins and show that for an m-relation join the static problem is NP-hard. For the offline case of a join between two streams, where the arrival order of tuples on both streams is assumed to be known, they provide a polynomial-time (though impractical) solution that is based on reducing the problem to a max-flow computation. They also provide two heuristic solutions that can be implemented in a real system.

4. Load Shedding for Classification Queries

Loadstar [4] is a system for executing classification queries over data streams. Data elements arrive on multiple data streams, and the system examines each data item as it arrives and attempts to assign it to one of a finite set of classes using a data mining algorithm. An example would be monitoring images from multiple security cameras and attempting to determine which person (if any) is displayed in each image. If the data arrival rates on the streams are too high for the system to keep up, then the system must discard certain data elements unexamined, but it must nonetheless provide a predicted classification for the discarded elements. The Loadstar system is designed to deal with cases where only a small fraction of the data elements can actually be examined, because examining a data element requires expensive feature extraction steps.

The designers of Loadstar introduce two main ideas that are used for load shedding in this context:

1 A quality of decision metric can be used to quantify the expected degradation in classification accuracy from failing to examine a data item. In general the quality of decision function will be different for different streams. (E.g., examining an image from a security camera in a poorly-lit or low-traffic area may not yield much improvement over always guessing “no person shown”, whereas analyzing images from other cameras may allow them to be classified with high accuracy.)

2 The features used in classification often exhibit a high degree of temporal correlation. Thus, if a data element from a particular stream has been examined in the recent past, it may be a reasonable assumption that future (unexamined) data elements have similar attribute values. As time passes, uncertainty about the attribute values increases.


The load shedding strategy used in Loadstar makes use of these two ideas to decide which data elements should be examined. Loadstar uses a quality of decision metric based on Bayesian decision theory and learns a Markov model for each stream to model the rate of dispersion of attribute values over time. By combining these two factors, the Loadstar system is able to achieve better classification accuracy than the naive approach that sheds an equal fraction of load from each data stream.

5. Summary

It is important for computer systems to be able to adapt to changes in their operating environments. This is particularly true of systems for monitoring continuous data streams, which are often prone to unpredictable changes in data arrival rates and data characteristics. We have described a framework for one type of adaptive data stream processing, namely graceful performance degradation via load shedding in response to excessive system loads. In the context of data stream aggregation queries, we formalized load shedding as an optimization problem with the objective of minimizing query inaccuracy within the limits imposed by resource constraints. Our solution to the load shedding problem uses probabilistic bounds to determine the sensitivity of different queries to load shedding, in order to perform load shedding where it will have minimum adverse impact on the accuracy of query answers. Different query classes have different measures of answer quality, and thus require different techniques for load shedding; we described three additional query classes and summarized load shedding approaches for each.

References

[1] B. Babcock. Processing Continuous Queries over Streaming Data with Limited System Resources. PhD thesis, Stanford University, Department of Computer Science, 2005.

[2] B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation queries over data streams. In Proceedings of the 2004 International Conference on Data Engineering, pages 350–361, March 2004.

[3] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams: a new class of data management applications. In Proc. 28th Intl. Conf. on Very Large Data Bases, August 2002.

[4] Y. Chi, P. S. Yu, H. Wang, and R. R. Muntz. Loadstar: A load shedding scheme for classifying data streams. In Proceedings of the 2005 SIAM International Data Mining Conference, April 2005.

[5] A. Das, J. Gehrke, and M. Riedwald. Approximate join processing over data streams. In Proceedings of the 2003 ACM SIGMOD International Conf. on Management of Data, pages 40–51, 2003.

[6] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, March 1963.

[7] J. Kang, J. F. Naughton, and S. Viglas. Evaluating window joins over unbounded streams. In Proceedings of the 2003 International Conference on Data Engineering, March 2003.

[8] R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data stream management system. In Proc. First Biennial Conf. on Innovative Data Systems Research (CIDR), January 2003.

[9] N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 309–320, September 2003.


Chapter 8

THE SLIDING-WINDOW COMPUTATION MODEL AND RESULTS∗

Mayur Datar
Google, Inc.


Rajeev Motwani
Department of Computer Science
Stanford University


Abstract The sliding-window model of computation is motivated by the assumption that, in certain data-stream processing applications, recent data is more useful and pertinent than older data. In such cases, we would like to answer questions about the data only over the last N most recent data elements (N is a parameter). We formalize this model of computation and answer questions about how much space and computation time is required to solve certain problems under the sliding-window model.

Keywords: sliding-window, exponential histograms, space lower bounds

Sliding-Window Model: Motivation

In this chapter we present some results related to small space computation over sliding windows in the data-stream model. Most research in the data-stream model (e.g., see [1, 10, 15, 11, 13, 14, 19]), including results presented in some of the other chapters, assumes that all data elements seen so far in the stream are equally important and that synopses, statistics or models that are built should reflect the entire data set.

∗Material in this chapter also appears in Data Stream Management: Processing High-Speed Data Streams, edited by Minos Garofalakis, Johannes Gehrke and Rajeev Rastogi, published by Springer-Verlag.


However, for many applications this assumption is not true, particularly those that ascribe more importance to recent data items. One way to discount old data items and only consider recent ones for analysis is the sliding-window model: data elements arrive at every instant; each data element expires after exactly N time steps; and the portion of data that is relevant to gathering statistics or answering queries is the set of the last N elements to arrive. The sliding window refers to the window of active data elements at a given time instant, and the window size refers to N.

0.1 Motivation and Road Map

Our aim is to develop algorithms for maintaining statistics and models that use space sublinear in the window size N. The following example motivates why we may not be ready to tolerate memory usage that is linear in the size of the window. Consider the following network-traffic engineering scenario: a high-speed router working at 40 gigabits per second line speed. For every packet that flows through this router we do a prefix match to check if it originates from the stanford.edu domain. At every instant, we would like to know how many packets, of the last 10^10 packets, belonged to the stanford.edu domain. The above question can be rephrased as the following simple problem:

Problem 0.1 (BasicCounting) Given a stream of data elements, consisting of 0's and 1's, maintain at every time instant the count of the number of 1's in the last N elements.

A data element equals one if it corresponds to a packet from the stanford.edu domain and is zero otherwise. A trivial solution¹ exists for this problem that requires N bits of space. However, in such a scenario as the high-speed router, where on-chip memory is expensive and limited, and particularly when we would like to ask multiple (thousands of) such continuous queries, it is prohibitive to use even N = 10^10 (window size) bits of memory for each query. Unfortunately, it is easy to see that the trivial solution is the best we can do in terms of memory usage, unless we are ready to settle for approximate answers; i.e., an exact solution to BasicCounting requires Θ(N) bits of memory. We will present a solution to the problem that uses no more than O((1/ǫ) log^2 N) bits of memory (i.e., O((1/ǫ) log N) words of memory) and provides an answer at each instant that is accurate within a factor of 1 ± ǫ. Thus, for ǫ = 0.1 (10% accuracy) our solution will use about 300 words of memory for a window size of 10^10.

¹Maintain a FIFO queue and update a counter.
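For concreteness, the trivial exact solution mentioned in the footnote can be sketched as follows. It is only the Θ(N)-space baseline against which the Exponential Histogram is compared, not part of the chapter's algorithmic contribution.

```python
from collections import deque

class ExactWindowCounter:
    """Exact BasicCounting: keep the last N bits in a FIFO queue and a running
    count. Space is Theta(N) bits, which is what the EH technique avoids."""

    def __init__(self, N: int):
        self.window = deque(maxlen=N)
        self.count = 0

    def insert(self, bit: int) -> None:
        if len(self.window) == self.window.maxlen:
            self.count -= self.window[0]     # the element about to expire
        self.window.append(bit)
        self.count += bit

    def query(self) -> int:
        return self.count
```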

Given our concern that derives from working with limited space, it is natural to ask “Is this the best we can do with respect to memory utilization?” We answer this question by demonstrating a matching space lower bound, i.e., we show that any approximation algorithm (deterministic or randomized) for BasicCounting with relative error ǫ must use Ω((1/ǫ) log^2 N) bits of memory. The lower bound proves that the above-mentioned algorithm is optimal, to within constant factors, in terms of memory usage.

Besides maintaining simple statistics like a bit count, as in BasicCounting, there are various applications where we would like to maintain more complex statistics. Consider the following motivating example:

A fundamental operation in database systems is a join between two or more relations. Analogously, one can define a join between multiple streams, which is primarily useful for correlating events across multiple data sources. However, since the input streams are unbounded, producing join results requires unbounded memory. Moreover, in most cases, we are only interested in those join results where the joining tuples exhibit temporal locality. Consequently, in most data-stream applications, a relevant notion of join that is often employed is the sliding-window join, where tuples from each stream only join with tuples that belong to a sliding window over the other stream. The semantics of such a join are clear to the user, and such joins can be processed in a non-blocking manner using limited memory. As a result, sliding-window joins are quite popular in most stream applications.

In order to improve join processing, database systems maintain “join statistics” for the relations participating in the join. Similarly, in order to efficiently process sliding-window joins, we would like to maintain statistics over the sliding windows, for the streams participating in the join. Besides being useful for the exact computation of sliding-window joins, such statistics could also be used to approximate them. Sliding-window join approximations have been studied by Das, Gehrke and Riedwald [6] and Kang, Naughton and Viglas [16]. This further motivates the need to maintain various statistics over sliding windows, using small space and update time.

This chapter presents a general technique, called the Exponential Histogram (EH) technique, that can be used to solve a wide variety of problems in the sliding-window model; typically, problems that require us to maintain statistics. We will showcase this technique through solutions to two problems: the BasicCounting problem above and the Sum problem that we will define shortly. However, our aim is not solely to present solutions to these problems, but rather to explain the EH technique itself, such that the reader can appropriately modify it to solve more complex problems that may arise in various applications. Already, the technique has been applied to various other problems, of which we will present a summary in Section 4.

The road map for this chapter is as follows: after presenting an algorithm for the BasicCounting problem and the associated space lower bound in Sections 1 and 2 respectively, we present a modified version of the algorithm in Section 3 that solves the following generalization of the BasicCounting problem:


Problem 0.2 (Sum) Given a stream of data elements that are positive integers in the range [0 . . . R], maintain at every time instant the sum of the last N elements.

A summary of other results in the sliding-window model is given in Section 4, before concluding in Section 8.1.

1. A Solution to the BasicCounting Problem

It is instructive to observe why naive schemes do not suffice for producing approximate answers with a low memory requirement. For instance, it is natural to consider random sampling as a solution technique for this problem. However, maintaining a uniform random sample of the window elements will result in poor accuracy in the case where the 1's are relatively sparse.

Another approach is to maintain histograms. While the algorithm that we present follows this approach, it is important to note why previously known histogram techniques from databases are not effective for this problem. A histogram technique is characterized by the policy used to maintain the bucket boundaries. We would like to build time-based histograms in which every bucket summarizes a contiguous time interval and stores the number of 1's that arrived in that interval. As with all histogram techniques, when a query is presented we may have to interpolate in some bucket to estimate the answer, because some of the bucket's elements may have expired. Let us consider some schemes of bucketizing and see why they will not work. The first scheme that we consider is that of dividing into k equi-width (width of time interval) buckets. The problem is that the distribution of 1's in the buckets may be nonuniform. We will incur large error when the interpolation takes place in buckets with a majority of the 1's. This observation suggests another scheme where we use buckets of nonuniform width, so as to ensure that each bucket has a near-uniform number of 1's. The problem is that the total number of 1's in the sliding window could change dramatically with time, and current buckets may turn out to have more or less than their fair shares of 1's as the window slides forward. The solution we present is a form of histogram that avoids these problems by using a set of well-structured and nonuniform bucket sizes. It is called the Exponential Histogram (EH) for reasons that will become clear later. Before getting into the details of the solution we introduce some notation.

We follow the conventions illustrated in Figure 8.1. In particular, we assume that new data elements are coming from the right and the elements at the left are ones already seen. Note that each data element has an arrival time which increments by one at each arrival, with the leftmost element considered to have arrived at time 1. In addition, we employ the notion of a timestamp, which corresponds to the position of an active data element in the current window. We timestamp the active data elements from right to left, with the most recent element being at position 1. Clearly, the timestamps change with every new arrival, and we do not wish to make explicit updates. A simple solution is to record the arrival times in a wraparound counter of log N bits; the timestamp can then be extracted by comparison with the counter value of the current arrival. As mentioned earlier, we concentrate on the 1's in the data stream. When we refer to the k-th 1, we mean the k-th most recent 1 encountered in the data stream.
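The wraparound-counter trick can be sketched in a few lines of Python; the modulus 2^b (with 2^b ≥ N) and the example values are assumptions made here for illustration.

```python
# Recover a timestamp (position in the window, 1 = most recent) from arrival
# times stored modulo M = 2**b, where M >= N so differences within the window
# are unambiguous.
def timestamp(stored_arrival: int, current_arrival: int, b: int) -> int:
    M = 1 << b                       # the counter wraps around at 2^b
    return ((current_arrival - stored_arrival) % M) + 1

# e.g. with b = 4 (M = 16): an element stored at counter value 15, queried when
# the counter reads 2 (three arrivals later), has timestamp 4.
assert timestamp(stored_arrival=15, current_arrival=2, b=4) == 4
```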

Figure 8.1. Sliding window model notation

For an illustration of this notation, consider the situation presented in Figure 8.1. The current time instant is 49 and the most recent arrival is a zero. The element with arrival time 48 is the most recent 1 and has timestamp 2, since it is the second most recent arrival in the current window. The element with arrival time 44 is the second most recent 1 and has timestamp 6.

We will maintain histograms for the active 1's in the data stream. For every bucket in the histogram, we keep the timestamp of the most recent 1 (called the timestamp of the bucket) and the number of 1's (called the bucket size). For example, in our figure, a bucket with timestamp 2 and size 2 represents a bucket that contains the two most recent 1's, with timestamps 2 and 6. Note that the timestamp of a bucket increases as new elements arrive. When the timestamp of a bucket expires (reaches N+1), we are no longer interested in the data elements contained in it, so we drop that bucket and reclaim its memory. If a bucket is still active, we are guaranteed that it contains at least a single 1 that has not expired. Thus, at any instant there is at most one bucket (the last bucket) containing 1's that may have expired. At any time instant we may produce an estimate of the number of active 1's as follows. For all but the last bucket, we add up the number of 1's in them. For the last bucket, let C be the count of the number of 1's in that bucket. The actual number of active 1's in this bucket could be anywhere between 1 and C, so we estimate it to be C/2. We obtain the following:

Fact 1.1 The absolute error in our estimate is at most C/2, where C is the size of the last bucket.

Note that, for this approach, the window size does not have to be fixed a priori at N. Given a window size S (S ≤ N), we do the same thing as before, except that the last bucket is the bucket with the largest timestamp less than S.

1.1 The Approximation Scheme

We now define Exponential Histograms and present a technique to maintain them, so as to guarantee count estimates with relative error at most ǫ, for any ǫ > 0. Define k = ⌈1/ǫ⌉, and assume that k/2 is an integer; if k/2 is not an integer we can replace k/2 by ⌈k/2⌉ without affecting the basic results.

As per Fact 1.1, the absolute error in the estimate is C/2, where C is the size of the last bucket. Let the buckets be numbered from right to left, with the most recent bucket being numbered 1. Let m denote the number of buckets and Ci denote the size of the i-th bucket. We know that the true count is at least 1 + ∑_{i=1}^{m−1} Ci, since the last bucket contains at least one unexpired 1 and the remaining buckets contribute exactly their size to the total count. Thus, the relative estimation error is at most (Cm/2)/(1 + ∑_{i=1}^{m−1} Ci). We will ensure that the relative error is at most 1/k by maintaining the following invariant:

Invariant 1.2 At all times, the bucket sizes C1, . . . , Cm are such that for all j ≤ m, we have Cj/(2(1 + ∑_{i=1}^{j−1} Ci)) ≤ 1/k.

Let N′ ≤ N be the number of 1's that are active at any instant. Then the bucket sizes must satisfy ∑_{i=1}^{m} Ci ≥ N′. Our goal is to satisfy this property and Invariant 1.2 with as few buckets as possible. In order to achieve this goal we maintain buckets with exponentially increasing sizes, so as to satisfy the following second invariant.

Invariant 1.3 At all times the bucket sizes are nondecreasing, i.e., C1 ≤ C2 ≤ ... ≤ Cm−1 ≤ Cm. Further, bucket sizes are constrained to the following: 1, 2, 4, . . . , 2^{m′}, for some m′ ≤ m and m′ ≤ log(2N/k) + 1. For every bucket size other than the size of the first and last bucket, there are at most k/2 + 1 and at least k/2 buckets of that size. For the size of the first bucket, which is equal to one, there are at most k + 1 and at least k buckets of that size. There are at most k/2 buckets with size equal to the size of the last bucket.

Let Cj = 2^r (r > 0) be the size of the j-th bucket. If the size of the last bucket is 1, then there is no error in estimation, since there is only one data element in that bucket and we know its timestamp exactly. If Invariant 1.3 is satisfied, then we are guaranteed that there are at least k/2 buckets each of sizes 2, 4, . . . , 2^{r−1} and at least k buckets of size 1, which have indexes less than j. Consequently, Cj < (2/k)(1 + ∑_{i=1}^{j−1} Ci). It follows that if Invariant 1.3 is satisfied then Invariant 1.2 is automatically satisfied, at least with respect to buckets that have sizes greater than 1. If we maintain Invariant 1.3, it is easy to see that to cover all the active 1's, we would require no more than m ≤ (k/2 + 1)(log(2N/k) + 2) buckets. Associated with each bucket is its size and a timestamp. The bucket size takes at most log N values, and hence we can maintain it using log log N bits. Since a timestamp requires log N bits, the total memory requirement of each bucket is log N + log log N bits. Therefore, the total memory requirement (in bits) for an EH is O((1/ǫ) log^2 N). It is implied that by maintaining Invariant 1.3, we are guaranteed the desired relative error and memory bounds.

The query time for the EH can be made O(1) by maintaining two counters, one for the size of the last bucket (Last) and one for the sum of the sizes of all buckets (Total). The estimate itself is Total minus half of Last. Both counters can be updated in O(1) time for every data element. See the box below for a detailed description of the update algorithm.

Algorithm (Insert):

1 When a new data element arrives, calculate the new expiry time. If the timestamp of the last bucket indicates expiry, delete that bucket and update the counter Last containing the size of the last bucket and the counter Total containing the total size of the buckets.

2 If the new data element is 0, ignore it; else, create a new bucket with size 1 and the current timestamp, and increment the counter Total.

3 Traverse the list of buckets in order of increasing sizes. If there are k/2 + 2 buckets of the same size (k + 2 buckets if the bucket size equals 1), merge the oldest two of these buckets into a single bucket of double the size. (A merger of buckets of size 2^r may cause the number of buckets of size 2^{r+1} to exceed k/2 + 1, leading to a cascade of such mergers.) Update the counter Last if the last bucket is the result of a new merger.
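The boxed Insert procedure translates fairly directly into code. Below is a minimal Python sketch under two simplifying assumptions made here: buckets store absolute arrival times rather than the wraparound counter described earlier, and the cascading merge is written as an explicit scan. It is an illustration of the procedure above, not the authors' implementation.

```python
import math

class ExponentialHistogram:
    """EH sketch for BasicCounting. Buckets are kept oldest-first as
    [arrival_time_of_most_recent_1, size] pairs."""

    def __init__(self, N: int, eps: float):
        self.N = N
        self.k = math.ceil(1.0 / eps)
        self.buckets = []          # index 0 = oldest bucket
        self.t = 0                 # current arrival time
        self.total = 0             # counter Total: sum of all bucket sizes
        self.last = 0              # counter Last: size of the oldest bucket

    def insert(self, bit: int) -> None:
        self.t += 1
        # Step 1: expire the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[0][0] <= self.t - self.N:
            self.total -= self.buckets[0][1]
            self.buckets.pop(0)
        # Step 2: a 0 is ignored; a 1 becomes a new bucket of size 1.
        if bit == 1:
            self.buckets.append([self.t, 1])
            self.total += 1
            self._cascade_merges()
        self.last = self.buckets[0][1] if self.buckets else 0

    def _cascade_merges(self) -> None:
        # Step 3: merge the oldest two buckets of any size that occurs too often
        # (more than k+1 times for size 1, more than k/2+1 times otherwise).
        i, size = len(self.buckets) - 1, 1
        while i >= 0:
            j = i
            while j >= 0 and self.buckets[j][1] == size:
                j -= 1
            count = i - j
            limit = self.k + 1 if size == 1 else self.k // 2 + 1
            if count <= limit:
                break
            newer = self.buckets[j + 2]          # the newer of the two oldest
            self.buckets[j + 1][1] += newer[1]   # merged size is double
            self.buckets[j + 1][0] = newer[0]    # keep the newer timestamp
            del self.buckets[j + 2]
            i, size = j + 1, size * 2            # the merged bucket may now overflow

    def estimate(self) -> float:
        """Count estimate: Total minus half of Last, as described above."""
        return self.total - self.last / 2.0 if self.buckets else 0.0
```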

Example 8.1 We illustrate the execution of the algorithm for 10 steps, where at each step the new data element is 1. The numbers indicate the bucket sizes from left to right, and we assume that k/2 = 1.

32, 32, 16, 8, 8, 4, 2, 1, 1

32, 32, 16, 8, 8, 4, 4, 2, 1, 1, 1 (new 1 arrived)

32, 32, 16, 8, 8, 4, 4, 2, 1, 1, 1, 1 (new 1 arrived)

32, 32, 16, 8, 8, 4, 4, 2, 2, 1, 1 (merged the older 1’s)

32, 32, 16, 8, 8, 4, 4, 2, 2, 1, 1, 1 (new 1 arrived)

32, 32, 16, 8, 8, 4, 4, 2, 2, 1, 1, 1, 1 (new 1 arrived)

32, 32, 16, 8, 8, 4, 4, 2, 2, 2, 1, 1 (merged the older 1’s)

32, 32, 16, 8, 8, 4, 4, 4, 2, 1, 1 (merged the older 2’s)


32, 32, 16, 8, 8, 8, 4, 2, 1, 1 (merged the older 4’s)

32, 32, 16, 16, 8, 4, 2, 1, 1 (merged the older 8’s)

Merging two buckets corresponds to creating a new bucket whose size is equal to the sum of the sizes of the two buckets and whose timestamp is the timestamp of the more recent of the two buckets, i.e., the timestamp of the bucket that is to the right. A merger requires O(1) time. Moreover, while cascading may require Θ(log(2N/k)) mergers upon the arrival of a single new element, a simple argument, presented in the next proof, allows us to argue that the amortized cost of mergers is O(1) per new data element. It is easy to see that the above algorithm maintains Invariant 1.3. We obtain the following theorem:

Theorem 1.4 The EH algorithm maintains a data structure that gives an estimate for the BasicCounting problem with relative error at most ǫ using at most (k/2 + 1)(log(2N/k) + 2) buckets, where k = ⌈1/ǫ⌉. The memory requirement is log N + log log N bits per bucket. The arrival of each new element can be processed in O(1) amortized time and O(log N) worst-case time. At each time instant, the data structure provides a count estimate in O(1) time.

Proof: The EH algorithm above, by its very design, maintains Invariant 1.3. As noted earlier, an algorithm that maintains Invariant 1.3 requires no more than (k/2 + 1)(log(2N/k) + 2) buckets to cover all the active 1's. Furthermore, the invariant also guarantees that our estimation procedure has a relative error of no more than 1/k ≤ ǫ.

Each bucket maintains a timestamp and the size for that bucket. Since we maintain timestamps using wraparound arrival times, they require no more than log N bits of memory. As per Invariant 1.3, bucket sizes can take only one of the log(2N/k) + 1 unique values, and can be represented using log log N bits. Thus, the total memory requirement of each bucket is no more than log N + log log N bits.

On the arrival of a new element, we may perform a cascading merge of buckets, which takes time proportional to the number of buckets. Since there are O(log N) buckets, this gives a worst-case update time of O(log N). Whenever two buckets are merged, the size of the merged bucket is double the size of those that are merged. The cost of the merging can be amortized among all the 1's that fall in the merged bucket. Thus, an element that belongs to a bucket of size 2^p pays an amortized cost of 1 + 1/2 + 1/4 + ... + 1/2^p ≤ 2. This is because, whenever it gets charged, the size of the bucket it belongs to doubles and consequently the charge it incurs halves. Thus, we get that the amortized cost of merging buckets is O(1) per new element, in fact O(1) per new element that has value 1.


We maintain counters Total and Last, which can be updated in O(1) time for each new element, and which enable us to give a count estimate in O(1) time whenever a query is asked.

If, instead of maintaining a timestamp for every bucket, we maintain a timestamp for the most recent bucket and the differences between the timestamps of successive buckets, then we can reduce the total memory requirement to O(k log^2(N/k)).

2. Space Lower Bound for BasicCounting Problem

We provide a lower bound which verifies that the algorithm is optimal in its memory requirement. We start with a deterministic lower bound of Ω(k log^2(N/k)). We omit proofs for lack of space, and refer the reader to [8].

Theorem 2.1 Any deterministic algorithm that provides an estimate for the BasicCounting problem at every time instant with relative error less than 1/k for some integer k ≤ 4√N requires at least (k/16) log^2(N/k) bits of memory.

The proof argument goes as follows: at any time instant, the space used by any algorithm serves to summarize the contents of the current active window. For a window of size N, we can show that there are a large number of possible input instances, i.e., arrangements of 0's and 1's, such that any deterministic algorithm which provides estimates with small relative error (i.e., less than 1/k) has to differentiate between every pair of these arrangements. The number of memory bits required by such an algorithm must therefore exceed the logarithm of the number of arrangements. The above argument is formalized by the following lemma.

Lemma 2.2 For k/4 ≤ B ≤ N, there exist L = (B choose k/4)^{⌊log(N/B)⌋} arrangements of 0's and 1's of length N such that any deterministic algorithm for BasicCounting with relative error less than 1/k must differentiate between any two of the arrangements.

To prove Theorem 2.1, observe that if we choose B = √(Nk) in the lemma above, then log L ≥ (k/16) log^2(N/k). While the lower bound above is for a deterministic algorithm, a standard technique for establishing lower bounds for randomized algorithms, called the minimax principle [18], lets us extend this lower bound on the space complexity to randomized algorithms.

As a reminder, a Las Vegas algorithm is a randomized algorithm that always produces the correct answer, although the running time or space requirement of the algorithm may vary with the different random choices that the algorithm makes. On the other hand, a Monte Carlo algorithm is a randomized algorithm that sometimes produces an incorrect solution. We obtain the following lower bounds for these two classes of algorithms.

Theorem 2.3 Any randomized Las Vegas algorithm for BasicCounting with relative error less than 1/k, for some integer k ≤ 4√N, requires an expected memory of at least (k/16) log^2(N/k) bits.

Theorem 2.4 Any randomized Monte Carlo algorithm for the BasicCounting problem with relative error less than 1/k, for some integer k ≤ 4√N, with probability at least 1 − δ (for δ < 1/2), requires an expected memory of at least (k/32) log^2(N/k) + (1/2) log(1 − 2δ) bits.

3. Beyond 0's and 1's

The BasicCounting problem, discussed in the last two sections, is one of the basic operations that one can define over sliding windows. While the problem in its original form has various applications, it is natural to ask, “What are the other problems, in the sliding-window model, that can be solved using small space and small update time?” For instance, instead of the data elements being binary values, namely 0 and 1, what if they were positive integers in the range [0 . . . R]? Could we efficiently maintain the sum of these numbers in the sliding-window model? We have already defined this problem, in Section 8, as the Sum problem.

We will now present a modification of the algorithm from Section 1 that solves the Sum problem. In doing so, we intend to highlight the characteristic elements of the solution technique, so that readers may find it easy to adapt the technique to other problems. Already, the underlying technique has been successfully applied to many problems, some of which will be listed in the following section.

One way to solve the Sum problem would be to maintain separately a sliding-window sum for each of the log R bit positions using an EH from Section 1.1. As before, let k = ⌈1/ǫ⌉. The memory requirement for this approach would be O(k log^2 N log R) bits. We will present a more direct approach that uses less memory. In the process we demonstrate how the EH technique introduced in Section 1 can be generalized to solve a bigger class of problems.

Typically, a problem in the sliding-window model requires us to maintain a function f defined over the elements in the sliding window. Let f(B) denote the function value restricted to the elements in a bucket B. For example, in the case of the Sum problem, the function f equals the sum of the positive integers that fall inside the sliding window. In the case of BasicCounting, the function f is simply the number of 1's that fall inside the sliding window. We note the following central ingredients of the EH technique from Section 1 and adapt them for the Sum problem:


1 Size of a Bucket: The size of each bucket is defined as the value of the function f that we are estimating (over the sliding window), restricted to that bucket B, i.e., f(B). In the earlier case, size was simply the count of 1's falling inside the bucket. For Sum, we define size analogously as the sum of the integers falling inside the bucket.

2 Procedure to Merge Buckets or Combination Rule: Whenever the algorithm decides to merge two adjacent buckets, a new bucket is created with timestamp equal to that of the newer bucket, i.e., the bucket to the right. The size of this new bucket is computed using the sizes of the individual buckets (i.e., using f(B) for the buckets that are merged) and any additional information that may be stored with the buckets.⁶ Clearly, for the problem of maintaining the sum of data elements, which are either 0's and 1's or positive integers, no additional information is required. By definition, the size of the new merged bucket is simply the sum of the sizes of the buckets being merged.

3 Estimation: Whenever a query is asked, we need to estimate the answer at that moment based on the sizes of all the buckets and any additional information that we may have kept. In order to estimate the answer, we may be required to “interpolate” over the last bucket, which is part inside and part outside the sliding window, i.e., the “straddling” bucket.

Typically, this is done by computing the function value f over all buckets other than the last bucket. In order to do this, we use the same procedure as in the Merge step. To this value we may add the interpolated value of the function f from the last bucket.

Again, for the problem of maintaining the sum of positive integers this task is relatively straightforward. We simply add up the sizes of all the buckets that are completely inside the sliding window. To this we add the “interpolated” value from the last bucket, which is simply half the size of the last bucket.

4 Deleting the Oldest Bucket: In order to reclaim memory, the algorithm deletes the oldest bucket when its timestamp reaches N + 1. This step is the same irrespective of the function f we are estimating.

The technique differs for different problems in the particulars of how the steps above are executed and in the rules for when to merge old buckets and create new ones as new data elements get inserted. The goal is to maintain as few buckets as possible, i.e., to merge buckets whenever possible, while at the same time making sure that the error due to the estimation procedure, which interpolates for the last bucket, is bounded. Typically, this goal is achieved by maintaining that the bucket sizes grow exponentially from right to left (new to old), hence the name Exponential Histogram (EH). It is shown in [8] that the technique can be used to estimate a general class of functions f, called weakly-additive functions, over sliding windows. In the following section, we list different problems over sliding windows that can be solved using the EH technique.

⁶There are problems for which just knowing the sizes of the buckets that are merged is not sufficient to compute the size of the new merged bucket. For example, if the function f is the variance of the numbers, then in addition to knowing the variance of the buckets that are merged, we also need to know the number of elements in each bucket and the mean value of the elements in each bucket, in order to compute the variance of the merged bucket. See [4] for details.

Figure 8.2. An illustration of an Exponential Histogram (EH).

We need some more notation to demonstrate the EH technique for the Sum problem. Let the buckets in the histogram be numbered B1, B2, . . . , Bm, starting from the most recent (B1) to the oldest (Bm); further, let t1, t2, . . . , tm denote the bucket timestamps. See Figure 8.2 for an illustration. In addition to the buckets maintained by the algorithm, we define another set of suffix buckets, denoted B1∗, . . . , Bj∗, that represent suffixes of the data stream. Bucket Bi∗ represents all elements in the data stream that arrived after the elements of bucket Bi, that is, Bi∗ = ∪_{l=1}^{i−1} Bl. We do not explicitly maintain the suffix buckets. Let Si denote the size of bucket Bi. Similarly, let Si∗ denote the size of the suffix bucket Bi∗. Note that, for the Sum problem, Si∗ = ∑_{l=1}^{i−1} Sl. Let Bi,i−1 denote the bucket that would be formed by merging buckets i and i − 1, and let Si,i−1 (Si,i−1 = Si + Si−1) denote the size of this bucket. We maintain the following two invariants that guarantee a small relative error ǫ in estimation and a small number of buckets:

Invariant 3.1 For every bucket Bi, (1/(2ǫ)) Si ≤ Si∗.

Invariant 3.2 For each i > 1, for every bucket Bi, (1/(2ǫ)) Si,i−1 > S(i−1)∗.

It follows from Invariant 3.1 that the relative error in estimation is no more than Sm/(2Sm∗) ≤ ǫ. Invariant 3.2 guarantees that the number of buckets is no more than O((1/ǫ)(log N + log R)). It is easy to see the proof of this claim. Since Si∗ = ∑_{l=1}^{i−1} Sl, i.e., the bucket sizes are additive, after every ⌈1/ǫ⌉ buckets (rather, ⌈1/(2ǫ)⌉ pairs of buckets) the value of Si∗ doubles. As a result, after O((1/ǫ)(log N + log R)) buckets the value of Si∗ exceeds NR, which is the maximum value that can be achieved by Si∗. We now present a simple insert algorithm that maintains the two invariants above as new elements arrive.

Algorithm (Insert): xt denotes the most recent element.

1 If xt = 0, then do nothing. Otherwise, create a new bucket for xt. The new bucket becomes B1 with S1 = xt. Every old bucket Bi becomes Bi+1.

2 If the oldest bucket Bm has timestamp greater than N, delete the bucket. Bucket Bm−1 becomes the new oldest bucket.

3 Merge step: Let k = 1/(2ǫ). While there exists an index i > 2 such that kSi,i−1 ≤ S(i−1)∗, find the smallest such i and combine buckets Bi and Bi−1 using the combination rule described earlier. Note that the Si∗ values can be computed incrementally by adding Si−1 and S(i−1)∗ as we make the sweep.
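Under the same simplifying assumptions as the BasicCounting sketch earlier (absolute arrival times instead of wraparound timestamps), the Sum version of the insert algorithm can be rendered as below. The incremental suffix-sum computation and the amortization trick discussed next are omitted; the direct recomputation of the suffix sums is asymptotically slower but easier to read. This is an illustrative sketch, not the procedure from [8].

```python
import math

class SumEH:
    """EH sketch for the Sum problem. Buckets are newest-first [timestamp, size]
    pairs, where size is the sum of the integers that fell into the bucket."""

    def __init__(self, N: int, eps: float):
        self.N = N
        self.k = math.ceil(1.0 / (2.0 * eps))   # the constant 1/(2*eps) of the merge rule
        self.buckets = []                        # index 0 = most recent bucket B1
        self.t = 0

    def insert(self, x: int) -> None:
        self.t += 1
        if x > 0:                                # step 1: zeros are ignored
            self.buckets.insert(0, [self.t, x])
        # step 2: drop the oldest bucket once its timestamp falls outside the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        # step 3: merge sweep; S_{(i-1)*} is the sum of sizes newer than B_{i-1}
        while True:
            sizes = [b[1] for b in self.buckets]
            for j in range(2, len(self.buckets)):           # 0-based j, so i = j+1 > 2
                if self.k * (sizes[j] + sizes[j - 1]) <= sum(sizes[:j - 1]):
                    self.buckets[j - 1][1] += self.buckets[j][1]   # merged bucket keeps
                    del self.buckets[j]                            # the newer timestamp
                    break
            else:
                return

    def estimate(self) -> float:
        """Sum of all full buckets plus half of the straddling (oldest) bucket."""
        if not self.buckets:
            return 0.0
        return sum(b[1] for b in self.buckets) - self.buckets[-1][1] / 2.0
```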

Note that Invariant 3.1 holds for buckets that have been formed as a result of the merging of two or more buckets, because the merging condition assures that it holds for the merged bucket. The addition of new elements in the future does not violate the invariant, since the right-hand side of the invariant can only increase as new elements are added. However, the invariant may not hold for a bucket that contains a singleton nonzero element and was never merged. The fact that the invariant does not hold for such a bucket does not affect the error bound for the estimation procedure because, if such a bucket were to become the last bucket, we know the exact timestamp of the only nonzero element in the bucket. As a result there is no interpolation error in that case.

Analogously to the variables Total and Last in Section 1.1, we can maintain $S_m + S_{m^*}$ and $S_m$, which enable us to answer queries in $O(1)$ time. The insertion algorithm requires $O(\frac{1}{\epsilon}(\log N + \log R))$ time per new element. Most of the time is spent in Step 3, where we make the sweep to combine buckets. This time is proportional to the number of buckets, i.e., $O(\frac{1}{\epsilon}(\log N + \log R))$. A simple trick, which skips Step 3 until we have seen $\Theta(\frac{1}{\epsilon}(\log N + \log R))$ data points, ensures that the running time of the algorithm is amortized $O(1)$. While we may violate Invariant 3.2 temporarily, we restore it after seeing $\Theta(\frac{1}{\epsilon}(\log N + \log R))$ data points by executing Step 3, which ensures that the number of buckets is $O(\frac{1}{\epsilon}(\log N + \log R))$. The space requirement for each bucket (the memory needed to maintain the timestamp and the size) is $\log N + \log R$ bits. If we assume that a word is at least $\log N + \log R$ bits long, equivalently the size required to count up to $NR$, which is the maximum value of the answer, we get that the total memory requirement is $O(\frac{1}{\epsilon}(\log N + \log R))$ words or $O(\frac{1}{\epsilon}(\log N + \log R)^2)$ bits.

Please refer to [8] for a more complex procedure that has similar time requirements and a space requirement of $O(\frac{1}{\epsilon}\log N(\log N + \log R))$ bits. To summarize, we get the following theorem:

Theorem 3.3 The sum of positive integers in the range $[0 \ldots R]$ can be estimated over sliding windows with relative error at most $\epsilon$ using $O(\frac{1}{\epsilon}(\log N + \log R))$ words of memory. The time to update the underlying EH is worst case $O(\frac{1}{\epsilon}(\log N + \log R))$ and amortized $O(1)$.

Similar to the space lower bound that we presented in Section 2, one can show a space lower bound of $\Omega(\frac{1}{\epsilon}(\log N + \log R)(\log N))$ bits for the Sum problem. See [8] for details. This is asymptotically equal to the upper bound for the algorithm in [8] that we mentioned earlier.

It is natural to ask the question: what happens if we do not restrict data elements to positive integers and are interested in estimating the sum over sliding windows? We show that, even if we restrict the set of unique data elements to $\{1, 0, -1\}$, approximating the sum within a constant factor requires $\Omega(N)$ bits of memory. Moreover, it is easy to maintain the sum by storing the last $N$ integers, which requires $O(N)$ bits of memory. We assume that the storage required for every integer is a constant, independent of the window size $N$. With this assumption, we have that the complexity of the problem in the general case (allowing positive and negative integers) is $\Theta(N)$.

We now argue the lower bound of $\Omega(N)$. Consider an algorithm $A$ that provides a constant-factor approximation to the problem of maintaining the general sum. Given a bit vector of size $N/2$, we present the algorithm $A$ with the pair $(-1, 1)$ for every $1$ in the bit vector and the pair $(1, -1)$ for every $0$. Consider the state (time instance) after we have presented all the $N/2$ pairs to the algorithm. We claim that we can completely recover the original bit vector by presenting a sequence of $0$'s henceforth and querying the algorithm at every odd time instance. If the current time instance is $T$ (after having presented the $N/2$ pairs), then it is easy to see that the correct answer at time instance $T + 2i - 1$ ($1 \le i \le N/2$) is $1$ iff the $i$th bit was $1$, and $-1$ iff the $i$th bit was $0$. Since the algorithm $A$ gives a constant-factor approximation, its estimate would be positive if the correct answer is $1$ and negative if the correct answer is $-1$. Since the state of the algorithm after feeding the $N/2$ pairs enables us to recover the bit vector exactly, for any arbitrary bit vector, it must be using at least $N/2$ bits of memory to encode it. This proves the lower bound. We can state the following theorem:

Theorem 3.4 The space complexity of any algorithm that gives a constant-factor approximation, at every instant, to the problem of maintaining the sum of the last $N$ integers (positive or negative) that appear as a stream of data elements is $\Theta(N)$.


4. References and Related Work

The EH technique, which we demonstrated through solutions to the BasicCounting and Sum problems, is by Datar, Gionis, Indyk and Motwani [8]. The space lower bounds presented above are also from that paper. In the same paper, the authors characterize a general class of weakly additive functions that can be efficiently estimated over sliding windows using the EH technique. Also see Datar's PhD thesis [7] for more details.

As we have seen in other chapters of this book, it is often the case that input data streams are best visualized as a high dimensional vector. A standard operation is to compute the $l_p$ norm, for $0 < p \le 2$, of these vectors, or the $l_p$ norm of the difference between two vectors. In an earlier chapter, we have seen sketching techniques to estimate these $l_p$ norms using small space. It turns out that, when each data element in the data stream represents an increment to some dimension of the underlying high dimensional vector, the $l_p$ norm of a vector belongs to the class of weakly additive functions mentioned above. Consequently, for the restricted case when the increments are positive, the EH technique, in conjunction with the sketching technique, can be adapted to estimate $l_p$ norms over sliding windows. See [8, 7] for details.

Babcock, Datar, Motwani and O'Callaghan [4] showed that the variance of real numbers with maximum absolute value $R$ can be estimated over sliding windows with relative error at most $\epsilon$ using $O(\frac{1}{\epsilon^2}(\log N + \log R))$ words of memory. The update time for the data structure is worst case $O(\frac{1}{\epsilon^2}(\log N + \log R))$ and amortized $O(1)$. In the same paper, the authors look at the problem of maintaining a $k$-medians clustering of points over a sliding window. They present an algorithm that uses $O(\frac{k}{\tau^4} N^{2\tau} \log^2 N)$ memory (the space required to hold a single data point, which in this case is a point from some metric space, is assumed to be $O(1)$ words) and presents $k$ centers for which the objective function value is within a constant factor ($2^{O(1/\tau)}$) of optimal, where $\tau < 1/2$ is a parameter that captures the trade-off between the space bound and the approximation ratio. The update time for the data structure is worst case $O(\frac{k^2}{\tau^3} N^{2\tau})$ and amortized $O(k)$. Both these algorithms are adaptations of the EH technique presented in Section 3 above.

In this chapter, we have focused on the sliding-window model that takes the pertinent data set to be the last $N$ data elements, i.e., we focus on a sequence-based sliding-window model. In other words, we assumed that data items arrive at regular time intervals and that the arrival time increases by one with every new data item. Such regularity in the arrival of data items is seldom true for most real-life applications, for which arrival rates may be bursty. Often, we would like to define the sliding window based on real time. It is easy to adapt the EH technique to such a time-based sliding-window model. See [8, 7] for details.

One may argue that the sliding-window model is not the right model to discount old data, or at least not the only model. If our aim is to assign a smaller weight to older elements so that they contribute less to any statistics or models we maintain, we may want to consider monotonically decreasing functions (time-decay functions) for assigning weights to elements, other than the step function (1 for the last $N$ elements and 0 beyond) that is implicit in the sliding-window model. A natural decay function is the exponentially decreasing weight function that was considered by Gilbert et al. [12] in maintaining aged aggregates: for a data stream $\ldots, x(-2), x(-1), x(0)$, where $x(0)$ is the most recently seen data element, the $\lambda$-aging aggregate is defined as $\lambda x(0) + \lambda(1-\lambda) x(-1) + \lambda(1-\lambda)^2 x(-2) + \ldots$. Exponentially decayed statistics as above are easy to maintain, although one may argue that exponential decay of weights is not suited to all applications or is too restrictive. We may desire a richer class of decay functions, e.g., polynomially decaying weight functions instead of exponential decay. Cohen and Strauss [5] show how to maintain statistics efficiently for a general class of time-decaying functions. Their solutions use the EH technique as a building block or subroutine, thereby demonstrating the applicability of the EH technique to a wider class of models that allow for time decay, beyond the sliding-window model that we have considered.
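Because each older element's weight in the $\lambda$-aging aggregate is a fixed factor $(1-\lambda)$ smaller than that of the next newer one, the aggregate can be maintained with a single multiply-and-add per arrival. The Python sketch below is our illustration of this fact (class and method names are ours, not from [12]).

```python
class LambdaAgingAggregate:
    """Maintains lambda*x(0) + lambda*(1-lambda)*x(-1) + lambda*(1-lambda)^2*x(-2) + ...
    incrementally, with O(1) work per arriving element."""

    def __init__(self, lam):
        self.lam = lam
        self.value = 0.0

    def insert(self, x):
        # Every existing term picks up one more factor of (1 - lambda),
        # and the new element enters with weight lambda.
        self.value = (1.0 - self.lam) * self.value + self.lam * x
        return self.value

# Example: the aggregate tracks a smoothed, recency-weighted version of the stream.
agg = LambdaAgingAggregate(lam=0.1)
for x in [5, 7, 6, 100, 6, 5]:
    print(agg.insert(x))
```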

See [7] for solutions to other problems in the sliding-window model that do not rely on the EH technique. These problems include maintaining a uniform random sample (see also [3]), maintaining the min/max of real numbers, estimating the ratio of rare elements (an element is termed rare if it occurs only once, or a small number of times, in the sliding window) to the number of distinct elements (see also [9]), and estimating the similarity between two data streams, measured according to the Jaccard coefficient for set similarity between two sets $A, B$: $|A \cap B| / |A \cup B|$ (see also [9]).

Maintaining approximate counts of high-frequency elements and maintaining approximate quantiles are important problems that have been studied in database research as maintaining end-biased histograms and maintaining equi-depth histograms, respectively. These problems are particularly useful for sliding-window join processing; they provide the necessary join statistics and can also be used for approximate computation of joins. Solutions to these problems in the sliding-window model are presented by Arasu and Manku [2] and Lin et al. [17].

5. Conclusion

In this chapter we have studied algorithms for two simple problems, BasicCounting and Sum, in the sliding-window model; a natural model to discount



Problem | Space requirement (in words) | Space lower bound (in words) | Amortized update time | Worst case update time (when available)
BasicCounting | $O(\frac{1}{\epsilon}\log N)$ | $\Omega(\frac{1}{\epsilon}\log N)$ | $O(1)$ | $\Omega(\frac{1}{\epsilon}\log N)$
Sum | $O(\frac{1}{\epsilon}\log N)$ | $\Omega(\frac{1}{\epsilon}\log N)$ | $O(1)$ | $\Omega(\frac{1}{\epsilon}(\log N + \log R))$
Variance | $O(\frac{1}{\epsilon^2}(\log N + \log R))$ | $\Omega(\frac{1}{\epsilon}\log N)$ | $O(1)$ | $O(\frac{1}{\epsilon^2}(\log N + \log R))$
$l_p$ norm sketches | $O(\log N)$ | - | $O(1)$ | $O(\log N)$
$k$-median ($2^{O(1/\tau)}$-approx.) | $O(\frac{k}{\tau^4} N^{2\tau} \log^2 N)$ | $\Omega(\frac{1}{\epsilon}\log N)$ | $O(k)$ | $O(\frac{k^2}{\tau^3} N^{2\tau})$
Min/Max | $O(N)$ | $\Omega(N)$ | $O(\log N)$ | $O(\log N)$
Similarity | $O(\log N)$ | - | $O(\log\log N)$ (w.h.p.) | $O(\log\log N)$ (w.h.p.)
Rarity | $O(\log N)$ | - | $O(\log\log N)$ (w.h.p.) | $O(\log\log N)$ (w.h.p.)
Approx. counts | $O(\frac{1}{\epsilon}\log^2 \frac{1}{\epsilon})$ | $\Omega(1/\epsilon)$ | $O(\log(1/\epsilon))$ | $O(1/\epsilon)$
Quantiles | $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon}\log N)$ | $\Omega(\frac{1}{\epsilon}\log(\epsilon N/\log(1/\epsilon)))$ | $O(\log(1/\epsilon)\log(\epsilon N/\log(1/\epsilon)))$ | $O(\frac{1}{\epsilon}\log(1/\epsilon))$

Table 8.1. Summary of results for the sliding-window model.


stale data, considering only the last $N$ elements as pertinent. Our aim was to showcase the Exponential Histogram (EH) technique that has been used to efficiently solve various problems over sliding windows. We also presented space lower bounds for the two problems above. See Table 8.1 for a summary of results in the sliding-window model. Note that for this summary we measure memory in words, where a word is assumed large enough to hold the answer or one unit of the answer. For example, in the case of BasicCounting a word is assumed to be $\log N$ bits long, for Sum a word is assumed to be $\log N + \log R$ bits long, for $l_p$ norm sketches we assume that a sketch fits in a word, for clustering a word is assumed large enough to hold a single point from the metric space, and so on. Similarly, we assume it is possible to do a single word operation in one unit of time while measuring time requirements.

References

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. of the 1996 Annual ACM Symp. on Theory of Computing, pages 20–29, 1996.

[2] A. Arasu and G. Manku. Approximate counts and quantiles over sliding windows. In Proc. of the 2004 ACM Symp. on Principles of Database Systems, pages 286–296, June 2004.

[3] B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In Proc. of the 2002 Annual ACM-SIAM Symp. on Discrete Algorithms, pages 633–634, 2002.

[4] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. Maintaining variance and k-medians over data stream windows. In Proc. of the 2003 ACM Symp. on Principles of Database Systems, pages 234–243, June 2003.

[5] E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. In Proc. of the 2003 ACM Symp. on Principles of Database Systems, pages 223–233, June 2003.

[6] A. Das, J. Gehrke, and M. Riedwald. Approximate join processing over data streams. In Proc. of the 2003 ACM SIGMOD Intl. Conf. on Management of Data, pages 40–51, 2003.

[7] M. Datar. Algorithms for Data Stream Systems. PhD thesis, Stanford University, Stanford, CA, USA, December 2003.

[8] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. SIAM Journal on Computing, 31(6):1794–1813, 2002.

[9] M. Datar and S. Muthukrishnan. Estimating rarity and similarity over data stream windows. In Proc. of the 2002 Annual European Symp. on Algorithms, pages 323–334, September 2002.

[10] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference algorithm for massive data streams. In Proc. of the 1999 Annual IEEE Symp. on Foundations of Computer Science, pages 501–511, 1999.

[11] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In Proc. of the 2002 Annual ACM Symp. on Theory of Computing, 2002.

[12] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proc. of the 2001 Intl. Conf. on Very Large Data Bases, pages 79–88, 2001.

[13] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 58–66, 2001.

[14] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proc. of the 2000 Annual IEEE Symp. on Foundations of Computer Science, pages 359–366, November 2000.

[15] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. of the 2000 Annual IEEE Symp. on Foundations of Computer Science, pages 189–197, 2000.

[16] J. Kang, J. F. Naughton, and S. Viglas. Evaluating window joins over unbounded streams. In Proc. of the 2003 Intl. Conf. on Data Engineering, March 2003.

[17] X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously maintaining quantile summaries of the most recent n elements over a data stream. In Proc. of the 2004 Intl. Conf. on Data Engineering, March 2004.

[18] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[19] J. S. Vitter. Random sampling with a reservoir. ACM Trans. on Mathematical Software, 11(1):37–57, 1985.


Chapter 9

A SURVEY OF SYNOPSIS CONSTRUCTION IN DATA STREAMS

Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, NY 10532

[email protected]

Philip S. Yu
IBM T. J. Watson Research Center
Hawthorne, NY 10532

[email protected]

Abstract
The large volume of data streams poses unique space and time constraints on the computation process. Many query processing, database operations, and mining algorithms require efficient execution, which can be difficult to achieve with a fast data stream. In many cases, it may be acceptable to generate approximate solutions for such problems. In recent years a number of synopsis structures have been developed, which can be used in conjunction with a variety of mining and query processing techniques in data stream processing. Some key synopsis methods include sampling, wavelets, sketches and histograms. In this chapter, we will provide a survey of the key synopsis techniques, and the mining techniques supported by such methods. We will discuss the challenges and tradeoffs associated with using different kinds of techniques, and the important research directions for synopsis construction.

1. Introduction

Data streams pose a unique challenge to many database and data mining applications because of the computational and storage costs associated with the large volume of the data stream. In many cases, synopsis data structures


and statistics can be constructed from streams, which are useful for a variety of applications. Some examples of such applications are as follows:

Approximate Query Estimation: The problem of query estimation is possibly the most widely used application of synopsis structures [11]. The problem is particularly important from an efficiency point of view, since queries usually have to be resolved in online time. Therefore, most synopsis methods such as sampling, histograms, wavelets and sketches are usually designed to be able to solve the query estimation problem.

Approximate Join Estimation: The efficient estimation of join size is a particularly challenging problem in streams when the domain of the join attributes is particularly large. Many methods [5, 26, 27] have recently been designed for efficient join estimation over data streams.

Computing Aggregates: In many data stream computation problems, it may be desirable to compute aggregate statistics [40] over data streams. Some applications include estimation of frequency counts, quantiles, and heavy hitters [13, 18, 72, 76]. A variety of synopsis structures such as sketches or histograms can be useful for such cases.

Data Mining Applications: A variety of data mining applications such as change detection do not require the use of the individual data points, but only require a temporal synopsis which provides an overview of the behavior of the stream. Methods such as clustering [1] and sketches [88] can be used for effective change detection in data streams. Similarly, many classification methods [2] can be used on a supervised synopsis of the stream.

The design and choice of a particular synopsis method depends on the problem being solved with it. Therefore, the synopsis needs to be constructed in a way which is friendly to the needs of the particular problem being solved. For example, a synopsis structure used for query estimation is likely to be very different from a synopsis structure used for data mining problems such as change detection and classification. In general, we would like to construct the synopsis structure in such a way that it has wide applicability across broad classes of problems. In addition, the applicability to data streams makes the efficiency issues of space and construction time critical. In particular, the desiderata for effective synopsis construction are as follows:

Broad Applicability: Since synopsis structures are used for a variety of data mining applications, it is desirable for them to have as broad an applicability as possible. This is because one may desire to use the underlying data stream for as many different applications as possible. If synopsis construction methods have narrow applicability, then a different structure


will need to be computed for each application. This will reduce the time and space efficiency of synopsis construction.

One Pass Constraint: Since data streams typically contain a large number of points, the contents of the stream cannot be examined more than once during the course of computation. Therefore, all synopsis construction algorithms are designed under a one-pass constraint.

Time and Space Efficiency: In many traditional synopsis methods on static data sets (such as histograms), the underlying dynamic programming methodologies require super-linear space and time. This is not acceptable for a data stream. For the case of space efficiency, it is not desirable to have a complexity which is more than linear in the size of the stream. In fact, in some methods such as sketches [44], the space complexity is often designed to be logarithmic in the domain-size of the stream.

Robustness: The error metric of a synopsis structure needs to be designed in a robust way according to the needs of the underlying application. For example, it has often been observed that some wavelet based methods for approximate query processing may be optimal from a global perspective, but may provide very large error on some of the points in the stream [65]. This is an issue which needs the design of robust metrics such as the maximum error metric for stream based wavelet computation.

Evolution Sensitive: Data streams rarely show stable distributions, but rapidly evolve over time. Synopsis methods for static data sets are often not designed to deal with the rapid evolution of a data stream. For this purpose, methods such as clustering [1] are used for the purpose of synopsis driven applications such as classification [2]. Carefully designed synopsis structures can also be used for forecasting futuristic queries [3], with the use of evolution-sensitive synopses.

There are a variety of techniques which can be used for synopsis construction in data streams. We summarize these methods below:

Sampling methods: Sampling methods are among the most simple methods for synopsis construction in data streams. It is also relatively easy to use these synopses with a wide variety of applications, since their representation is not specialized and uses the same multi-dimensional representation as the original data points. In particular, reservoir based sampling methods [92] are very useful for data streams.

Histograms: Histogram based methods are widely used for static data sets. However, most traditional algorithms on static data sets require


super-linear time and space. This is because of the use of dynamic programming techniques for optimal histogram construction. Their extension to the data stream case is a challenging task. A number of recent techniques [37] discuss the design of histograms for the dynamic case.

Wavelets: Wavelets have traditionally been used in a variety of image and query processing applications. In this chapter, we will discuss the issues and challenges involved in dynamic wavelet construction. In particular, the dynamic maintenance of the dominant coefficients of the wavelet representation requires some novel algorithmic techniques.

Sketches: Sketch-based methods derive their inspiration from wavelet techniques. In fact, sketch based methods can be considered a randomized version of wavelet techniques, and are among the most space-efficient of all methods. However, because of the difficulty of intuitive interpretation of sketch based representations, they are sometimes difficult to apply to arbitrary applications. In particular, the generalization of sketch methods to the multi-dimensional case is still an open problem.

Micro-cluster based summarization: A recent micro-clustering method [1] can be used to perform synopsis construction of data streams. The advantage of micro-cluster summarization is that it is applicable to the multi-dimensional case, and adjusts well to the evolution of the underlying data stream. While the empirical effectiveness of the method is quite good, its heuristic nature makes it difficult to find good theoretical bounds on its effectiveness. Since this method is discussed in detail in another chapter of this book, we will not elaborate on it further.

In this chapter, we will provide an overview of the different methods for synopsis construction, and their application to a variety of data mining and database problems. This chapter is organized as follows. In the next section, we will discuss the sampling method and its application to different kinds of data mining problems. In section 3, we will discuss the technique of wavelets for data approximation. In section 4, we will discuss the technique of sketches for data stream approximation. The method of histograms is discussed in section 4. Section 5 discusses the conclusions and challenges in effective data stream summarization.

2. Sampling Methods

Sampling is a popular tool used for many applications, and has several advantages from an application perspective. One advantage is that sampling is easy and efficient, and usually provides an unbiased estimate of the underlying data with provable error guarantees. Another advantage of sampling methods


is that since they use the original representation of the records, they are easy to use with any data mining application or database operation. In most cases, the error guarantees of sampling methods generalize to the mining behavior of the underlying application. Many synopsis methods such as wavelets, histograms, and sketches are not easy to use for the multi-dimensional cases. The random sampling technique is often the only method of choice for high dimensional applications.

Before discussing the application to data streams, let us examine some properties of the random sampling approach. Let us assume that we have a database $D$ containing $N$ points which are denoted by $X_1 \ldots X_N$. Let us assume that the function $f(D)$ represents an operation which we wish to perform on the database $D$. For example, $f(D)$ may represent the mean or sum of one of the attributes in database $D$. We note that a random sample $S$ from database $D$ defines a random variable $f(S)$ which is (often) closely related to $f(D)$ for many commonly used functions. It is also possible to estimate the standard deviation of $f(S)$ in many cases. In the case of aggregation based functions in linearly separable form (e.g., sum, mean), the law of large numbers allows us to approximate the random variable $f(S)$ as a normal distribution, and characterize the value of $f(D)$ probabilistically. However, not all functions are aggregation based (e.g., min, max). In such cases, it is desirable to estimate the mean $\mu$ and standard deviation $\sigma$ of $f(S)$. These parameters allow us to design probabilistic bounds on the value of $f(S)$. This is often quite acceptable as an alternative to characterizing the entire distribution of $f(S)$. Such probabilistic bounds can be estimated using a number of inequalities which are also often referred to as tail bounds.

The Markov inequality is a weak inequality which provides the following bound for the random variable $X$:

$$P(X > a) \le E[X]/a = \mu/a \qquad (9.1)$$

By applying the Markov inequality to the random variable $(X - \mu)^2/\sigma^2$, we obtain the Chebychev inequality:

$$P(|X - \mu| > a) \le \sigma^2/a^2 \qquad (9.2)$$

While the Markov and Chebychev inequalities are fairly general, they are quite loose in practice, and can be tightened when the distribution of the random variable $X$ is known. We note that the Chebychev inequality is derived by applying the Markov inequality to a function of the random variable $X$. Even tighter bounds can be obtained when the random variable $X$ shows a specific form, by applying the Markov inequality to parameterized functions of $X$ and optimizing the parameter using the particular characteristics of the random variable $X$.
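For completeness, here is the one-line derivation alluded to above (a standard argument, not specific to this chapter): applying the Markov inequality to the non-negative random variable $(X-\mu)^2$ gives

$$P(|X - \mu| > a) = P\big((X-\mu)^2 > a^2\big) \le \frac{E[(X-\mu)^2]}{a^2} = \frac{\sigma^2}{a^2}.$$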


The Chernoff bound [14] applies when $X$ is the sum of several independent and identically distributed Bernoulli random variables, and has a lower tail bound as well as an upper tail bound:

$$P(X < (1 - \delta)\mu) \le e^{-\mu\delta^2/2} \qquad (9.3)$$

$$P(X > (1 + \delta)\mu) \le \max\{2^{-\delta\mu},\, e^{-\mu\delta^2/4}\} \qquad (9.4)$$

Another kind of inequality often used in stream mining is the Hoeffding inequality. In this inequality, we bound the sum of $k$ independent bounded random variables. For example, for a set of $k$ independent random variables lying in the range $[a, b]$, the sum $X$ of these $k$ random variables satisfies the following inequality:

$$P(|X - \mu| > \delta) \le 2 e^{-2k\delta^2/(b-a)^2} \qquad (9.5)$$

We note that the Hoeffding inequality is slightly more general than the Chernoff bound, and both bounds have a similar form for overlapping cases. These bounds have been used for a variety of problems in data stream mining such as classification and query estimation [28, 58]. In general, the method of random sampling is quite powerful, and can be used for a variety of problems such as order statistics estimation and distinct value queries [41, 72].
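As a small illustration of how such tail bounds are used in practice, the following Python sketch (our example; it applies the Hoeffding bound in the form above to the sample mean of $k$ bounded values) inverts the inequality to determine how many samples are needed so that the sample mean is within $\delta$ of the true mean with probability at least $1 - p$.

```python
import math

def hoeffding_sample_size(value_range, delta, failure_prob):
    """Smallest k such that P(|sample_mean - true_mean| > delta) <= failure_prob,
    for k independent samples bounded in an interval of width value_range.
    Derived by solving 2 * exp(-2 k delta^2 / range^2) <= failure_prob for k."""
    return math.ceil((value_range ** 2) * math.log(2.0 / failure_prob)
                     / (2.0 * delta ** 2))

# Example: values in [0, 100]; estimate the mean within 1.0 with 99% confidence.
print(hoeffding_sample_size(value_range=100.0, delta=1.0, failure_prob=0.01))
```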

In many applications, it may be desirable to pick out a sample (reservoir) from the stream with a pre-decided size, and apply the algorithm of interest to this sample in order to estimate the results. One key issue in the case of data streams is that we are not sampling from a fixed data set with known size $N$. Rather, the value of $N$ is unknown in advance, and the sampling must be performed dynamically as data points arrive. Therefore, in order to maintain an unbiased representation of the underlying data, the probability of including a point in the random sample should not be fixed in advance, but should change with the progression of the data stream. For this purpose, reservoir based sampling methods are usually quite effective in practice.

2.1 Random Sampling with a Reservoir

Reservoir based methods [92] were originally proposed in the context of one-pass access of data from magnetic storage devices such as tapes. As in the case of streams, the number of records $N$ is not known in advance and the sampling must be performed dynamically as the records from the tape are read.

Let us assume that we wish to obtain an unbiased sample of size $n$ from the data stream. In this algorithm, we maintain a reservoir of size $n$ from the data stream. The first $n$ points in the data stream are added to the reservoir for initialization. Subsequently, when the $(t+1)$th point from the data stream is received, it is added to the reservoir with probability $n/(t+1)$. In order


to make room for the new point, one of the current points in the reservoir is sampled uniformly at random and removed.

The proof that this sampling approach maintains the unbiased character of the reservoir is straightforward, and uses induction on $t$. The probability of the $(t+1)$th point being included in the reservoir is $n/(t+1)$. The probability of any of the last $t$ points being included in the reservoir is defined by the sum of the probabilities of the events corresponding to whether or not the $(t+1)$th point is added to the reservoir. From the inductive assumption, we know that the first $t$ points have equal probability of being included in the reservoir, and this probability is equal to $n/t$. In addition, since a point remains in the reservoir with probability $(n-1)/n$ when the new point is added, the conditional probability of a point (among the first $t$ points) remaining in the reservoir given that the $(t+1)$th point is added is equal to $(n/t) \cdot (n-1)/n = (n-1)/t$. By summing the probability over the cases where the $(t+1)$th point is added to the reservoir (or not), we get a total probability of $(n/(t+1)) \cdot (n-1)/t + (1 - n/(t+1)) \cdot (n/t) = n/(t+1)$. Therefore, the inclusion of all points in the reservoir has equal probability, which is equal to $n/(t+1)$. As a result, at the end of the stream sampling process, all points in the stream have equal probability of being included in the reservoir, which is equal to $n/N$.
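The algorithm described above is easy to state in code. The following Python sketch is our illustration of the classical reservoir scheme of [92] (function name and structure are ours); it maintains an unbiased sample of size n over a stream of unknown length.

```python
import random

def reservoir_sample(stream, n):
    """Return a uniform random sample of size n from an iterable of unknown length.
    Each item ends up in the sample with probability n / (total number of items)."""
    reservoir = []
    for t, x in enumerate(stream):          # t = number of items seen before x
        if t < n:
            reservoir.append(x)             # the first n points initialize the reservoir
        else:
            # include the (t+1)-th point with probability n/(t+1) ...
            j = random.randrange(t + 1)
            if j < n:
                reservoir[j] = x            # ... evicting a uniformly chosen resident
    return reservoir

# Example: a sample of 10 values from a stream of 10,000 integers.
print(reservoir_sample(range(10000), 10))
```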

In many cases, the stream data may evolve over time, and the corresponding data mining or query results may also change over time. Thus, the results of a query over a more recent window may be quite different from the results of a query over a more distant window. Similarly, the entire history of the data stream may not be relevant for use in a repetitive data mining application such as classification. Recently, the reservoir sampling algorithm was adapted to sample from a moving window over data streams [8]. This is useful for data streams, since only a small amount of recent history is often more relevant than the entire data stream. However, this can sometimes be an extreme solution, since one may desire to sample from varying lengths of the stream history. While recent queries may be more frequent, it is also not possible to completely disregard queries over more distant horizons in the data stream. The method in [4] designs methods for biased reservoir sampling, which uses a bias function to regulate the sampling from the stream. This bias function is quite effective since it regulates the sampling in a smooth way so that queries over recent horizons are more accurately resolved. While the design of a reservoir for an arbitrary bias function is extremely difficult, it is shown in [4] that certain classes of bias functions (exponential bias functions) allow the use of a straightforward replacement algorithm. The advantage of a bias function is that it can smoothly regulate the sampling process so that acceptable accuracy is retained for more distant queries. The method in [4] can also be used in data mining applications so that the quality of the results does not degrade very quickly.


2.2 Concise Sampling

The effectiveness of the reservoir based sampling method can be improved further with the use of concise sampling. We note that the size of the reservoir is sometimes restricted by the available main memory. It is desirable to increase the sample size within the available main memory restrictions. For this purpose, the technique of concise sampling is quite effective.

The method of concise sampling exploits the fact that the number of distinct values of an attribute is often significantly smaller than the size of the data stream. This technique is most applicable while performing univariate sampling along a single dimension. For the case of multi-dimensional sampling, the simple reservoir based method discussed above is more appropriate. The repeated occurrence of the same value can be exploited in order to increase the sample size beyond the relevant space restrictions. We note that when the number of distinct values in the stream is smaller than the main memory limitations, the entire stream can be maintained in main memory, and therefore sampling may not even be necessary. For current desktop systems in which the memory sizes may be of the order of several gigabytes, very large sample sizes can be main memory resident, as long as the number of distinct values does not exceed the memory constraints. On the other hand, for more challenging streams with an unusually large number of distinct values, we can use the following approach.

The sample is maintained as a set $S$ of <value, count> pairs. For those pairs in which the value of count is one, we do not maintain the count explicitly, but we maintain the value as a singleton. The number of elements in this representation is referred to as the footprint and is bounded above by $n$. We note that the footprint size is always smaller than or equal to the true sample size. If the count of any distinct element is larger than 2, then the footprint size is strictly smaller than the sample size. We use a threshold parameter $\tau$ which defines the probability of successive sampling from the stream. The value of $\tau$ is initialized to be 1. As the points in the stream arrive, we add them to the current sample with probability $1/\tau$. We note that if the corresponding value-count pair is already included in the set $S$, then we only need to increment the count by 1. Therefore, the footprint size does not increase. On the other hand, if the value of the current point is distinct from all the values encountered so far, or it exists as a singleton, then the footprint increases by 1. This is because either a singleton needs to be added, or a singleton gets converted to a value-count pair with a count of 2. The increase in footprint size may potentially require the removal of an element from sample $S$ in order to make room for the new insertion. When this situation arises, we pick a new (higher) value of the threshold $\tau'$, and apply this threshold to the footprint in repeated passes. In each pass, we reduce the count of a value with probability $\tau/\tau'$, until at least one value-count pair reverts to a singleton or a singleton is removed. Subsequent


Granularity (Order k) | Averages ($\Phi$ values) | DWT Coefficients ($\psi$ values)
k = 4 | (8, 6, 2, 3, 4, 6, 6, 5) | -
k = 3 | (7, 2.5, 5, 5.5) | (1, -0.5, -1, 0.5)
k = 2 | (4.75, 5.25) | (2.25, -0.25)
k = 1 | (5) | (-0.25)

Table 9.1. An Example of Wavelet Coefficient Computation

points from the stream are sampled with probability $1/\tau'$. As in the previous case, the probability of sampling reduces with stream progression, though we have much more flexibility in picking the threshold parameters in this case. More details on the approach may be found in [41].

One of the interesting characteristics of this approach is that the sample $S$ continues to remain an unbiased representative of the data stream irrespective of the choice of $\tau$. In practice, $\tau'$ may be chosen to be about 10% larger than the value of $\tau$. The choice of different values of $\tau$ provides different tradeoffs between the average (true) sample size and the computational requirements of reducing the footprint size. In general, the approach turns out to be quite robust across wide ranges of the parameter $\tau$.
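The following Python sketch is a simplified illustration of the concise sampling procedure described above (our code, not the original from [41]). It keeps a dictionary of value counts, charges the footprint 1 for a singleton and 2 for a <value, count> pair (an assumption made for illustration), and when the footprint would exceed the budget it raises the threshold and re-tosses every sampled occurrence with probability $\tau/\tau'$ (a simplification of the pass-based count reduction in the text).

```python
import random

class ConciseSample:
    """Simplified sketch of concise sampling over a stream of discrete values."""

    def __init__(self, footprint_budget, growth=1.1):
        self.budget = footprint_budget
        self.growth = growth        # tau' is chosen ~10% larger than tau
        self.tau = 1.0
        self.counts = {}            # value -> number of sampled occurrences

    def _footprint(self):
        return sum(1 if c == 1 else 2 for c in self.counts.values())

    def insert(self, value):
        if random.random() >= 1.0 / self.tau:
            return                  # point not sampled at the current rate 1/tau
        self.counts[value] = self.counts.get(value, 0) + 1
        while self._footprint() > self.budget:
            self._raise_threshold()

    def _raise_threshold(self):
        # Thin the sample so that it looks as if it had been collected at rate 1/tau'.
        new_tau = self.tau * self.growth
        keep_prob = self.tau / new_tau
        for value in list(self.counts):
            kept = sum(1 for _ in range(self.counts[value])
                       if random.random() < keep_prob)
            if kept:
                self.counts[value] = kept
            else:
                del self.counts[value]
        self.tau = new_tau
```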

3. Wavelets

Wavelets [66] are a well known technique which is often used in databases for hierarchical data decomposition and summarization. A discussion of applications of wavelets may be found in [10, 66, 89]. In this chapter, we will discuss the particular case of the Haar Wavelet. This technique is particularly simple to implement, and is widely used in the literature for hierarchical decomposition and summarization. The basic idea in the wavelet technique is to create a decomposition of the data characteristics into a set of wavelet functions and basis functions. The property of the wavelet method is that the higher order coefficients of the decomposition illustrate the broad trends in the data, whereas the more localized trends are captured by the lower order coefficients.

We assume for ease in description that the length $q$ of the series is a power of 2. This is without loss of generality, because it is always possible to decompose a series into segments, each of which has a length that is a power of two. The Haar Wavelet decomposition defines $2^{k-1}$ coefficients of order $k$. Each of these $2^{k-1}$ coefficients corresponds to a contiguous portion of the time series of length $q/2^{k-1}$. The $i$th of these $2^{k-1}$ coefficients corresponds to the segment in the series starting from position $(i-1) \cdot q/2^{k-1} + 1$ to position $i \cdot q/2^{k-1}$. Let us denote this coefficient by $\psi^i_k$ and the corresponding time series segment by $S^i_k$.

At the same time, let us define the average value of the first half of $S^i_k$ by $a^i_k$ and the second half by $b^i_k$. Then, the value of $\psi^i_k$ is given by $(a^i_k - b^i_k)/2$. More formally, if $\Phi^i_k$ denotes the average value of $S^i_k$, then the value of $\psi^i_k$ can be defined recursively as follows:

$$\psi^i_k = (\Phi^{2 \cdot i - 1}_{k+1} - \Phi^{2 \cdot i}_{k+1})/2 \qquad (9.6)$$

The set of Haar coefficients is defined by the $\psi^i_k$ coefficients of order $1$ to $\log_2(q)$. In addition, the global average $\Phi^1_1$ is required for the purpose of perfect reconstruction. We note that the coefficients of different orders provide an understanding of the major trends in the data at a particular level of granularity. For example, the coefficient $\psi^i_k$ is half the quantity by which the first half of the segment $S^i_k$ is larger than the second half of the same segment. Since larger values of $k$ correspond to geometrically reducing segment sizes, one can obtain an understanding of the basic trends at different levels of granularity. We note that this definition of the Haar wavelet makes it very easy to compute by a sequence of averaging and differencing operations. In Table 9.1, we have illustrated how the wavelet coefficients are computed for the case of the sequence $(8, 6, 2, 3, 4, 6, 6, 5)$. This decomposition is illustrated in graphical form in Figure 9.1.

Figure 9.1. Illustration of the Wavelet Decomposition: the series $(8, 6, 2, 3, 4, 6, 6, 5)$ is repeatedly averaged to $(7, 2.5, 5, 5.5)$, $(4.75, 5.25)$ and $(5)$, producing the detail coefficients $(1, -0.5, -1, 0.5)$, $(2.25, -0.25)$ and $(-0.25)$.

We also note that each value can be represented as a sum of $\log_2(8) = 3$ linear decomposition components. In general, the entire decomposition may be represented as a tree of depth 3, which represents the

hierarchical decomposition of the entire series. This is also referred to as the error tree, and was introduced in [73]. In Figure 9.2, we have illustrated the error tree for the wavelet decomposition illustrated in Table 9.1. The nodes in the tree contain the values of the wavelet coefficients, except for a special super-root node which contains the series average. This super-root node is not necessary if we are only considering the relative values in the series, or if the series values have been normalized so that the average is already zero. We further note that the number of wavelet coefficients in this series is 8, which is also the length of the original series. The original series has been replicated just below the error tree in Figure 9.2, and it can be reconstructed by adding or subtracting the values in the nodes along the path leading to that value. We note that each coefficient in a node should be added if we use the left branch below it to reach the series values. Otherwise, it should be subtracted. This natural decomposition means that an entire contiguous range along the series can be reconstructed by using only the portion of the error tree which is relevant to it. Furthermore, we only need to retain those coefficients whose values are significantly large, and therefore affect the values of the underlying series. In general, we would like to minimize the reconstruction error by retaining only a fixed number of coefficients, as defined by the space constraints.

Figure 9.2. The Error Tree from the Wavelet Decomposition: the super-root holds the series average 5, its child holds -0.25, the next level holds (2.25, -0.25), the leaf level holds (1, -0.5, -1, 0.5), and the relevant ranges [1 8], [1 4], [5 8], [1 2], [3 4], [5 6], [7 8] indicate the positions of the original series (8, 6, 2, 3, 4, 6, 6, 5) that each node contributes to.

We further note that the coefficients represented in Figure 9.1 are un-normalized. For a time series $T$, let $W_1 \ldots W_t$ be the corresponding basis vectors of length $t$. In Figure 9.1, each component of these basis vectors is 0, +1, or -1. The list of basis vectors in Figure 9.1 (in the same order as the corresponding wavelets illustrated) is as follows:

(1 -1 0 0 0 0 0 0)
(0 0 1 -1 0 0 0 0)
(0 0 0 0 1 -1 0 0)
(0 0 0 0 0 0 1 -1)
(1 1 -1 -1 0 0 0 0)
(0 0 0 0 1 1 -1 -1)
(1 1 1 1 -1 -1 -1 -1)

The most detailed coefficients have only one +1 and one -1, whereas the most coarse coefficient has $t/2$ +1 and -1 entries. Thus, in this case, we need $2^3 - 1 = 7$ wavelet vectors. In addition, the vector $(1\ 1\ 1\ 1\ 1\ 1\ 1\ 1)$ is needed to represent the special coefficient which corresponds to the series average. Then, if $a_1 \ldots a_t$ are the wavelet coefficients for the wavelet vectors $W_1 \ldots W_t$, the time series $T$ can be represented as follows:

$$T = \sum_{i=1}^{t} a_i \cdot W_i \qquad (9.7)$$

$$\phantom{T} = \sum_{i=1}^{t} (a_i \cdot |W_i|) \cdot \frac{W_i}{|W_i|} \qquad (9.8)$$

While $a_i$ is the un-normalized value from Figure 9.1, the values $a_i \cdot |W_i|$ represent normalized coefficients. We note that the values of $|W_i|$ are different for coefficients of different orders, and may be equal to either $\sqrt{2}$, $\sqrt{4}$ or $\sqrt{8}$ in this particular example. For example, in the case of Figure 9.1, the broadest level un-normalized coefficient is $-0.25$, whereas the corresponding normalized value is $-0.25 \cdot \sqrt{8}$. After normalization, the basis vectors $W_1 \ldots W_t$ are orthonormal, and therefore the sum of the squares of the corresponding (normalized) coefficients is equal to the energy in the time series $T$. Since the normalized coefficients provide a new coordinate representation after axis rotation, Euclidean distances between time series are preserved in this new representation.
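To make the averaging-and-differencing procedure concrete, the short Python sketch below (our illustration; function names are ours) computes the Haar coefficients of the running example $(8, 6, 2, 3, 4, 6, 6, 5)$, reproducing the values in Table 9.1, and checks that the normalized coefficients preserve the energy of the series.

```python
def haar_decompose(series):
    """Return (overall_average, detail_coefficients) for a series whose length is
    a power of two, using repeated averaging and differencing; details are
    collected from the finest level to the coarsest."""
    details = []
    current = list(series)
    while len(current) > 1:
        averages = [(current[2 * i] + current[2 * i + 1]) / 2.0
                    for i in range(len(current) // 2)]
        diffs = [(current[2 * i] - current[2 * i + 1]) / 2.0
                 for i in range(len(current) // 2)]
        details.append(diffs)
        current = averages
    return current[0], details

series = [8, 6, 2, 3, 4, 6, 6, 5]
avg, details = haar_decompose(series)
print(avg)       # 5.0  (the super-root / series average)
print(details)   # [[1.0, -0.5, -1.0, 0.5], [2.25, -0.25], [-0.25]]  (matches Table 9.1)

# Energy check: each normalized coefficient is the raw coefficient times the norm of
# its basis vector (sqrt of the segment length), and their squares sum to the energy.
energy = sum(x * x for x in series)
normalized_sq = avg ** 2 * len(series)         # the all-ones basis vector has norm sqrt(8)
seg_len = 2                                    # the finest-level basis vectors have 2 nonzeros
for level in details:
    normalized_sq += sum(c * c for c in level) * seg_len
    seg_len *= 2
print(energy, normalized_sq)                   # both equal 226
```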

The total number of coefficients is equal to the length of the data stream. Therefore, for very large time series or data streams, the number of coefficients is also large. This makes it impractical to retain the entire decomposition throughout the computation. The wavelet decomposition method provides a natural method for dimensionality reduction, by retaining only the coefficients with large absolute values. All other coefficients are implicitly approximated to zero. This makes it possible to approximately represent the series with a small number of coefficients. The idea is to retain only a pre-defined number of coefficients from the decomposition, so that the error of the reduced representation is minimized. Wavelets are used extensively for efficient and approximate


query processing of different kinds of data [11, 93]. They are particularly useful for range queries, since contiguous ranges can easily be reconstructed with a small number of wavelet coefficients. The efficiency of the query processing arises from the reduced representation of the data. At the same time, since only the small coefficients are discarded, the results are quite accurate.

A key issue for the accuracy of the query processing is the choice of coefficients which should be retained. While it may be tempting to choose only the coefficients with large absolute values, this is not always the best choice, since a more judicious choice of coefficients can lead to minimizing specific error criteria. Two such metrics are the minimization of the mean square error and the maximum error metric. The mean square error minimizes the $L_2$ error in the approximation of the wavelet coefficients, whereas maximum error metrics minimize the maximum error of any coefficient. Another related metric is the relative maximum error, which normalizes the maximum error with the absolute coefficient value.

It has been shown in [89] that the choice of the largest $B$ (normalized) coefficients minimizes the mean square error criterion. This should also be evident from the fact that the normalized coefficients render an orthonormal decomposition, as a result of which the energy in the series is equal to the sum of the squares of the coefficients. However, the use of the mean square error metric is not without its disadvantages. A key disadvantage is that a global optimization criterion implies that the local behavior of the approximation is ignored. Therefore, the approximation arising from reconstruction can be arbitrarily poor for certain regions of the series. This is especially relevant in many streaming applications in which the queries are performed only over recent time windows. In many cases, the maximum error metric provides much more robust guarantees. In such cases, the errors are spread out over the different coefficients more evenly. As a result, the worst-case behavior of the approximation over different queries is much more robust.

Two such methods for minimization of maximum error metrics are discussed in [38, 39]. The method in [38] is probabilistic, but its application of probabilistic expectation is questionable according to [53]. One feature of the method in [38] is that the space is bounded only in expectation, and the variance in space usage is large. The technique in [39] is deterministic and uses dynamic programming in order to optimize the maximum error metric. The key idea in [39] is to define a recursion over the nodes of the tree in top-down fashion. For a given internal node, we compute the least maximum error over the two cases of either keeping or not keeping a wavelet coefficient of this node. In each case, we need to recursively compute the maximum error for its two children over all possible space allocations among the two children nodes. While the method is quite elegant, it is computationally intensive, and it is therefore not suitable for the data stream case. We also note that the coefficient is defined according to


the wavelet coefficient definition, i.e., half the difference between the left hand and right hand side of the time series. While this choice of coefficient is optimal for the $L_2$ metric, this is not the case for maximum or arbitrary $L_p$ error metrics.

Another important topic in wavelet decomposition is the use of multiple measures associated with the time series. The problem of multiple measures refers to the fact that many quantities may simultaneously be tracked in a given time series. For example, in a sensor application, one may simultaneously track many variables such as temperature, pressure and other parameters at each time instant. We would like to perform the wavelet decomposition over multiple measures simultaneously. The most natural technique [89] is to perform the decomposition along the different measures separately and pick the largest coefficients for each measure of the decomposition. This can be inefficient, since a coordinate needs to be associated with each separately stored coefficient and it may need to be stored multiple times. It would be more efficient to amortize the storage of a coordinate across multiple measures. The trade-off is that while a given coordinate may be the most effective representation for a particular measure, it may not simultaneously be the most effective representation across all measures. In [25], it has been proposed to use an extended wavelet representation which simultaneously tracks multi-measure coefficients of the wavelet representation. The idea in this technique is to use a bitmap for each coefficient set to determine which dimensions are retained, and to store all coefficients for this coordinate. The technique has been shown to significantly outperform the methods discussed in [89].

3.1 Recent Research on Wavelet Decomposition in Data Streams

The one-pass requirement of data streams makes the problem of wavelet decomposition somewhat more challenging. However, the case of optimizing the mean square error criterion is relatively simple, since a choice of the largest coefficients can preserve the effectiveness of the decomposition. Therefore, we only need to dynamically construct the wavelet decomposition, and keep track of the largest $B$ coefficients encountered so far.

As discussed in [65], these methods can have a number of disadvantages in many situations, since many parts of the time series may be approximated very poorly. The method in [39] can effectively perform the wavelet decomposition with maximum error metrics. However, since the method uses dynamic programming, it is computationally intensive; it is quadratic in the length of the series. Therefore, it cannot be used effectively for the case of data streams, which require a one-pass methodology in linear time. In [51], it has been shown that all weighted $L_m$ measures can be solved in a space-efficient manner using only $O(n)$ space. In [65], methods have been proposed for one-pass wavelet


synopses with the maximum error metric. It has been shown in [65] that, by using a number of intuitive thresholding techniques, it is possible to approximate the effectiveness of the technique discussed in [39]. A set of independent results obtained in [55] discusses how to minimize non-Euclidean and relative error with the use of wavelet synopses. This includes metrics such as the $L_p$ error or the relative error. Both the works of [65] and [55] were obtained independently and at a similar time. While the method in [65] is more deeply focussed on the use of maximum error metrics, the work in [55] also provides some worst case bounds on the quality of the approximation. The method of [65] depends on experimental results to illustrate the quality of the approximation. Another interesting point made in [55] is that most wavelet approximation methods solve a restricted version of the problem, in which the wavelet coefficient for the basis is defined to be half the difference between the left hand and right hand side of the basis vectors. Thus, the problem is only one of picking the best $B$ coefficients out of this pre-defined set of coefficients. While this is an intuitive method for computation of the wavelet coefficient, and is optimal for the case of the Euclidean error, it is not necessarily optimal for the case of the $L_m$-metric. For example, consider the time series vector $(1, 4, 5, 6)$. In this case, the wavelet transform is $(4, -1.5, -1.5, -0.5)$. Thus, for $B = 1$, the optimal coefficient picked is $(4, 0, 0, 0)$ for any $L_m$-metric. However, for the case of the $L_\infty$-metric, the optimal solution should be $(3.5, 0, 0, 0)$, since $3.5$ represents the average between the minimum and maximum value. Clearly, any scheme which restricts itself only to wavelet coefficients defined in a particular way will not even consider this solution [55]. Almost all methods for non-Euclidean wavelet computation tend to use this approach, possibly as a legacy from the Haar method of wavelet decomposition. This restriction has been removed in [55], which proposes a method for determining the optimal synopsis coefficients for the case of the weighted $L_m$ metric. We distinguish between synopsis coefficients and wavelet coefficients, since the latter are defined by the simple subtractive methodology of the Haar decomposition. A related method was also proposed by Matias and Urieli [75], who discuss a near linear time optimal algorithm for the weighted $L_m$-error. This method is offline, and chooses a basis vector which depends upon the weights.

An interesting extension of the wavelet decomposition method is one in which multiple measures are associated with the time series. A natural solution is to treat each measure separately, and store the wavelet decomposition. However, this can be wasteful, since a coordinate needs to be stored with each coefficient, and we can amortize this storage by storing the same coordinate across multiple measures. A technique in [25] proposes the concept of extended wavelets in order to amortize the coordinate storage across multiple measures. In this representation, one or more coefficients are stored with each coordinate. Clearly, it can be tricky to determine which coordinates to store,


since different coordinates will render larger coefficients across different measures. The technique in [25] uses a dynamic programming method to determine the optimal extended wavelet decomposition. However, this method is not time and space efficient. A method in [52] provides a fast algorithm whose space requirement is linear in the size of the synopsis and logarithmic in the size of the data stream.

Another important point to be noted is that the choice of the best wavelet decomposition is not necessarily pre-defined, but depends upon the particular workload on which the wavelet decomposition is applied. Some interesting papers in this direction [77, 75] design methods for workload aware wavelet synopses of data streams. While this line of work has not been extensively researched, we believe that it is likely to be fruitful in many data stream scenarios.

4. Sketches

The idea of sketches is essentially an extension of the random projection technique [64] to the time series domain. The idea of using this technique for determining representative trends in the time series domain was first observed in [61]. In the method of random projection, we can reduce a data point of dimensionality $d$ to an axis system of dimensionality $k$ by picking $k$ random vectors of dimensionality $d$ and calculating the dot product of the data point with each of these random vectors. Each component of the $k$ random vectors is drawn from the normal distribution with zero mean and unit variance. In addition, the random vector is normalized to one unit in magnitude. It has been shown in [64] that proportional $L_2$ distances between the data points are approximately preserved using this transformation. The accuracy bounds of the distance values are dependent on the value of $k$. The larger the chosen value of $k$, the greater the accuracy, and vice versa.
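A minimal sketch of this construction in Python follows (our illustration using numpy; the unit-magnitude normalization follows the description above, and the final sqrt(d/k) rescaling is a standard normalization added here for readability, not taken from [61] or [64]).

```python
import numpy as np

def make_sketch_vectors(d, k, seed=0):
    """k random projection vectors of dimensionality d: components drawn from
    N(0, 1), each vector then normalized to unit magnitude."""
    rng = np.random.default_rng(seed)
    vectors = rng.standard_normal((k, d))
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def sketch(point, vectors):
    """Reduce a d-dimensional point to its k dot products with the random vectors."""
    return vectors @ point

# Example: sketch distances are roughly proportional to original distances;
# rescaling by sqrt(d/k) makes the two numbers directly comparable.
rng = np.random.default_rng(1)
u, w = rng.standard_normal(1000), rng.standard_normal(1000)
V = make_sketch_vectors(d=1000, k=32)
print(np.linalg.norm(u - w))
print(np.sqrt(1000 / 32) * np.linalg.norm(sketch(u, V) - sketch(w, V)))
```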

This general principle can be easily extended to the time series domain, by recognizing the fact that the length of a time series may be treated as its dimensionality, and correspondingly we need to compute a random vector of length equal to the time series, and use it for the purpose of sketch computation. If desired, the same computation can be performed over a sliding window of a given length by choosing a random vector of appropriate size. As proposed in [61], the following approximation bounds are preserved:

Lemma 9.1 Let L be a set of vectors of length l, for fixed ǫ < 1/2, and k = 9 · log|L|/ǫ². Consider a pair of vectors u, w in L, such that the corresponding sketches are denoted by S(u) and S(w) respectively. Then, we have:

(1 − ǫ) · ||u − w||2 ≤ ||S(u) − S(w)|| ≤ (1 + ǫ) · ||u − w||2     (9.9)

with probability 1/2. Here ||U − V||2 is the L2 distance between two vectors U and V.

The generalization to time series is fairly straightforward, and the work in [61] makes two primary contributions in extending the sketch methodology to finding time series trends.

4.1 Fixed Window Sketches for Massive Time Series

In this case, we wish to determine sliding window sketches with a fixed window length l. For each window of length l, we need to perform l · k operations for a sketch of size k. Since there are O(n − l) sliding windows, this will require O(n · l · k) computations. When l is large, and is of the same order of magnitude as the time series, the computation may be quadratic in the size of the series. This can be prohibitive for very large time series, as is usually the case with data streams. The key observation in [61] is that all such sketches can be viewed as the problem of computing the polynomial convolution of the random vector of appropriate length with the time series. Since the problem of polynomial convolution can be computed efficiently using the fast Fourier transform, this also means that the sketches may be computed efficiently. The problem of polynomial convolution is defined as follows:

Definition 9.2 Given two vectors A[1 . . . a] and B[1 . . . b], a ≥ b, their convolution is the vector C[1 . . . a + b] where C[k] = Σ_{i=1}^{b} A[k − i] · B[i] for k ∈ [2, a + b], with any out of range references assumed to be zero.

The key point here is that the above polynomial convolution can be computed using the FFT, in O(a · log(b)) operations rather than O(a · b) operations. This effectively means the following:

Lemma 9.3 Sketches of all subvectors of length l can be computed in time O(n · k · log(l)) using polynomial convolution.
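
To make the convolution idea concrete, the following sketch (Python with numpy; the function name, the power-of-two FFT length and the seeding are our own illustrative choices) computes the sketches of all length-l windows by convolving the series with each reversed random vector, instead of the naive O(n · l · k) loop. This simple version uses one full-length FFT per random vector; the blocked convolution of [61] achieves the tighter O(n · k · log(l)) bound.

import numpy as np

def all_window_sketches(x, l, k, seed=0):
    # Sketch every length-l window of the series x against k random vectors.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = 1 << (n + l - 1).bit_length()            # FFT length (next power of two)
    X = np.fft.rfft(x, m)
    R = rng.standard_normal((k, l))              # one random vector per sketch component
    out = np.empty((n - l + 1, k))
    for j in range(k):
        # Convolving x with the reversed random vector yields all sliding-window
        # dot products of x with R[j] in a single FFT-based pass.
        c = np.fft.irfft(X * np.fft.rfft(R[j][::-1], m), m)
        out[:, j] = c[l - 1 : n]                 # windows starting at positions 0 .. n-l
    return out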

4.2 Variable Window Sketches of Massive Time Series

The method in the previous subsection discussed the problem of sketch computation for a fixed window length. The more general version of the problem is one in which we wish to compute the sketch for any subvector between length l and u. In the worst case this comprises O(n²) subvectors, most of which will have length O(n). Therefore, the entire algorithm may require O(n³) operations, which can be prohibitive for massive time series streams.

The key idea in [61] is to store a pool of sketches. The size of this pool is significantly smaller than the entire set of sketches needed. However, it is carefully chosen so that the sketch of any sub-vector in the original vector can be computed in O(1) time fairly accurately. In fact, it can be shown that the approximate sketches computed using this approach satisfy a slightly relaxed version of Lemma 9.1. We refer the reader to [61] for details.

4.3 Sketches and their applications in Data Streams

In the previous sections we discussed the application of sketches to the problem of massive time series. Some of the methods such as fixed window sketch computation are inherently offline. This does not suffice in many scenarios in which it is desirable to continuously compute the sketch over the data stream. Furthermore, in many cases, it is desirable to efficiently use this sketch in order to work with a variety of applications such as query estimation. In this subsection, we will discuss the applications of sketches in the data stream scenario. Our earlier discussion corresponds to a sketch of the time series itself, and entails the storage of the random vector required for sketch generation. While such a technique can be used effectively for massive time series, it cannot always be used for time series data streams.

However, in certain other applications, it may be desirable to track the frequencies of the distinct values in the data stream. In this case, if (u1 . . . uN) are the frequencies of the N distinct values in the data stream, then the sketch is defined by the dot product of the vector (u1 . . . uN) with a random vector of size N. As in the previous case, the number of distinct items N may be large, and therefore the size of the corresponding random vector will also be large. A natural solution is to pre-generate a set of k random vectors, and whenever the ith item is received, we add r^j_i to the jth sketch component. Therefore, the k random vectors may need to be pre-stored in order to perform the computation. However, the explicit storage of the random vectors would defeat the purpose of the sketch computation, because of the high space complexity.

The key here is to store the random vectors implicitly in the form of a seed, which can be used to dynamically generate the vector. The key idea discussed in [6] is that it is possible to generate the random vectors from a seed of size O(log(N)), provided that one is willing to work with the restriction that the values of r^j_i ∈ {−1, +1} are only 4-wise independent. We note that having a seed of small size is critical in terms of the space-efficiency of the method. Furthermore, it has been shown in [6] that the theoretical results only require 4-wise independence. In [44], it has also been shown how to use Reed-Muller codes in order to generate 7-wise independent random numbers. These methods suffice for the purpose of wavelet decomposition of the frequency distribution of different items.
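
A standard way of realizing such a seed is sketched below (Python; this is a generic 4-wise independent construction based on random degree-3 polynomials over a prime field, offered only as an illustration and not as the exact generator of [6]; the ±1 mapping via the low bit has a negligible bias of about 1/P):

import random

P = (1 << 61) - 1                     # a large prime defining the field of the polynomial hash

def make_seed(rng):
    # The "seed" is simply the four coefficients of a random degree-3 polynomial mod P.
    # Polynomials of degree at most 3 with random coefficients form a 4-wise independent family.
    return tuple(rng.randrange(P) for _ in range(4))

def r(seed, i):
    # Pseudo-random +1/-1 value r_i, regenerated on demand from the seed.
    a0, a1, a2, a3 = seed
    h = (a0 + a1 * i + a2 * i * i + a3 * i * i * i) % P
    return 1 if (h & 1) else -1

seeds = [make_seed(random.Random(s)) for s in range(32)]      # one seed per sketch component
sketch = [0.0] * 32

def update(item, count=1):
    # On arrival of an item (optionally with a frequency count), add count * r_i^j to the
    # j-th sketch component; no random vector is ever stored explicitly.
    for j, sd in enumerate(seeds):
        sketch[j] += count * r(sd, item)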

Some key properties of the pseudo-random number generation approach and the sketch representation are as follows:

A given component r^j_i can be generated in poly-logarithmic time from the seed.

The dot-product of two vectors can be approximately computed using only their sketch representations. This follows from the fact that the

dot product of two vectors is closely related to the Euclidean distance, a quantity easily approximated by the random projection approach [64]. Specifically, if U and V are two (normalized) vectors, then the Euclidean distance and dot product are related as follows:

||U − V||² = ||U||² + ||V||² − 2 · U · V     (9.10)

This relationship can be used to establish bounds on the quality of the dot product approximation of the sketch vectors. We refer to [44] for details of the proof.

The first property ensures that the sketch components can be updated and maintained efficiently. Whenever the ith value is received, we only need to add r^j_i to the jth component of the sketch vector. Since the quantity r^j_i can be efficiently computed, it follows that the update operations can be performed efficiently as well. In the event that the data stream also incorporates frequency counts with the arriving items (item i is associated with frequency count f(i)), then we simply need to add f(i) · r^j_i to the jth sketch component. We note that the efficient and accurate computation of the dot product of a given time series with the random vector is a key primitive which can be used to compute many properties such as the wavelet decomposition. This is because each wavelet coefficient can be computed as the dot product of the wavelet basis with the corresponding time series data stream; an approximation may be determined by using only their sketches. The key issue here is that we also need the sketch representation of the wavelet basis vectors, each of which may take O(N) time in the worst case. In general, this can be time consuming; however the work in [44] shows how to do this in poly-logarithmic time for the special case in which the vectors are Haar-basis vectors. Once the coefficients have been computed, we only need to retain the B coefficients with the highest energy.

We note that one property of the results in [44] is that it uses the sketch representation of the frequency distribution of the original stream in order to derive the wavelet coefficients. A recent result in [16] works directly with the sketch representation of the wavelet coefficients rather than the sketch representation of the original data stream. Another advantage of the work in [16] is that the query times are much more efficient, and the work extends to the multi-dimensional domain. We note that while the wavelet representation in [44] is space efficient, the entire synopsis structure may need to be touched for updates, and every wavelet coefficient must be touched in order to find the best B coefficients. The technique in [16] improves the time and space efficiency for both updates and queries.

The method of sketches can be effectively used for second moment and join estimation. First, we discuss the problem of second moment estimation [6] and

illustrate how it can be used for the problem of estimating the size of self joins. Consider a set of N quantitative values U = (u1 . . . uN). We would like to estimate the second moment |U|². Then, as before, we generate the random vectors r1 . . . rk (each of size N), and compute the dot product of these random vectors with U to create the sketch components denoted by S1 . . . Sk. Then, it can be shown that the expected value of S_i² is equal to the second moment. In fact, the approximation can be bounded with high probability.

Lemma 9.4 By selecting the median of O(log(1/δ)) averages of O(1/ǫ²) copies of S_i², it is possible to guarantee the accuracy of the sketch based approximation to within 1 + ǫ with probability at least 1 − δ.

In order to prove the above result, the first step is to show that the expected value of S_i² is equal to the second moment, and that the variance of the variable S_i² is at most twice the square of the expected value. The orthogonality of the random projection vectors can be used to show the first result, and the 4-wise independence of the values of r^j_i can be used to show the second. The relationship between the expected value and the variance implies that the Chebyshev inequality can be used to prove that the average of O(1/ǫ²) copies provides the ǫ bound with a constant probability which is at least 7/8. This constant probability can be tightened to at least 1 − δ (for any small value of δ) with the use of the median of O(log(1/δ)) independent copies of these averages. This is because the median would lie outside the ǫ-bound only if more than log(1/δ)/2 copies (the minimum required number of copies) lie outside the ǫ-bound. However, the expected number of copies which lie outside the ǫ-bound is only log(1/δ)/8, which is less than the above-mentioned required number of copies by 3 · log(1/δ)/8. The Chernoff tail bounds can then be applied to the random variable representing the number of copies lying outside the ǫ-bound. This can be used to show that the probability of more than half the log(1/δ) copies lying outside the ǫ-bound is at most δ. Details of the proof can be found in [6].
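
The estimation procedure of Lemma 9.4 can be written down directly, as in the sketch below (Python; the constants inside the ceiling expressions and the function names are our own illustrative choices, and the sign generator is a generic 4-wise independent polynomial hash rather than the specific generator of [6]):

import math
import random
import statistics

PRIME = (1 << 61) - 1

def sign(seed, i):
    # 4-wise independent +1/-1 value from a random degree-3 polynomial hash.
    a0, a1, a2, a3 = seed
    return 1 if ((a0 + a1 * i + a2 * i * i + a3 * i * i * i) % PRIME) & 1 else -1

def second_moment_estimate(stream, eps=0.2, delta=0.05, seed=0):
    # Median of O(log(1/delta)) averages of O(1/eps^2) squared sketch components S_i^2;
    # the result estimates the second moment (and hence the self-join size).
    rng = random.Random(seed)
    groups = max(1, math.ceil(math.log(1.0 / delta)))
    copies = max(1, math.ceil(8.0 / eps ** 2))
    seeds = [[tuple(rng.randrange(PRIME) for _ in range(4)) for _ in range(copies)]
             for _ in range(groups)]
    S = [[0.0] * copies for _ in range(groups)]
    for item in stream:                              # a single pass over the stream
        for g in range(groups):
            for c in range(copies):
                S[g][c] += sign(seeds[g][c], item)
    averages = [sum(x * x for x in S[g]) / copies for g in range(groups)]
    return statistics.median(averages)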

We note that the second moment immediately provides an estimation for self-joins. If u_i is the number of items corresponding to the ith value, then the second moment estimation is exactly the size of the self-join. We further note that the dot product function is not the only one which can be estimated from the sketch. In general, many functions such as the dot product, the L2 distance, or the maximum frequency items can be robustly estimated from the sketch. This is essentially because the sketch simply projects the time series onto a new set of (expected) orthogonal vectors. Therefore many rotational invariant properties such as the L2 distance, dot product, or second moment are approximately preserved by the sketch.

A number of interesting techniques have been discussed in [5, 26, 27] in order to perform the estimation more effectively over general joins and multi-joins. Consider the multi-join problem on relations R1, R2, R3, in which we

wish to join attribute A of R1 with attribute B of R2, and attribute C of R2 with attribute D of R3. Let us assume that the join attribute of R1 with R2 has N distinct values, and the join attribute of R2 with R3 has M distinct values. Let f(i) be the number of tuples in R1 with value i for attribute A. Let g(i, j) be the number of tuples in R2 with values i and j for attributes B and C respectively. Let h(j) be the number of tuples in R3 with value j for join attribute D. Then, the total estimated join size J is given by the following:

J = Σ_{i=1}^{N} Σ_{j=1}^{M} f(i) · g(i, j) · h(j)     (9.12)

In order to estimate the join size, we create two independently generated families of random vectors r1 . . . rk and s1 . . . sk. We dynamically maintain the following quantities, as the stream points are received:

Z^j_1 = Σ_{i=1}^{N} f(i) · r^j_i     (9.13)

Z^j_2 = Σ_{i=1}^{N} Σ_{k=1}^{M} g(i, k) · r^j_i · s^j_k     (9.14)

Z^j_3 = Σ_{k=1}^{M} h(k) · s^j_k     (9.15)

It can be shown [5] that the quantity Z^j_1 · Z^j_2 · Z^j_3 estimates the join size. We can use the multiple components of the sketch (different values of j) in order to improve the accuracy. It can be shown that the variance of this estimate is equal to the product of the self-join sizes for the three different relations. Since the tail bounds use the variance in order to provide quality estimates, a large value of the variance can reduce the effectiveness of such bounds. This is particularly a problem if the composite join has a small size, whereas the product of the self-join sizes is very large. In such cases, the errors can be very large in relation to the size of the result itself. Furthermore, the product of self-join sizes increases with the number of joins. This degrades the results. We further note that the error bound results for sketch based methods are proved with the use of the Chebyshev inequality, which depends upon a low ratio of the variance to result size. A high ratio of variance to result size makes this inequality ineffective, and therefore the derivation of worst-case bounds requires a greater number of sketch components.
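
A minimal sketch of this estimator is shown below (Python; the two sign families are again generic 4-wise independent polynomial hashes, and the function names and the choice of k are our own illustrative assumptions rather than the exact procedure of [5, 26]):

import random

PRIME = (1 << 61) - 1

def sign(seed, i):
    # 4-wise independent +1/-1 value from a random degree-3 polynomial hash.
    a0, a1, a2, a3 = seed
    return 1 if ((a0 + a1 * i + a2 * i * i + a3 * i * i * i) % PRIME) & 1 else -1

k = 64
rng = random.Random(7)
r_seeds = [tuple(rng.randrange(PRIME) for _ in range(4)) for _ in range(k)]   # family r
s_seeds = [tuple(rng.randrange(PRIME) for _ in range(4)) for _ in range(k)]   # family s
Z1, Z2, Z3 = [0.0] * k, [0.0] * k, [0.0] * k

def tuple_r1(a_value):                 # tuple of R1 with value i on its join attribute
    for j in range(k):
        Z1[j] += sign(r_seeds[j], a_value)

def tuple_r2(b_value, c_value):        # tuple of R2 with values (i, k) on its two join attributes
    for j in range(k):
        Z2[j] += sign(r_seeds[j], b_value) * sign(s_seeds[j], c_value)

def tuple_r3(d_value):                 # tuple of R3 with value k on its join attribute
    for j in range(k):
        Z3[j] += sign(s_seeds[j], d_value)

def join_size_estimate():
    # Each product Z1[j] * Z2[j] * Z3[j] is an unbiased estimate of the join size;
    # averaging over the k components reduces the variance.
    return sum(Z1[j] * Z2[j] * Z3[j] for j in range(k)) / k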

An interesting observation in [26] is that of sketch partitioning. In this technique, we intelligently partition the join attribute domain-space and use it in order to compute separate sketches of each partition. The resulting join

estimate is computed as the sum over all partitions. The key observation here is that intelligent domain partitioning reduces the variance of the estimate, and is therefore more accurate for practical purposes. This method has also been discussed in more detail for the problem of multi-query processing [27].

Another interesting trick for improving join size estimation is that of sketch skimming [34]. The key insight is that the variance of the join estimation is highly affected by the most frequent components, which are typically small in number. A high variance is undesirable for accurate estimations. Therefore, we treat the frequent items in the stream specially, and can separately track them. A skimmed sketch can be constructed by subtracting out the sketch components of these frequent items. Finally, the join size can be estimated as a 4-wise addition of the join estimations across the two pairs of partitions. It has been shown that this approach provides superior results because of the reduced variance of the estimations from the skimmed sketch.

4.4 Sketches with p-stable distributions

In our earlier sections, we did not discuss the effect of the distribution from which the random vectors are drawn. While the individual components of the random vector were drawn from the normal distribution, this is not the only possibility for sketch generation. In this section, we will discuss a special set of distributions for the random vectors which are referred to as p-stable distributions. A distribution L is said to be p-stable, if it satisfies the following property:

Definition 9.5 For any set of N i.i.d. random variables X1 . . . XN drawn from a p-stable distribution L, and any set of real numbers a1 . . . aN, the random variable (Σ_{i=1}^{N} a_i · X_i)/(Σ_{i=1}^{N} a_i^p)^{1/p} is drawn from L.

A classic example of the p-stable distribution is the normal distribution with p = 2. In general, p-stable distributions can be defined for p ∈ (0, 2].

The use of p-stable distributions has implications in the construction of sketches. Recall that the jth sketch component is of the form Σ_{i=1}^{N} u_i · r^j_i, where u_i is the frequency of the ith distinct value in the data stream. If each r^j_i is drawn from a p-stable distribution, then the above sum is also drawn from a (scaled) p-stable distribution, where the scale coefficient is given by (Σ_{i=1}^{N} u_i^p)^{1/p}. The ability to use the exact distribution of the sketch provides much stronger results than just the use of the mean and variance of the sketch components. We note that the use of only the mean and variance of the sketch components often restricts us to the use of generic tail bounds (such as the Chebyshev inequality) which may not always be tight in practice. However, the knowledge of the sketch distribution can potentially provide very tight bounds on the behavior of each sketch component.

An immediate observation is that the scale coefficient (Σ_{i=1}^{N} u_i^p)^{1/p} of each sketch component is simply the Lp-norm of the frequency distribution of the incoming items in the data stream. By using O(log(1/δ)/ǫ²) independent sketch components, it is possible to approximate the Lp norm within ǫ with probability at least 1 − δ. We further note that the use of the L0 norm provides the number of distinct values in the data stream. It has been shown in [17] that by using p → 0 (small values of p), it is possible to closely approximate the number of distinct values in the data stream.
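
For instance, for p = 1 the stable distribution is the Cauchy distribution, and the L1-norm can be read off the sketch components, as in the sketch below (Python with numpy; the sample sizes and function name are our own illustrative choices, and in a true streaming setting the Cauchy entries would be generated from seeds rather than materialized as a matrix):

import numpy as np

def l1_norm_estimate(freqs, k=200, seed=0):
    # Each sketch component sum_i u_i * r_i with Cauchy (1-stable) entries r_i is itself
    # Cauchy-distributed with scale ||u||_1, so the median of the absolute components
    # recovers the L1-norm (the median of |standard Cauchy| equals 1).
    rng = np.random.default_rng(seed)
    u = np.asarray(freqs, dtype=float)
    R = rng.standard_cauchy((k, len(u)))
    S = R @ u                                     # the k sketch components
    return np.median(np.abs(S))

# Example: l1_norm_estimate([3, 0, 5, 2]) should be close to 10.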

Other Applications of Sketches. The method of sketches can be used for a variety of other applications. Some examples of such applications include the problem of heavy hitters [13, 18, 76, 21], a problem in which we determine the most frequent items over data streams. Other problems include those of finding significant network differences over data streams [19] and finding quantiles [46, 50] over data streams. Another interesting application is that of finding significant differences between data streams [32, 33], which has applications in numerous change detection scenarios. Another recent application of sketches has been to XML and tree-structured data [82, 83, 87]. In many cases, these synopses can be used for efficient resolution of the structured queries which are specified in the XQuery pattern-specification language.

Recently, sketch based methods have found considerable applications to efficient communication of signals in sensor networks. Since sensors are battery constrained, it is critical to reduce the communication costs of the transmission. The space efficiency of the sketch computation approach implies that it can also be used in the sensor network domain in order to minimize the communication costs over different processors. In [22, 67, 50], it has been shown how to extend the sketch method to distributed query tracking in data streams. A particularly interesting method is the technique in [22], which reduces the communication costs further by using sketch skimming techniques [34]. The key idea is to use models to estimate the future behavior of the sketch, and make changes to the sketch only when there are significant changes to the underlying model.

4.5 The Count-Min Sketch

One interesting variation of the sketching method for data streams is the count-min sketch, which uses a hash-based sketch of the stream. The broad ideas in the count-min sketch were first proposed in [13, 29, 30]. Subsequently, the method was enhanced with pairwise-independent hash functions, formalized, and extensively analyzed for a variety of applications in [20].

In the count-min sketch, we use ⌈ln(1/δ)⌉ pairwise independent hash functions, each of which maps on to uniformly random integers in the range [0, e/ǫ],

where e is the base of the natural logarithm. Thus, we maintain a total of ⌈ln(1/δ)⌉ hash tables, and there are a total of O(ln(1/δ)/ǫ) hash cells. This apparently provides a better space complexity than the O(ln(1/δ)/ǫ²) bound of AMS sketches in [6]. We will discuss more on this point later.

We apply each hash function to any incoming element in the data stream, and add the count of the element to each of the corresponding ⌈ln(1/δ)⌉ positions in the different hash tables. We note that because of collisions, the hash table counts will not exactly correspond to the count of any element in the incoming data stream. When incoming frequency counts are non-negative, the hash table counts will over-estimate the true count, whereas when the incoming frequency counts are either positive or negative (deletions), the hash table count could be either an over-estimation or an under-estimation. In either case, the use of the median count of the hash position of a given element among the O(ln(1/δ)) counts provided by the different hash functions provides an estimate which is within a 3 · ǫ factor of the L1-norm of element counts with probability at least 1 − δ^{1/4} [20]. In other words, if the frequencies of the N different items are f1 . . . fN, then the estimated frequency of the item i lies between f_i − 3 · ǫ · Σ_{i=1}^{N} |f_i| and f_i + 3 · ǫ · Σ_{i=1}^{N} |f_i| with probability at least 1 − δ^{1/4}. The proof of this result relies on the fact that the expected inaccuracy of a given entry j is at most ǫ · Σ_{i=1}^{N} |f_i|/e, if the hash function is sufficiently uniform. This is because we expect the count of the other (incorrect) entries which map onto the position of j to be Σ_{i∈[1,N], i≠j} |f_i| · ǫ/e for a sufficiently uniform hash function with ⌈e/ǫ⌉ entries. This is at most equal to ǫ · Σ_{i=1}^{N} |f_i|/e. By the Markov inequality, the probability of this number exceeding 3 · ǫ · Σ_{i=1}^{N} |f_i| is less than 1/(3 · e) < 1/8. By using the earlier Chernoff bound trick (as in AMS sketches) in conjunction with the median selection operation, we get the desired result.

In the case of non-negative counts, the minimum count of any of the ⌈ln(1/δ)⌉ possibilities provides a tighter ǫ-bound (of the L1-norm) with probability at least 1 − δ. In this case, the estimated frequency of item i lies between f_i and f_i + ǫ · Σ_{i=1}^{N} f_i with probability at least 1 − δ. As in the previous case, the expected inaccuracy is ǫ · Σ_{i=1}^{N} f_i/e. This is less than the maximum bound by a factor of e. By applying the Markov inequality, it is clear that the probability that the bound is violated for a given entry is at most 1/e. Therefore, the probability that it is violated by all ⌈ln(1/δ)⌉ entries is at most (1/e)^{ln(1/δ)} = δ.
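
The data structure itself is very simple to implement, as in the sketch below (Python; the class name, the pairwise-independent hash construction and the assumption that items are integer identifiers are our own illustrative choices, and only the non-negative-count point query of [20] is shown):

import math
import random

class CountMinSketch:
    # ceil(ln(1/delta)) hash tables with ceil(e/eps) cells each; for non-negative counts
    # the minimum cell satisfies f_i <= estimate <= f_i + eps * ||f||_1 with
    # probability at least 1 - delta.
    def __init__(self, eps=0.01, delta=0.01, seed=0):
        self.depth = math.ceil(math.log(1.0 / delta))
        self.width = math.ceil(math.e / eps)
        self.p = (1 << 61) - 1
        rng = random.Random(seed)
        # pairwise-independent hash functions h(x) = ((a*x + b) mod p) mod width
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _cell(self, j, item):
        a, b = self.hashes[j]
        return ((a * item + b) % self.p) % self.width

    def update(self, item, count=1):
        # Add the count of the (integer) item to one cell in each of the hash tables.
        for j in range(self.depth):
            self.table[j][self._cell(j, item)] += count

    def estimate(self, item):
        # The minimum count over the tables is the point-query estimate.
        return min(self.table[j][self._cell(j, item)] for j in range(self.depth))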

For the case of non-negative vectors, the dot product can be estimated by computing the dot product on the corresponding entries in the hash table. Each of the ⌈ln(1/δ)⌉ such dot products is an over-estimate, and the minimum of these provides an ǫ-bound with probability at least 1 − δ. The dot product result immediately provides bounds for join size estimation. Details of extending the method to other applications such as heavy hitters and quantiles may be found

in [20]. In many of these methods, the time and space complexity is bounded above by O(ln(1/δ)/ǫ), which is again apparently superior to the AMS sketch.

As noted in [20], the ǫ-bound in the count-min sketch cannot be directly compared with that of the AMS sketch. This is because the AMS sketch provides the ǫ-bound as a function of the L2-norm, whereas the method in [20] provides the ǫ-bound only in terms of the L1-norm. The L1-norm can be quadratically larger (than the L2-norm) in the most challenging case of non-skewed distributions, and the ratio between the two may be as large as √N. Therefore, the equivalent value of ǫ in the count-min sketch can be smaller than that in the AMS sketch by a factor of √N. Since N is typically large, and is in fact the motivation of the sketch-based approach, the worst-case time and space complexity of a truly equivalent count-min sketch may be significantly larger for practical values of ǫ. While this observation has been briefly mentioned in [20], there seems to be some confusion on this point in the current literature. This is because of the overloaded use of the parameter ǫ, which has a different meaning for the AMS and count-min sketches. For the skewed case (which is quite common), the ratio of the L1-norm to the L2-norm reduces. However, since this case is less challenging, the general methods no longer remain relevant, and a number of other specialized methods (e.g., sketch skimming [34]) exist in order to improve the experimental and worst-case effectiveness of both kinds of sketches. It would be interesting to experimentally compare the count-min and AMS methods to find out which is superior in different kinds of skewed and non-skewed cases. Some recent results [91] seem to suggest that the count-min sketch is experimentally superior to the AMS sketch in terms of maintaining counts of elements. On the other hand, the AMS sketch seems to be superior in terms of estimating aggregate functions such as the L2-norm. Thus, the count-min sketch does seem to have a number of practical advantages in many scenarios.

4.6 Related Counting Methods: Hash Functions for Determining Distinct Elements

The method of sketches is a probabilistic counting method whereby a randomized function is applied to the data stream in order to perform the counting in a space-efficient way. While sketches are a good method to determine large aggregate signals, they are not very useful for counting infrequently occurring items in the stream. For example, problems such as the determination of the number of distinct elements cannot be performed with sketches. For this purpose, hash functions turn out to be a useful choice.

Consider a hash function that renders a mapping from a given word to an integer in the range [0, 2^L − 1]. Therefore, the binary representation of that integer will have length L. The position (the least significant and rightmost bit is counted as 0) of the rightmost 1-bit of the binary representation of that integer

is tracked, and the largest such value is retained. This value is logarithmically related to the number of distinct elements [31] in the stream.

The intuition behind this result is quite simple. For a sufficiently uniformly distributed hash function, the probability of the ith bit on the right taking on the first 1-value is simply equal to 2^{−i−1}. Therefore, for N distinct elements, the expected number of records taking on the ith bit as the first 1-value is 2^{−i−1} · N. Therefore, when i is picked larger than log(N), the expected number of such bitstrings falls exponentially below 1. It has been rigorously shown [31] that the expected position of the rightmost bit E[R] is logarithmically related to the number of distinct elements as follows:

E[R] = log2(φN), φ = 0.77351     (9.16)

The standard deviation is σ(R) = 1.12. Therefore, the value of R provides an estimate for the number of distinct elements N.
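
A toy version of this counter is shown below (Python; the hash function, the word length and the use of a single hash function are our own illustrative simplifications of [31], which combines many hash functions to reduce the variance of the estimate):

import random

PHI = 0.77351

def rightmost_one_position(x):
    # Position of the rightmost 1-bit, with the least significant bit counted as 0.
    return (x & -x).bit_length() - 1

def distinct_count_estimate(stream, num_bits=32, seed=0):
    # Hash every element, track the largest rightmost-1-bit position R seen,
    # and return 2^R / PHI as the estimate of the number of distinct elements.
    rng = random.Random(seed)
    p = (1 << 61) - 1
    a, b = rng.randrange(1, p), rng.randrange(p)
    R = 0
    for x in stream:
        h = ((a * hash(x) + b) % p) & ((1 << num_bits) - 1)
        if h:                                    # ignore the rare all-zero hash value
            R = max(R, rightmost_one_position(h))
    return (2 ** R) / PHI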

The hash function technique is very useful for those estimations in which non-repetitive elements have the same level of importance as repetitive elements. Some examples of such functions are those of finding distinct values [31, 43], mining inverse distributions [23], or determining the cardinality of set expressions [35]. The method in [43] uses a technique similar to that discussed in [31] in order to obtain a random sample of the distinct elements. This is then used for estimation. In [23], the problem of inverse distributions is discussed, in which it is desirable to determine the elements in the stream with a particular frequency level. Clearly, such an inverse query is made difficult by the fact that a query for an element with very low frequency is as likely as a query for an element with very high frequency. The method in [23] solves this problem using a hash based approach similar to that discussed in [31]. Another related problem is that of finding the number of distinct elements in a join after eliminating duplicates. For this purpose, a join-distinct sketch (or JD-Sketch) was proposed in [36], which uses a 2-level adaptation of the hash function approach in [31].

4.7 Advantages and Limitations of Sketch Based Methods

One of the key advantages of sketch based methods is that they require space which is sublinear in the data size being considered. Another advantage of sketch based methods is that it is possible to maintain sketches in the presence of deletions. This is often not possible with many synopsis methods such as random samples. For example, when the ith item with frequency f(i) is deleted, the jth component of the sketch can be updated by subtracting f(i) · r^j_i from it. Another advantage of using sketch based methods is that they are extraordinarily space efficient, and require space which is logarithmic in the number of distinct items in the stream. Since the number of distinct items is significantly smaller than the size of the stream itself, this is an extremely low space requirement.

We note that sketches are based on Lipschitz embeddings, which preserve a number of aggregate measures such as the Lp norm or the dot product. However, the entire distribution of the data (including the local temporal behavior) is not captured in the sketch representation, unless one is willing to work with a much larger space requirement.

Most sketch methods are based on analysis along a single dimensional stream of data points. Many problems in the data stream scenario are inherently multi-dimensional, and may in fact involve hundreds or thousands of independent and simultaneous data streams. In such cases, it is unclear whether sketch based methods can be easily extended. While some recent work in [16] provides a few limited methods for multi-dimensional queries, these are not easily extensible for more general problems. This problem is not however unique to sketch based methods. Many other summarization methods such as wavelets or histograms can be extended in a limited way to the multi-dimensional case, and do not work well beyond dimensionalities of 4 or 5.

While the concept of sketches is potentially powerful, one may question whether sketch based methods have been used for the right problems in the data stream domain. Starting with the work in [6], most work on sketches focuses on the aggregate frequency behavior of individual items rather than the temporal characteristics of the stream. Some examples of such problems are those of finding the frequent items, estimation of frequency moments, and join size estimation. The underlying assumption of these methods is an extremely large domain size of the data stream. The actual problems solved (aggregate frequency counts, join size estimation, moments) are relatively simple for modest domain sizes in many practical problems over very fast data streams. In these cases, temporal information in terms of the sequential arrival of items is aggregated and therefore lost. Some sketch-based techniques such as those in [61] perform temporal analysis over specific time windows. However, this method has much larger space requirements. It seems to us that many of the existing sketch based methods can be easily extended to the temporal representation of the stream. It would be interesting to explore how these methods compare with other synopsis methods for temporal stream representation.

We note that the problem of aggregate frequency counts is made difficult only by the assumption of very large domain sizes, and not by the speed of the stream itself. It can be argued that in most practical applications, the data stream itself may be very fast, but the number of distinct items in the stream may be of manageable size. For example, a motivating application in [44] uses the domain of call frequencies of phone records, an application in which the number of distinct items is bounded above by the number of phone numbers of a particular phone company. With modern computers, it may even be possible to hold the frequency counts of a few million distinct phone numbers in a main memory array. In the event that main memory is not sufficient, many efficient

disk based index structures may be used to index and update frequency counts. We argue that many applications in the sketch based literature which attempt to find specific properties of the frequency counts (e.g., second moments, join size estimation, heavy hitters) may in fact be implemented trivially by using simple main memory data structures, and the ability to do this will only increase over time with hardware improvements. There are however a number of applications in which hardware considerations make the application of sketch based methods very useful. In our view, the most fruitful applications of sketch based methods lie in their recent application to the sensor network domain, in which in-network computation, storage and communication are greatly constrained by power and hardware considerations [22, 67, 68]. Many distributed applications such as those discussed in [9, 24, 70, 80] are particularly suited to this approach.

5. Histograms

Another key method for data summarization is that of histograms. In the method of histograms, we essentially divide the data along any attribute into a set of ranges, and maintain the count for each bucket. Thus, the space requirement is defined by the number of buckets in the histogram. A naive representation of a histogram would discretize the data into partitions of equal length (equi-width partitioning) and store the frequencies of these buckets. At this point, we point out a simple connection between the histogram representation and Haar wavelet coefficients. If we construct the wavelet representation of the frequency distribution of a data set along any dimension, then the (non-normalized) Haar coefficients of any order provide the difference in relative frequencies in equi-width histogram buckets. Haar coefficients of different orders correspond to buckets of different levels of granularity.

It is relatively easy to use the histogram for answering different kinds of queries such as range queries, since we only need to determine the set of buckets which lie within the user specified ranges [69, 81]. A number of strategies can be devised for improved query resolution from the histogram [69, 81, 84, 85].

The key source of inaccuracy in the use of histograms is that the distribution of the data points within a bucket is not retained, and is therefore assumed to be uniform. This causes inaccuracy because of extrapolation at the query boundaries, which typically contain only a fractional part of a histogram bucket. Thus, an important design consideration in the construction of histograms is the determination of how the buckets in the histogram should be designed. For example, if each range is divided into equi-width partitions, then the number of data points would be distributed very unequally across different buckets. If such buckets include the range boundary of a query, this may lead to inaccurate query estimations.

Therefore, a natural choice is to pick equi-depth buckets, in which each range contains an approximately equal number of points. In such cases, the maximum inaccuracy of a query is equal to twice the count in any bucket. However, in the case of a stream, the choice of ranges which would result in equi-depth partitions is not known a-priori. We note that the design of equi-depth buckets is exactly the problem of quantile estimation in data streams, since the equi-depth partitions define different quantiles in the data.

A different choice for histogram construction is that of minimizing the frequency variance of the different values within a bucket, so that the uniform distribution assumption is approximately held for queries. This minimizes the boundary error of extrapolation in a query. Thus, if a bucket B with count C(B) contains the frequency of l(B) elements, then the average frequency of each element in the bucket is C(B)/l(B). Let f1 . . . f_{l(B)} be the frequencies of the l(B) values within the bucket. Then, the total variance v(B) of the frequencies from the average is given by:

v(B) = Σ_{i=1}^{l(B)} (f_i − C(B)/l(B))²     (9.17)

Then, the total variance V across all buckets is given by the following:

V = Σ_B v(B)     (9.18)

Such histograms are referred to as V-Optimal histograms. A different way of looking at the V-optimal histogram is as a least squares fit to the frequency distribution in the data. Algorithms for V-Optimal histogram construction have been proposed in [60, 63]. We also note that the objective function to be optimized has the form of an Lp-difference function between two vectors whose cardinality is defined by the number of distinct values. In our earlier observations, we noted that sketches are particularly useful in tracking such aggregate functions. This is particularly useful in the multi-dimensional case, where the number of buckets can be very large as a result of the combination of a large number of dimensions. Therefore sketch-based methods can be used for the multi-dimensional case. We will discuss this in detail slightly later. We note that a number of other objective functions also exist for optimizing histogram construction [86]. For example, one can minimize the difference in the area between the original distribution and the corresponding histogram fit. Since the space requirement is dictated by the number of buckets, it is also desirable to minimize it. Therefore, the dual problem of minimizing the number of buckets, for a given threshold on the error, has been discussed in [63, 78].

One problem with the above definitions is that they use absolute errors in order to define the accuracy. It has been pointed out in [73] that the

use of absolute error may not always be a good representation of the error. Therefore, some methods for optimizing relative error have been proposed in [53]. While this method is quite efficient, it is not designed to be a data stream algorithm. Therefore, the design of relative error histogram construction for the stream case continues to be an open problem.

5.1 One Pass Construction of Equi-depth Histograms

In this section, we will develop algorithms for one-pass construction of equi-depth histograms. The simplest method for determination of the relevant quantiles in the data is that of sampling. In sampling, we simply compute the estimated quantile q(S) ∈ [0, 1] of the true quantile q ∈ [0, 1] on a random sample S of the data. Then, the Hoeffding inequality can be used to show that q(S) lies in the range (q − ǫ, q + ǫ) with probability at least 1 − δ, if the sample size |S| is chosen larger than O(log(1/δ)/ǫ²). Note that this sample size is a constant, and is independent of the size of the underlying data stream.

Let v be the value of the element at quantile q. Then the inclusion in S of an element with value less than v is a Bernoulli trial with success probability q. Then the expected number of elements less than v is q · |S|, and this number, as a fraction of |S|, lies in the interval (q − ǫ, q + ǫ) with probability at least 1 − 2 · e^{−2·|S|·ǫ²} (Hoeffding inequality). By picking a value of |S| = O(log(1/δ)/ǫ²), the corresponding results may be easily proved. A nice analysis of the effect of sample sizes on histogram construction may be found in [12]. In addition, methods for incremental histogram maintenance may be found in [42]. The O(log(1/δ)/ǫ²) space requirements have been tightened to O(log(1/δ)/ǫ) in a variety of ways. For example, the algorithms in [71, 72] discuss probabilistic algorithms for tightening this bound, whereas the method in [49] provides a deterministic algorithm for the same goal.
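
The sampling approach is easy to state in code, as in the sketch below (Python; the reservoir-sampling step, the constant inside the sample-size formula and the function name are our own illustrative choices):

import math
import random

def quantile_from_sample(stream, q, eps=0.05, delta=0.05, seed=0):
    # Estimate the q-quantile from a random sample whose size O(log(1/delta)/eps^2)
    # is independent of the stream length; the sample is drawn by reservoir sampling.
    size = math.ceil(math.log(1.0 / delta) / (2 * eps ** 2))
    rng = random.Random(seed)
    sample = []
    for n, x in enumerate(stream, start=1):
        if len(sample) < size:
            sample.append(x)
        else:
            j = rng.randrange(n)                 # classical reservoir sampling step
            if j < size:
                sample[j] = x
    sample.sort()
    rank = min(len(sample) - 1, int(q * len(sample)))
    return sample[rank]                          # the sample quantile q(S)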

5.2 Constructing V-Optimal Histograms

An interesting offline algorithm for constructing V-Optimal histograms has been discussed in [63]. The central idea in this approach is to set up a dynamic programming recursion in which the partition for the last bucket is determined. Let us consider a histogram drawn on the N ordered distinct values [1 . . . N]. Let Opt(k, N) be the error of the V-optimal histogram for the first N values and k buckets. Let Var(p, q) be the variance of the values indexed by p through q in (1 . . . N). Then, if the last bucket contains values r . . . N, then the error of the V-optimal histogram would be equal to the sum of the error of the (k − 1)-bucket V-optimal histogram for values up to r − 1, added to the error of the last bucket (which is simply the variance of the values indexed by r through N). Therefore, we have the following dynamic programming recursion:

Opt(k, N) = min_r [Opt(k − 1, r − 1) + Var(r, N)]     (9.19)

We note that there are O(N · k) entries for the set Opt(k, N), and each entry can be computed in O(N) time using the above dynamic programming recursion. Therefore, the total time complexity is O(N² · k).
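
The recursion translates directly into the offline O(N² · k) procedure sketched below (Python; the prefix-sum trick for evaluating Var(p, q) in constant time and the function name are our own illustrative choices):

def v_optimal_histogram(freqs, k):
    # Offline dynamic program: Opt(b, n) = min_r [ Opt(b-1, r-1) + Var(r, n) ].
    # Returns the minimum total variance and the start index of each bucket.
    N = len(freqs)
    s = [0.0] * (N + 1)
    s2 = [0.0] * (N + 1)
    for i, v in enumerate(freqs, 1):             # prefix sums of values and squared values
        s[i] = s[i - 1] + v
        s2[i] = s2[i - 1] + v * v

    def var(p, q):                               # sum of squared deviations of freqs[p-1 .. q-1]
        n = q - p + 1
        total = s[q] - s[p - 1]
        return (s2[q] - s2[p - 1]) - total * total / n

    INF = float("inf")
    opt = [[INF] * (N + 1) for _ in range(k + 1)]
    cut = [[0] * (N + 1) for _ in range(k + 1)]
    opt[0][0] = 0.0
    for b in range(1, k + 1):
        for n in range(b, N + 1):
            for r in range(b, n + 1):            # the last bucket covers values r .. n
                cand = opt[b - 1][r - 1] + var(r, n)
                if cand < opt[b][n]:
                    opt[b][n], cut[b][n] = cand, r
    bounds, n = [], N                            # recover the bucket boundaries
    for b in range(k, 0, -1):
        bounds.append(cut[b][n])
        n = cut[b][n] - 1
    return opt[k][N], bounds[::-1]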

While this is a neat approach for offline computation, it does not really apply to the data stream case because of the quadratic time complexity. In [54], a method has been proposed to construct (1 + ǫ)-optimal histograms in O(N · k² · log(N)/ǫ) time and O(k² · log(N)/ǫ) space. We note that the number of buckets k is typically small, and therefore the above time complexity is quite modest in practice. The central idea behind this approach is that the dynamic programming recursion of Equation 9.19 is the sum of a monotonically increasing and a monotonically decreasing function in r. This can be leveraged to reduce the amount of search in the dynamic programming recursion, if one is willing to settle for a (1 + ǫ)-approximation. Details may be found in [54]. Other algorithms for V-optimal histogram construction may be found in [47, 56, 57].

5.3 Wavelet Based Histograms for Query Answering

Wavelet based histograms are a useful tool for selectivity estimation, and were first proposed in [73]. In this approach, we construct the Haar wavelet decomposition on the cumulative distribution of the data. We note that for a dimension with N distinct values, this requires N wavelet coefficients. As is usually the case with wavelet decomposition, we retain the B Haar coefficients with the largest absolute (normalized) value. The cumulative distribution θ(b) at a given value b can be constructed as the sum of O(log(N)) coefficients on the error-tree. Then for a range query [a, b], we only need to compute θ(b) − θ(a).
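
The static version of this idea is illustrated below (Python with numpy; the transform helpers, the power-of-two length assumption and the use of a full inverse transform instead of the O(log N) error-tree traversal are our own simplifications for readability):

import numpy as np

def haar_transform(a):
    # Non-normalized Haar decomposition of a vector whose length is a power of two;
    # each detail coefficient is half the difference of the left and right halves.
    a = np.asarray(a, dtype=float)
    details = []
    while len(a) > 1:
        details.append((a[0::2] - a[1::2]) / 2.0)
        a = (a[0::2] + a[1::2]) / 2.0
    return np.concatenate([a] + details[::-1])

def inverse_haar(c):
    c = np.asarray(c, dtype=float)
    a, pos = c[:1], 1
    while pos < len(c):
        d = c[pos:pos + len(a)]
        nxt = np.empty(2 * len(a))
        nxt[0::2], nxt[1::2] = a + d, a - d
        a, pos = nxt, pos + len(d)
    return a

def range_count(freqs, B, lo, hi):
    # Keep the B largest-magnitude Haar coefficients of the cumulative distribution
    # (in practice the comparison uses normalized coefficient values) and answer the
    # range query [lo, hi] as theta(hi) - theta(lo - 1).
    theta_coeffs = haar_transform(np.cumsum(freqs))
    pruned = np.zeros_like(theta_coeffs)
    keep = np.argsort(-np.abs(theta_coeffs))[:B]
    pruned[keep] = theta_coeffs[keep]
    theta = inverse_haar(pruned)
    return theta[hi] - (theta[lo - 1] if lo > 0 else 0.0)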

In the case of data streams, we would like to have the ability to maintain the wavelet based histogram dynamically. In this case, we perform the maintenance with frequency distributions rather than cumulative distributions. We note that when a new data stream element x arrives, the frequency distribution along a given dimension gets updated. This can lead to the following kinds of changes in the maintained histogram:

Some of the wavelet coefficients may change and may need to be updated. An important observation here is that only the O(log(N)) wavelet coefficients whose ranges include x may need to be updated. We note that many of these coefficients may be small and may not be included in the histogram in the first place. Therefore, only those coefficients which are already included in the histogram need to be updated. For a coefficient including a range of length l = 2^q, we update it by adding or subtracting 1/l. We first update all the wavelet coefficients which are currently included in the histogram.

Some of the wavelet coefficients which are currently not included in the histogram may become large, and may therefore need to be added to it. Let c_min be the minimum value of any coefficient currently included in the histogram. For a wavelet coefficient with range l = 2^q, which is not currently included in the histogram, we add it to the histogram with probability 1/(l · c_min). The initial value of the coefficient is set to c_min.

The addition of new coefficients to the histogram will increase the total number of coefficients beyond the space constraint B. Therefore, after each addition, we delete the minimum coefficient in the histogram.

The correctness of the above method follows from the probabilistic counting results discussed in [31]. It has been shown in [74] that this probabilistic method for maintenance is effective in practice.

5.4 Sketch Based Methods for Multi-dimensional Histograms

Sketch based methods can also be used to construct V-optimal histograms in the multi-dimensional case [90]. This is a particularly useful application of sketches, since the number of possible buckets in the N^d space increases exponentially with d. Furthermore, the objective function to be optimized has the form of an L2-distance function over the different buckets. This can be approximated with the use of the Johnson-Lindenstrauss result [64].

We note that each d-dimensional vector can be sketched over the N^d-space using the same method as the AMS sketch. The only difference is that we are associating the 4-wise independent random variables with d-dimensional items. The Johnson-Lindenstrauss Lemma implies that the L2-distances in the sketched representation (optimized over O(b · d · log(N)/ǫ²) possibilities) are within a factor (1 + ǫ) of the L2-distances in the original representation for a b-bucket histogram.

Therefore, if we can pick the buckets so that the L2-distances are optimized in the sketched representation, this would continue to be true for the original representation within a factor (1 + ǫ). It turns out that a simple greedy algorithm is sufficient to achieve this. In this algorithm, we pick the buckets greedily, so that the L2 distances in the sketched representation are optimized in each step. It can be shown [90] that this simple approach provides a near optimal histogram with high probability.

6. Discussion and Challenges

In this chapter, we provided an overview of the different methods for synopsis construction in data streams. We discussed random sampling, wavelets, sketches and histograms. In addition, many techniques such as clustering can

also be used for synopsis construction. Some of these methods are discussed in more detail in a different chapter of this book. Many methods such as wavelets and histograms are closely related to one another. This chapter explores the basic methodology of each technique and the connections between different techniques. Many challenges for improving synopsis construction methods remain:

While many synopsis construction methods work effectively in individual scenarios, it is as yet unknown how well the different methods compare with one another. A thorough performance study needs to be conducted to understand the relative behavior of different synopsis methods. One important point to be kept in mind is that the “trusty-old” sampling method provides the most effective results in many practical situations, where space is not constrained by specialized hardware considerations (such as a distributed sensor network). This is especially true for multi-dimensional data sets with inter-attribute correlations, in which methods such as histograms and wavelets become increasingly ineffective. Sampling is however ineffective for counting measures which rely on the infrequent behavior of the underlying data set. Some examples are distinct element counting and join size estimation. Such a study may reveal the importance and robustness of different kinds of methods in a wide variety of scenarios.

A possible area of research is in the direction of designing workload aware synopsis construction methods [75, 78, 79]. While many methods for synopsis construction optimize average or worst-case performance, the real aim is to provide optimal results for typical workloads. This requires methods for modeling the workload as well as methods for leveraging these workloads for accurate solutions.

Most synopsis structures are designed in the context of quantitative or categorical data sets. It would be interesting to examine how synopsis methods can be extended to the case of different kinds of domains such as string, text or XML data. Some recent work in this direction has designed methods for XCluster synopsis or sketch synopsis for XML data [82, 83, 87].

Most methods for synopsis construction focus on construction of an optimal synopsis over the entire data stream. In many cases, data streams may evolve over time, as a result of which it may be desirable to construct an optimal synopsis over specific time windows. Furthermore, this window may not be known in advance. This problem may be quite challenging to solve in a space-efficient manner. A number of methods for maintaining exponential histograms and time-decaying stream aggregates [15, 48]

try to account for evolution of the data stream. Some recent work on biased reservoir sampling [4] tries to extend such an approach to sampling methods.

We believe that there is considerable scope for extension of the current synopsis methods to domains such as sensor mining, in which the hardware requirements force the use of space-optimal synopses. However, the objective of constructing a given synopsis needs to be carefully calibrated in order to take the specific hardware requirements into account. While the broad theoretical foundations of this field are now in place, it remains to carefully examine how these methods may be leveraged for applications with different kinds of hardware, computational power, or space constraints.

References

[1] Aggarwal C., Han J., Wang J., Yu P. (2003) A Framework for Clustering Evolving Data Streams. VLDB Conference.

[2] Aggarwal C., Han J., Wang J., Yu P. (2004) On-Demand Classification of Data Streams. ACM KDD Conference.

[3] Aggarwal C. (2006) On Futuristic Query Processing in Data Streams. EDBT Conference.

[4] Aggarwal C. (2006) On Biased Reservoir Sampling in the Presence of Stream Evolution. VLDB Conference.

[5] Alon N., Gibbons P., Matias Y., Szegedy M. (1999) Tracking Joins and Self-Joins in Limited Storage. ACM PODS Conference.

[6] Alon N., Matias Y., Szegedy M. (1996) The Space Complexity of Approximating the Frequency Moments. ACM Symposium on Theory of Computing, pp. 20–29.

[7] Arasu A., Manku G. S. (2004) Approximate quantiles and frequency counts over sliding windows. ACM PODS Conference.

[8] Babcock B., Datar M., Motwani R. (2002) Sampling from a Moving Window over Streaming Data. ACM SIAM Symposium on Discrete Algorithms.

[9] Babcock B., Olston C. (2003) Distributed Top-K Monitoring. ACM SIGMOD Conference.

[10] Bulut A., Singh A. (2003) Hierarchical Stream Summarization in Large Networks. ICDE Conference.

[11] Chakrabarti K., Garofalakis M., Rastogi R., Shim K. (2001) Approximate Query Processing with Wavelets. VLDB Journal, 10(2-3), pp. 199–223.

[12] Chaudhuri S., Motwani R., Narasayya V. (1998) Random Sampling for Histogram Construction: How much is enough? ACM SIGMOD Conference.

[13] Charikar M., Chen K., Farach-Colton M. (2002) Finding Frequent Items in Data Streams. ICALP.

[14] Chernoff H. (1952) A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23:493–507.

[15] Cohen E., Strauss M. (2003) Maintaining Time Decaying Stream Aggregates. ACM PODS Conference.

[16] Cormode G., Garofalakis M., Sacharidis D. (2006) Fast Approximate Wavelet Tracking on Streams. EDBT Conference.

[17] Cormode G., Datar M., Indyk P., Muthukrishnan S. (2002) Comparing Data Streams using Hamming Norms. VLDB Conference.

[18] Cormode G., Muthukrishnan S. (2003) What’s hot and what’s not: Tracking most frequent items dynamically. ACM PODS Conference.

[19] Cormode G., Muthukrishnan S. (2004) What’s new: Finding significant differences in network data streams. IEEE Infocom.

[20] Cormode G., Muthukrishnan S. (2004) An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. LATIN, pp. 29–38.

[21] Cormode G., Muthukrishnan S. (2004) Diamond in the Rough: Finding Hierarchical Heavy Hitters in Data Streams. ACM SIGMOD Conference.

[22] Cormode G., Garofalakis M. (2005) Sketching Streams Through the Net: Distributed Approximate Query Tracking. VLDB Conference.

[23] Cormode G., Muthukrishnan S., Rozenbaum I. (2005) Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. VLDB Conference.

[24] Das A., Ganguly S., Garofalakis M., Rastogi R. (2004) Distributed Set-Expression Cardinality Estimation. VLDB Conference.

[25] Deligiannakis A., Roussopoulos N. (2003) Extended Wavelets for Multiple Measures. ACM SIGMOD Conference.

[26] Dobra A., Garofalakis M., Gehrke J., Rastogi R. (2002) Processing complex aggregate queries over data streams. ACM SIGMOD Conference.

[27] Dobra A., Garofalakis M., Gehrke J., Rastogi R. (2004) Sketch-Based Multi-query Processing over Data Streams. EDBT Conference.

[28] Domingos P., Hulten G. (2000) Mining Time Changing Data Streams. ACM KDD Conference.

[29] Estan C., Varghese G. (2002) New Directions in Traffic Measurement and Accounting. ACM SIGCOMM, 32(4), Computer Communication Review.

[30] Fang M., Shivakumar N., Garcia-Molina H., Motwani R., Ullman J. (1998) Computing Iceberg Cubes Efficiently. VLDB Conference.

[31] Flajolet P., Martin G. N. (1985) Probabilistic Counting for Database Applications. Journal of Computer and System Sciences, 31(2), pp. 182–209.

[32] Feigenbaum J., Kannan S., Strauss M., Viswanathan M. (1999) An Approximate L1-difference algorithm for massive data streams. FOCS Conference.

[33] Fong J., Strauss M. (2000) An Approximate Lp-difference algorithm for massive data streams. STACS Conference.

[34] Ganguly S., Garofalakis M., Rastogi R. (2004) Processing Data Stream Join Aggregates using Skimmed Sketches. EDBT Conference.

[35] Ganguly S., Garofalakis M., Rastogi R. (2003) Processing set expressions over continuous Update Streams. ACM SIGMOD Conference.

[36] Ganguly S., Garofalakis M., Kumar N., Rastogi R. (2005) Join-Distinct Aggregate Estimation over Update Streams. ACM PODS Conference.

[37] Garofalakis M., Gehrke J., Rastogi R. (2002) Querying and mining data streams: you only get one look (a tutorial). ACM SIGMOD Conference.

[38] Garofalakis M., Gibbons P. (2002) Wavelet synopses with error guarantees. ACM SIGMOD Conference.

[39] Garofalakis M., Kumar A. (2004) Deterministic Wavelet Thresholding with Maximum Error Metrics. ACM PODS Conference.

[40] Gehrke J., Korn F., Srivastava D. (2001) On Computing Correlated Aggregates Over Continual Data Streams. ACM SIGMOD Conference.

[41] Gibbons P., Matias Y. (1998) New Sampling-Based Summary Statistics for Improving Approximate Query Answers. ACM SIGMOD Conference Proceedings.

[42] Gibbons P., Matias Y., Poosala V. (1997) Fast Incremental Maintenance of Approximate Histograms. VLDB Conference.

[43] Gibbons P. (2001) Distinct sampling for highly accurate answers to distinct value queries and event reports. VLDB Conference.

[44] Gilbert A., Kotidis Y., Muthukrishnan S., Strauss M. (2001) Surfing Wavelets on Streams: One Pass Summaries for Approximate Aggregate Queries. VLDB Conference.

[45] Gilbert A., Kotidis Y., Muthukrishnan S., Strauss M. (2003) One-pass wavelet decompositions of data streams. IEEE TKDE, 15(3), pp. 541–554. (Extended version of [44])

[46] Gilbert A., Kotidis Y., Muthukrishnan S., Strauss M. (2002) How to summarize the universe: Dynamic Maintenance of quantiles. VLDB Conference.

[47] Gilbert A., Guha S., Indyk P., Kotidis Y., Muthukrishnan S., Strauss M. (2002) Fast small-space algorithms for approximate histogram maintenance. ACM STOC Conference.


[48] Gionis A., Datar M., Indyk P., Motwani R. (2002) Maintaining Stream Statistics over Sliding Windows. SODA Conference.

[49] Greenwald M., Khanna S. (2001) Space Efficient Online Computation of Quantile Summaries. ACM SIGMOD Conference.

[50] Greenwald M., Khanna S. (2004) Power-Conserving Computation of Order-Statistics over Sensor Networks. ACM PODS Conference.

[51] Guha S. (2005) Space efficiency in Synopsis construction algorithms. VLDB Conference.

[52] Guha S., Kim C., Shim K. (2004) XWAVE: Approximate Extended Wavelets for Streaming Data. VLDB Conference.

[53] Guha S., Shim K., Woo J. (2004) REHIST: Relative Error Histogram Construction algorithms. VLDB Conference.

[54] Guha S., Koudas N., Shim K. (2001) Data-Streams and Histograms. ACM STOC Conference.

[55] Guha S., Harb B. (2005) Wavelet Synopses for Data Streams: Minimizing Non-Euclidean Error. ACM KDD Conference.

[56] Guha S., Koudas N. (2002) Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation. ICDE Conference.

[57] Guha S., Indyk P., Muthukrishnan S., Strauss M. (2002) Histogramming data streams with fast per-item processing. Proceedings of ICALP.

[58] Hellerstein J., Haas P., Wang H. (1997) Online Aggregation. ACM SIGMOD Conference.

[59] Ioannidis Y., Poosala V. (1999) Histogram-Based Approximation of Set-Valued Query-Answers. VLDB Conference.

[60] Ioannidis Y., Poosala V. (1995) Balancing Histogram Optimality and Practicality for Query Set Size Estimation. ACM SIGMOD Conference.

[61] Indyk P., Koudas N., Muthukrishnan S. (2000) Identifying Representative Trends in Massive Time Series Data Sets Using Sketches. VLDB Conference.

[62] Indyk P. (2000) Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream Computation. IEEE FOCS.

[63] Jagadish H., Koudas N., Muthukrishnan S., Poosala V., Sevcik K., and Suel T. (1998) Optimal Histograms with Quality Guarantees. VLDB Conference.

[64] Johnson W., Lindenstrauss J. (1984) Extensions of Lipschitz mapping into Hilbert space. Contemporary Mathematics, Vol 26, pp. 189–206.

[65] Karras P., Mamoulis N. (2005) One-pass wavelet synopses for maximum error metrics. VLDB Conference.


[66] Keim D. A., Heczko M. (2001) Wavelets and their Applications in Databases. ICDE Conference.

[67] Kempe D., Dobra A., Gehrke J. (2004) Gossip Based Computation of Aggregate Information. ACM PODS Conference.

[68] Kollios G., Byers J., Considine J., Hadjielefttheriou M., Li F. (2005) Robust Aggregation in Sensor Networks. IEEE Data Engineering Bulletin.

[69] Kooi R. (1980) The optimization of queries in relational databases. Ph.D. Thesis, Case Western Reserve University.

[70] Manjhi A., Shkapenyuk V., Dhamdhere K., Olston C. (2005) Finding (recently) frequent items in distributed data streams. ICDE Conference.

[71] Manku G., Rajagopalan S., Lindsay B. (1998) Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD Conference.

[72] Manku G., Rajagopalan S., Lindsay B. (1999) Random Sampling for Space Efficient Computation of order statistics in large datasets. ACM SIGMOD Conference.

[73] Matias Y., Vitter J. S., Wang M. (1998) Wavelet-based histograms for selectivity estimation. ACM SIGMOD Conference.

[74] Matias Y., Vitter J. S., Wang M. (2000) Dynamic Maintenance of Wavelet-based histograms. VLDB Conference.

[75] Matias Y., Urieli D. (2005) Optimal workload-based wavelet synopsis. ICDT Conference.

[76] Manku G., Motwani R. (2002) Approximate Frequency Counts over Data Streams. VLDB Conference.

[77] Muthukrishnan S. (2004) Workload Optimal Wavelet Synopses. DIMACS Technical Report.

[78] Muthukrishnan S., Poosala V., Suel T. (1999) On Rectangular Partitioning in Two Dimensions: Algorithms, Complexity and Applications. ICDT Conference.

[79] Muthukrishnan S., Strauss M., Zheng X. (2005) Workload-Optimal Histograms on Streams. Annual European Symposium, Proceedings in Lecture Notes in Computer Science, 3669, pp. 734–745.

[80] Olston C., Jiang J., Widom J. (2003) Adaptive Filters for Continuous Queries over Distributed Data Streams. ACM SIGMOD Conference.

[81] Piatetsky-Shapiro G., Connell C. (1984) Accurate Estimation of the number of tuples satisfying a condition. ACM SIGMOD Conference.

[82] Polyzotis N., Garofalakis M. (2002) Structure and Value Synopsis for XML Data Graphs. VLDB Conference.


[83] Polyzotis N., Garofalakis M. (2006) XCluster Synopses for Structured XML Content. IEEE ICDE Conference.

[84] Poosala V., Ganti V., Ioannidis Y. (1999) Approximate Query Answering using Histograms. IEEE Data Eng. Bull.

[85] Poosala V., Ioannidis Y., Haas P., Shekita E. (1996) Improved Histograms for Selectivity Estimation of Range Predicates. ACM SIGMOD Conference.

[86] Poosala V., Ioannidis Y. (1997) Selectivity Estimation without the Attribute Value Independence assumption. VLDB Conference.

[87] Rao P., Moon B. (2006) SketchTree: Approximate Tree Pattern Counts over Streaming Labeled Trees. ICDE Conference.

[88] Schweller R., Gupta A., Parsons E., Chen Y. (2004) Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams. Internet Measurement Conference Proceedings.

[89] Stolnitz E. J., Derose T., Salesin T. (1996) Wavelets for computer graphics: theory and applications, Morgan Kaufmann.

[90] Thaper N., Indyk P., Guha S., Koudas N. (2002) Dynamic Multi-dimensional Histograms. ACM SIGMOD Conference.

[91] Thomas D. (2006) Personal Communication.

[92] Vitter J. S. (1985) Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11(1), pp. 37–57.

[93] Vitter J. S., Wang M. (1999) Approximate Computation of Multi-dimensional Aggregates of Sparse Data Using Wavelets. ACM SIGMOD Conference.


Chapter 10

A SURVEY OF JOIN PROCESSING IN DATA STREAMS

Junyi Xie and Jun Yang
Department of Computer Science
Duke University

junyi,[email protected]

1. Introduction

Given the fundamental role played by joins in querying relational databases, it is not surprising that stream join has also been the focus of much research on streams. Recall that the relational (theta) join between two non-streaming relations R1 and R2, denoted R1 ⊲⊳θ R2, returns the set of all pairs 〈r1, r2〉, where r1 ∈ R1, r2 ∈ R2, and the join condition θ(r1, r2) evaluates to true. A straightforward extension of join to streams gives the following semantics (in rough terms): At any time t, the set of output tuples generated thus far by the join between two streams S1 and S2 should be the same as the result of the relational (non-streaming) join between the sets of input tuples that have arrived thus far in S1 and S2.

Stream join is a fundamental operation for relating information from different streams. For example, given two streams of packets seen by network monitors placed at two routers, we can join the streams on packet ids to identify those packets that flowed through both routers, and compute the time it took for each such packet to reach the other router. As another example, an online auction system may generate two event streams: One signals opening of auctions and the other contains bids on the open auctions. A stream join is needed to relate bids with the corresponding open-auction events. As a third example, which involves a non-equality join, consider two data streams that arise in monitoring a cluster machine room, where one stream contains load information collected from different machines, and the other stream contains temperature readings from various sensors in the room. Using a stream join, we can look for possible correlations between loads on machines and temperatures at different locations in the machine room. In this case, we need to relate temperature readings and load data with close, but not necessarily identical, spatio-temporal coordinates.

What makes stream join so special to warrant new approaches different from conventional join processing? In the stream setting, input tuples arrive continuously, and result tuples need to be produced continuously as well. We cannot assume that the input data is already stored or indexed, or that the input rate can be controlled by the query plan. Standard join algorithms that use blocking operations, e.g., sorting, no longer work. Conventional methods for cost estimation and query optimization are also inappropriate, because they assume finite input. Moreover, the long-running nature of stream queries calls for more adaptive processing strategies that can react to changes and fluctuations in data and stream characteristics. The "stateful" nature of stream joins adds another dimension to the challenge. In general, in order to compute the complete result of a stream join, we need to retain all past arrivals as part of the processing state, because a new tuple may join with an arbitrarily old tuple arrived in the past. This problem is exacerbated by unbounded input streams, limited processing resources, and high performance requirements, as it is impossible in the long run to keep all past history in fast memory.

This chapter provides an overview of research problems, recent advances, and future research directions in stream join processing. We start by elucidating the model and semantics for stream joins in Section 2. Section 3 focuses on join state management—the important problem of how to cope with large and potentially unbounded join state given limited memory. Section 4 covers fundamental algorithms for stream join processing. Section 5 discusses aspects of stream join optimization, including objectives and techniques for optimizing multi-way joins. We conclude the chapter in Section 6 by pointing out several related research areas and proposing some directions for future research.

2. Model and Semantics

Basic Model and Semantics. A stream is an unbounded sequence of stream tuples of the form 〈s, t〉 ordered by t, where s is a relational tuple and t is the timestamp of the stream tuple. Following a "reductionist" approach, we conceptually regard the (unwindowed) stream join between streams S1 and S2 to be a view defined as the (bag) relational join between two append-only bags S1 and S2. Whenever new tuples arrive in S1 or S2, the view must be updated accordingly. Since relational join is monotonic, insertions into S1 and S2 can result only in possible insertions into the view. The sequence of resulting insertions into the view constitutes the output stream of the stream join between S1 and S2. The timestamp of an output tuple is the time at which the insertion should be reflected in the view, i.e., the larger of the timestamps of the two input tuples.


Alternatively, we can describe the same semantics operationally as follows: To compute the stream join between S1 and S2, we maintain a join state containing all tuples received so far from S1 (which we call S1's join state) and those from S2 (which we call S2's join state). For each new tuple s1 arriving in S1, we record s1 in S1's join state, probe S2's join state for tuples joining with s1, and output the join result tuples. New tuples arriving in S2 are processed in a symmetrical fashion.
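
The operational description above can be captured in a few lines. The following is a minimal sketch of that insert-then-probe loop, not any particular system's implementation; the tuple layout and the generic join predicate theta are assumptions made for illustration.

```python
def stream_join(arrivals, theta):
    """Sketch of the operational semantics of an unwindowed stream join.
    arrivals is a time-ordered list of (stream_id, tuple); each new tuple is
    added to its own stream's join state and probed against the partner's
    entire state, emitting every pair satisfying the join condition theta."""
    state = {1: [], 2: []}
    output = []
    for stream_id, new_tuple in arrivals:
        partner = 2 if stream_id == 1 else 1
        state[stream_id].append(new_tuple)          # record the new arrival
        for old_tuple in state[partner]:            # probe the partner's state
            pair = (new_tuple, old_tuple) if stream_id == 1 else (old_tuple, new_tuple)
            if theta(*pair):
                output.append(pair)
    return output

# Equijoin on packet id for the two router-monitoring streams of the example.
arrivals = [(1, ("p1", 10)), (2, ("p1", 12)), (1, ("p2", 11)), (2, ("p3", 15))]
print(stream_join(arrivals, lambda r1, r2: r1[0] == r2[0]))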

Semantics of Sliding-Window Joins. An obvious issue with unwindowed stream joins is that the join state is unbounded and will eventually outgrow the memory and storage capacity of the stream processing system. One possibility is to restrict the scope of the join to a recent window, resulting in a sliding-window stream join. For binary joins, we call the two input streams the partner stream of each other. Operationally, a time-based sliding window of duration w on stream S restricts each new partner stream tuple to join only with S tuples that arrived within the last w time units. A tuple-based sliding window of size k restricts each new partner stream tuple to join only with the last k tuples arrived in S. Both types of windows "slide" forward, as time advances or new stream tuples arrive, respectively. The sliding-window semantics enables us to purge from the join state any tuple that has fallen out of the current window, because future arrivals in the partner stream cannot possibly join with them.

Continuous Query Language, or CQL for short [2], gives the semantics of a sliding-window stream join by regarding it as a relational join view over the sliding windows, each of which contains the bag of tuples in the current window of the respective stream. New stream tuples are treated as insertions into the windows, while old tuples that fall out of the windows are treated as deletions. The resulting sequence of updates on the join view constitutes the output stream of the stream join. Note that deletions from the windows can result in deletions from the view. Therefore, sliding-window stream joins are not monotonic. The presence of deletions in the output stream does complicate semantics considerably. Fortunately, in many situations users may not care about these deletions at all, and CQL provides an Istream operator for removing them from the output stream. For a time-based sliding-window join, even if we do not want to ignore deletions in the output stream, it is easy to infer when an old output tuple needs to be deleted by examining the timestamps of the input tuples that generated it. For this reason, time-based sliding-window join under the CQL semantics is classified as a weak non-monotonic operator by Golab and Özsu [24]. However, for a tuple-based sliding-window join, how to infer deletions in the output stream timely and efficiently without relying on explicitly generated "negative tuples" still remains an open question [24].

There is an alternative definition of sliding-window stream joins that does not introduce non-monotonicity. For a time-based sliding-window join with duration w, we simply regard the stream join between S1 and S2 as a relational join view over append-only bags S1 and S2 with an extra "window join condition": −w ≤ S1.t − S2.t ≤ w. As in the case of an unwindowed stream join, the output stream is simply the sequence of updates on the view resulting from the insertions into S1 and S2. Despite the extra window join condition, the join remains monotonic; deletions never arise in the output stream because S1 and S2 are append-only. This definition of time-based sliding-window join has been used by some, e.g., [10, 27]. It is also possible to define a tuple-based sliding-window join as a monotonic view over append-only bags (with the help of an extra attribute that records the sequence number for each tuple in an input stream), though the definition is more convoluted. This alternative semantics yields the same sequence of insertions as the CQL semantics. In the remainder of this chapter, we shall assume this semantics and ignore the issue of deletions in the output stream.

Relaxations and Variations of the Standard Semantics. The semantics of stream joins above requires the output sequence to reflect the complete sequence of states of the underlying view, in the exact same order. In some settings this requirement is relaxed. For example, the stream join algorithms in [27] may generate output tuples slightly out of order. The XJoin-family of algorithms (e.g., [41, 33, 38]) relaxes the single-pass stream processing model and allows some tuples to be spilled out from memory and onto disk to be processed later, which means that output tuples may be generated out of order. In any case, the correct output order can be reconstructed from the tuple timestamps. Besides relaxing the requirement on output ordering, there are also variations of sliding windows that offer explicit control over what states of the view can be ignored. For example, with the "jumping window" semantics [22], we divide the sliding window into a number of sub-windows; when the newest sub-window fills up, it is appended to the sliding window while the oldest sub-window in the sliding window is removed, and then the query is re-evaluated. This semantics induces a window that is "jumping" periodically instead of sliding gradually.

Semantics of Joins between Streams and Database Relations. Joins between streams and time-varying database relations have also been considered [2, 24]. Golab and Özsu [24] proposed a non-retroactive relation semantics, where each stream tuple joins only with the state of the time-varying database relation at the time of its arrival. Consequently, an update on the database relation does not retroactively apply to previously generated output tuples. This semantics is also supported by CQL [2], where the query can be interpreted as a join between the database relation and a zero-duration sliding window over the stream containing only those tuples arriving at the current time. We shall assume this semantics in our later discussion on joining streams and database relations.

3. State Management for Stream Joins

In this section, we turn specifically to the problem of state management for stream joins. As discussed earlier, join is a stateful operator; without the sliding-window semantics, computing the complete result of a stream join generally requires keeping unbounded state to remember all past tuples [1]. The question is: What is the most effective use of the limited memory resource? How do we decide what part of the join state to keep and what to discard? Can we mitigate the problem by identifying and purging "useless" parts of the join state without affecting the completeness of the result? When we run out of memory and are no longer able to produce the complete result, how do we then measure the "error" in an incomplete result, and how do we manage the join state in a way to minimize this error?

Join state management is also relevant even for sliding-window joins, where the join state is bounded by the size of the sliding windows. Sometimes, sliding windows may be quite large, and any further reduction of the join state is welcome because memory is often a scarce resource in stream processing systems. Moreover, if we consider a more general stream processing model where streams are processed not just in fast main memory but instead in a memory hierarchy involving smaller, faster caches as well as larger, slower disks, join state management generalizes into the problem of deciding how to ferry data up and down the memory hierarchy to maximize processing efficiency.

One effective approach towards join state management is to exploit "hard" constraints in the input streams to reduce state. For example, we might know that for a stream, the join attribute is a key, or the value of the join attribute always increases over time. Through reasoning with these constraints and the join condition, we can sometimes infer that certain tuples in the join state cannot contribute to any future output tuples. Such tuples can then be purged from the join state without compromising result completeness. In Section 3.1, we examine two techniques that generalize constraints in the stream setting and use them for join state reduction.

Another approach is to exploit statistical properties of the input streams, which can be seen as "soft" constraints, to help make join state management decisions. For example, we might know (or have observed) that the frequency of each join attribute value is stable over time, or that the join attribute values in a stream can be modeled by some stochastic process, e.g., random walk. Such knowledge allows us to estimate the benefit of keeping a tuple in the join state (for example, as measured by how many output tuples it is expected to generate over a period of time). Because of the stochastic nature of such knowledge, we usually cannot guarantee result completeness. However, this approach can be used to minimize the expected error in the incomplete result, or to optimize the organization of the join state in a memory hierarchy to maximize performance. We discuss this approach in Section 3.2.

3.1 Exploiting Constraints

k-Constraints. Babu et al. [7] introduced k-constraints for join state reduction. The parameter k is an adherence parameter that specifies how closely a stream adheres to the constraint. As an example of k-constraints, consider first a "strict" ordered-arrival constraint on stream S, which requires that the join attribute values of S tuples never decrease over time. In a network monitoring application, a stream of TCP/IP packets transmitted from a source to a destination should arrive in the order of their source timestamps (denoted by ts to distinguish them from the tuple timestamps t). However, suppose that for efficiency, we instead use UDP, a less reliable protocol with no guarantee on the delivery order. Nonetheless, if we can bound the extent of packet reordering that occurs in practice, we can relax the constraint into an ordered-arrival k-constraint: For any tuple s, a tuple s′ with an earlier source timestamp (i.e., s′.ts < s.ts) must arrive as or before the k-th tuple following s. A smaller k implies a tighter constraint; a constraint with k = 0 becomes strict.

To see how k-constraints can be used for join state reduction, suppose we join the packet stream S in the above example with another stream S′ using the condition |S.ts − S′.ts| ≤ 10. Without any constraint on S.ts, we must remember all S′ tuples in the join state, because any future S tuple could arrive with a joining ts value. With the ordered-arrival k-constraint on S.ts, however, we can purge a tuple s′ ∈ S′ from the join state as soon as k tuples have arrived in S following some S tuple with ts > s′.ts + 10. The reason is that the k-constraint guarantees any subsequent S tuples will have source timestamps strictly greater than s′.ts + 10 and therefore not join with s′. Other k-constraints considered by [7] include generalizations of referential integrity constraints and clustered-arrival constraints.
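
As an illustration of this purging rule, the sketch below tracks, for each arriving S tuple, how many later S tuples have been seen, and purges S′ tuples once the k-constraint guarantees they can no longer join. It is only a toy rendering of the idea in [7]; the class name, the representation of tuples by their ts values, and the window-width parameter w are assumptions of this sketch.

```python
from collections import deque

class KConstraintPurger:
    """Sketch of join-state purging for S' under an ordered-arrival
    k-constraint on S.ts, for the join condition |S.ts - S'.ts| <= w.
    Only state management is shown; the actual probing/joining is omitted."""

    def __init__(self, k, w):
        self.k = k                  # adherence parameter of the k-constraint
        self.w = w                  # join-condition width (10 in the text's example)
        self.s_prime_state = []     # retained S' tuples (represented by their ts)
        self.pending = deque()      # [threshold_ts, follow_up_count] purge triggers

    def on_s_arrival(self, s_ts):
        # This arrival counts as a "following" tuple for every earlier S tuple.
        for trigger in self.pending:
            trigger[1] += 1
        # Once k further S tuples follow this one, any s' with s'.ts < s_ts - w
        # is guaranteed never to join again and can be purged.
        self.pending.append([s_ts, 0])
        while self.pending and self.pending[0][1] >= self.k:
            threshold, _ = self.pending.popleft()
            self.s_prime_state = [ts for ts in self.s_prime_state
                                  if ts >= threshold - self.w]

    def on_s_prime_arrival(self, s_prime_ts):
        self.s_prime_state.append(s_prime_ts)
```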

Although k-constraints provide some "slack" through the adherence parameter k, strictly speaking they are still hard constraints in that we assume the conditions must hold strictly after k arrivals. Babu et al. also developed techniques for monitoring streams for k-constraints and determining the value of k at runtime. Interestingly, k-constraints with dynamically observed k become necessarily soft in nature: They can assert that the constraints hold with high probability, but cannot guarantee them with absolute certainty.

Punctuations. In contrast to k-constraints, whose forms are known a priori, punctuations, introduced by Tucker et al. [40], are constraints that are dynamically inserted into a stream. Specifically, a punctuation is a tuple of patterns specifying a predicate that must evaluate to false for all future data tuples in the stream. For example, consider an auction system with two streams: Auction(id, info, t) generates a tuple at the opening of each auction (with a unique auction id), and Bid(auction id, price, t) contains bids for open auctions. When an auction with id ai closes, the system inserts a punctuation 〈ai, ∗〉 into the Bid stream to signal that there will be no more bids for auction ai. Also, since auction ids are unique, following the opening of every auction aj, the system can also insert a punctuation 〈aj, ∗〉 into Auction to signal that there will be no other auctions with the same id.

Ding et al. [17] developed a stream join algorithm called PJoin to exploit punctuations. When a punctuation arrives in a stream, PJoin examines the join state of the partner stream and purges those tuples that cannot possibly join with future arrivals. For example, upon the arrival of a punctuation 〈ai, ∗〉 in Bid, we can purge any Auction tuples in the join state with id ai (provided that they have already been processed for join with all past Bid tuples). PJoin also propagates punctuations to the output stream. For example, after receiving 〈ai, ∗〉 from both input streams, we can propagate 〈ai, ∗, ∗〉 to the output, because we are sure that no more output tuples with ai can be generated. Punctuation propagation is important because propagated punctuations can be further exploited by downstream operators that receive the join output stream as their input. Ding and Rundensteiner [18] further extended their join algorithm to work with sliding windows, which allow punctuations to be propagated more quickly. For example, suppose that we set the sliding window to 24 hours, and 24 hours have passed after we saw the punctuation 〈ai, ∗〉 from Auction. Even if we might not have seen 〈ai, ∗〉 yet from Bid, in this case we can still propagate 〈ai, ∗, ∗〉 to the output, because future Bid tuples cannot join with an Auction tuple that has already fallen outside the sliding window.
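
A rough sketch of this kind of punctuation handling for the Auction/Bid example is shown below. It is not PJoin's actual code; the data structures, method names, and the emit_punct callback are assumptions made to keep the example self-contained.

```python
class PunctuationJoinState:
    """Minimal sketch of punctuation-driven purging and propagation for a
    join between the Auction and Bid streams of the example."""

    def __init__(self):
        self.auction_state = {}        # auction id -> retained Auction tuple
        self.bid_state = {}            # auction id -> list of retained Bid tuples
        self.punct_auction = set()     # ids punctuated on the Auction stream
        self.punct_bid = set()         # ids punctuated on the Bid stream

    def on_bid_punctuation(self, auction_id, emit_punct):
        # No more bids for this auction: purge the matching Auction tuple
        # (assumed to have joined with all past Bid tuples already).
        self.auction_state.pop(auction_id, None)
        self.punct_bid.add(auction_id)
        if auction_id in self.punct_auction:
            emit_punct(auction_id)     # propagate <ai, *, *> to the output

    def on_auction_punctuation(self, auction_id, emit_punct):
        # No more Auction tuples with this id: purge retained Bid tuples.
        self.bid_state.pop(auction_id, None)
        self.punct_auction.add(auction_id)
        if auction_id in self.punct_bid:
            emit_punct(auction_id)
```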

While punctuations are more flexible and generally more expressive than k-constraints, they do introduce some processing overhead. Besides the overhead of generating, processing, and propagating punctuations, we note that some past punctuations need to be retained as part of the join state, thereby consuming more memory. For stream joins, past punctuations cannot be purged until we can propagate them, so it is possible to accumulate many punctuations. Also, not all punctuations are equally effective in join state reduction, and their effectiveness may vary for different join conditions. We believe that further research on the trade-off between the cost and the benefit of punctuations is needed, and that managing the "punctuation state" poses an interesting problem parallel to join state management itself.


3.2 Exploiting Statistical Properties

Strictly speaking, both k-constraints and punctuations are hard constraints. Now, we explore how to exploit "soft" constraints, or statistical properties of input streams, in join state management. Compared with hard constraints, these soft constraints can convey more information relevant to join state management. For example, consider again the UDP packet stream discussed in Section 3.1. Depending on the characteristics of the communication network, k may need to be very large for the ordered-arrival k-constraint to hold. However, it may turn out that 99% of the time the extent of packet reordering is limited to a much smaller k′, and that 80% of the time reordering is limited to an even smaller k′′. Soft, statistical constraints are better at capturing these properties and enabling optimization based on common cases rather than the worst case.

Given a limited amount of memory to hold the join state, for each incoming stream tuple, we need to make a decision—not unlike a cache replacement decision—about whether to discard the new tuple (after joining it with the partner stream tuples in the join state) or to retain it in the join state; in the latter case, we also need to decide which old tuple to discard from the join state to make space for the new one. In the following, we shall use the term "cache" to refer to the memory available for keeping the join state.

Before we proceed, we need to discuss how to evaluate a join state management strategy. There are two major perspectives, depending on the purpose of join state management. The first perspective assumes the single-pass stream processing model where output tuples can be produced only from the part of the join state that we choose to retain in cache. In this case, our goal is to minimize the error in (or to maximize the quality of) the output stream compared with the complete result. A number of popular measures have been defined from this perspective:

Max-subset. This measure, introduced by Das et al. [15], aims at producing as many output tuples as possible. (Note that any reasonable stream join algorithm would never produce any incorrect output tuples, so we can ignore the issue of false positives.) Because input streams are unbounded, we cannot compare two strategies simply by comparing the total numbers of output tuples they produce—both may be infinite. The approach taken by Srivastava and Widom [37] is to consider the ratio between the number of output tuples produced up to some time t and the number of tuples in the complete result up to t. Then, a reasonable goal is to maximize this ratio as t tends to infinity.

Sampling rate. Like max-subset, this measure aims at producing as many output tuples as possible, but with the additional requirement that the set of output tuples constitutes a uniform random sample of the complete join result. Thus, the goal is to maximize the sampling rate. This measure is first considered in the stream join setting by [37].

Application-defined importance. This measure is based on the notion of importance specific to application needs. For example, Aurora [39] allows applications to define value-based quality-of-service functions that specify the utilities of output tuples based on their attribute values. The goal in this case is to maximize the utility of the join result.

The second perspective targets expected performance rather than result completeness. This perspective relaxes the single-pass processing model by allowing tuples to be spilled out from memory and onto disk to be processed later in "mop-up" phases. Assuming that we still produce the complete answer, our goal is to minimize the total processing cost of the online and the mop-up phases. One measure defined from this perspective is the archive metric proposed by [15]. This measure has also been used implicitly by the XJoin-family of algorithms ([41, 33, 38], etc.). As it is usually more expensive to process tuples that have been spilled out to disk, a reasonable approximation is to try to leave as little work as possible to the mop-up phases; this goal is roughly compatible with max-subset's objective of getting as much as possible done online.

In the remainder of this section, we focus first and mostly on the max-subset measure. Besides being a reasonable measure in its own right, techniques developed for max-subset are roughly in line with the archive metric, and can be generalized to certain application-defined importance measures through appropriate weighting. Next, we discuss the connection between classic caching and join state management, and state management for joins between streams and database relations. Finally, we briefly discuss the sampling-rate measure towards the end of this section.

Max-Subset Measure. Assuming perfect knowledge of the future arrivals in the input streams, the problem of finding the optimal sequence (up to a given time) of join state management decisions under max-subset can be cast as a network flow problem, and can be solved offline in time polynomial in the length of the sequence and the size of the cache [15]. In practice, however, we need an online algorithm that does not have perfect knowledge of the future. Unfortunately, without any knowledge (statistical or otherwise) of the input streams, no online algorithm—not even a randomized one—can be k-competitive (i.e., generating at least 1/k as many tuples as an optimal offline algorithm) for any k independent of the length of the input streams [37]. This hardness result highlights the need to exploit statistical properties of the input streams. Next, we review previous work in this area, starting with specific scenarios and ending with a general approach.


Frequency-Based Model. In the frequency-based model, the join attribute value of each new tuple arriving in a stream is drawn independently from a probability distribution that is stationary (i.e., it does not change over time). This model is made explicit by [37] and discussed as a special case by [45], although it has been implicitly assumed in the development of many join state management techniques. Under this model, assuming an unwindowed stream equijoin, we can calculate the benefit of a tuple s as the product of the partner stream arrival rate and the probability that a new partner stream tuple has the same join attribute value as s. This benefit measures how many output tuples s is expected to generate per unit time in the future. A straightforward strategy is to replace the tuple with the lowest benefit. This strategy, called PROB, was proposed by [15], and can be easily shown to be optimal for unwindowed joins. For sliding-window joins, an alternative strategy called LIFE was proposed, which weighs a tuple's benefit by its remaining lifetime in the sliding window. Unfortunately, neither PROB nor LIFE is known to be optimal for sliding-window joins. To illustrate, suppose that we are faced with the choice between two tuples s1 and s2, where s1 has a higher probability of joining with an incoming tuple, but s2 has a longer lifetime, allowing it to generate more output tuples than s1 eventually. PROB would prefer s1 while LIFE would prefer s2; however, neither choice is always better, as we will see later in this section.
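
The benefit computations behind PROB and LIFE are simple enough to state directly. The sketch below assumes the frequency estimates are available (e.g., from observed statistics); the function names and the dictionary-based frequency table are illustrative assumptions.

```python
def prob_benefit(value, partner_rate, partner_value_freq):
    """PROB-style benefit under the frequency-based model: the partner stream's
    arrival rate times the probability that a new partner tuple carries the
    same join attribute value."""
    return partner_rate * partner_value_freq.get(value, 0.0)

def life_benefit(value, remaining_lifetime, partner_rate, partner_value_freq):
    """LIFE-style benefit for sliding-window joins: weigh the per-unit-time
    benefit by the tuple's remaining lifetime in the window."""
    return remaining_lifetime * prob_benefit(value, partner_rate, partner_value_freq)

def evict_lowest_benefit(cached_values, benefit_of):
    """Replacement rule used by both strategies: discard the cached tuple
    with the lowest estimated benefit. benefit_of maps a value to a score."""
    victim = min(cached_values, key=benefit_of)
    cached_values.remove(victim)
    return victim

# Example: partner arrives at 5 tuples per unit time; value 'a' is far more common.
freq = {"a": 0.6, "b": 0.1}
cache = ["a", "b"]
evict_lowest_benefit(cache, lambda v: prob_benefit(v, 5.0, freq))
print(cache)   # ['a'] -- PROB keeps the tuple more likely to join
```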

The frequency-based model is also implicitly assumed by [38] in developing the RPJ (rate-based progressive join) algorithm. RPJ stores the in-memory portion of each input stream's join state as a hash table, and maintains necessary statistics for each hash partition; statistics for individual join attribute values within each partition are computed assuming local uniformity. When RPJ runs out of memory, it flushes the partition with the lowest benefit out to disk. This strategy is analogous to PROB.

Kang et al. [30] assumed a simplified version of the frequency-based model, where each join attribute value occurs with equal frequency in both input streams (though stream arrival rates may differ). With this simplification, the optimal strategy is to prefer keeping the slower stream in memory, because the tuples from the slower stream get more opportunities to join with an incoming partner stream tuple. This strategy is also consistent with PROB. More generally, random load shedding [39, 4], or RAND [15], which simply discards input stream tuples at random, is also justifiable under the max-subset measure by this equal-frequency assumption.

Age-Based Model. The age-based model of [37] captures a scenario where the stationarity assumption of the frequency-based model breaks down because of correlated tuple arrivals in the input streams. Consider the Auction and Bid example from Section 3.1. A recent Auction tuple has a much better chance of joining with a new Bid than an old Auction tuple. Furthermore, we may be able to assume that the bids for each open auction follow a similar arrival pattern. The age-based model states that, for each tuple s in stream S (with partner stream S′), the expected number of output tuples that s generates at each time step during its lifetime is given by a function p(∆t), where ∆t denotes the age of the tuple (i.e., how long it has been in the join state). The age-based model further assumes that the function p(∆t) is the same for all tuples from the same stream, independent of their join attribute values. At first glance, this assumption may appear quite strong: If we consider two tuples with the same join attribute value arriving at two different times t1 and t2, we should have p(t − t1) = p(t − t2) for all t when both tuples are in the join state, which would severely limit the form of the function p. However, this issue will not arise, for example, if the join attribute is a key of the input stream (e.g., Auction). Because foreign-key joins are so common, the age-based model may be appropriate in many settings.

An optimal state management strategy for the age-based model, called AGE, was developed by [37]. Given the function p(∆t), AGE calculates an optimal age ∆to such that the expected number of output tuples generated by a tuple per unit time is maximized when it is kept until age ∆to. Intuitively, if every tuple in the cache is kept for exactly ∆to time steps, then we are making the most efficient use of every slot in the cache. This optimal strategy is possible if the arrival rate is high enough to keep every cache slot occupied. If not, we can keep each tuple to an age beyond the optimal, which would still result in an optimal strategy assuming that p(∆t) has no local minima. A heuristic strategy for the case where p(∆t) has local minima is also provided in [37].
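
One plausible way to compute such an optimal age, assuming p(∆t) is available and the search can be bounded by a hypothetical max_age, is to maximize the cumulative expected output per time step of cache-slot occupancy, as in the sketch below.

```python
def optimal_age(p, max_age):
    """Sketch of an optimal-age computation for the age-based model: p(dt)
    gives the expected output a tuple generates at age dt; pick the age that
    maximizes cumulative output divided by the time the slot is occupied.
    max_age bounds the search and is an assumption of this sketch."""
    best_age, best_rate, cumulative = 1, float("-inf"), 0.0
    for age in range(1, max_age + 1):
        cumulative += p(age)
        rate = cumulative / age
        if rate > best_rate:
            best_age, best_rate = age, rate
    return best_age

# Example: output probability decays with age, as for a recent Auction tuple.
print(optimal_age(lambda dt: 0.5 ** dt, max_age=100))
```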

Towards General Stochastic Models. There are many situations where the input stream follows neither the frequency-based model nor the age-based model. For example, consider a measurement stream S1 generated by a network of sensors. Each stream tuple carries a timestamp tm recording the time at which the measurement was taken by the sensor (which is different from the stream timestamp t). Because of processing delays at the sensors, transmission delays in the network, and a network protocol with no in-order delivery guarantee (e.g., UDP), the tm values do not arrive in order, but may instead follow a discretized bounded normal distribution centered at the current time minus the average latency. Figure 10.1 shows the pdf (probability density function) of this distribution, which moves right as time progresses. Suppose there is a second stream S2 of timestamped measurements of a different type coming from another network of sensors, which is slower and less reliable. The resulting distribution has a higher variance and looser bounds, and lags slightly behind that of S1. To correlate measurements from S1 and S2 by time, we use an equijoin on tm.


Figure 10.1. Drifting normal distributions. (Figure: pdfs of the join attribute value for streams S1 and S2 at time t0, with S2 tuples x, y, and z marked; x-axis: join attribute value, y-axis: probability.)

Figure 10.2. Example ECBs. (Figure: expected cumulative benefit as a function of ∆t, showing the ECB curves Bx, By, and Bz of tuples x, y, and z.)

Intuitively, as the pdf curve for S2 moves over the join attribute value of a cached S1 tuple, this tuple gets a chance of joining with each incoming S2 tuple, with a probability given by S2's pdf at that time. Clearly, these streams do not follow the frequency-based model, because the frequency of each tm value varies over time. They do not follow the age-based model either, because each arriving tuple may have a different p(∆t) function, which depends on the location of the partner stream's pdf. Blindly applying a specific join state management strategy without verifying its underlying assumption may lead to very poor performance. To illustrate, consider the two tuples x and y from stream S2 in Figure 10.1 currently cached at time t0 as part of the join state. Which tuple should we choose to discard when we are low on memory? Intuitively, it is better to discard y since it has almost already "missed" the moving pdf of S1 and is therefore unlikely to join with future S1 tuples. Unfortunately, if we use the past to predict the future, PROB might make the exact opposite decision: y would be kept because it probably has joined more times with S1 in the past than x.

Work by Xie et al. [45] represents a first step towards developing general techniques to exploit a broader class of statistical properties, without being tied to particular models or assumptions. A general question posed by [45] is, given the stochastic processes modeling the join attribute values of the input stream tuples, what join state management strategy has the best expected performance? In general, the stochastic processes can be non-stationary (e.g., the join attribute value follows a random walk, or its mean drifts over time) and correlated (e.g., if one stream has recently produced a certain value, then it becomes more likely for the other stream to produce the same value).

Knowing the stochastic processes governing the input streams gives us considerable predictive power, but finding the optimal join state management strategy is still challenging. A brute-force approach (called FlowExpect in [45]) would be the following. Conceptually, starting from the current time and the current join state, we enumerate all possible sequences of future "state management actions" (up to a given length), calculate the expected number of output tuples for each sequence, and identify the optimal sequence. This search problem can be formulated and solved as a network flow problem. The first action in the optimal sequence is taken at the current time. As soon as any new tuple arrives, we solve the problem again with the new join state, and take the first action in the new optimal sequence. The process then repeats. Interestingly, Xie et al. [45] showed that it is not enough to consider all possible sequences of unconditional state management actions; we must also consider strategies that make actions conditional upon the join attribute values of future tuples. An example of a conditional action at a future time t might be: "If the new tuple arriving at t has value 100 as its join attribute, then discard the new tuple; otherwise use it to replace the tuple currently occupying the fifth cache slot." Unfortunately, searching through the enormous space of conditional action sequences is not practical. Therefore, we need to develop simpler, more practical approaches.

It turns out that under certain conditions, the best state management action is clear. Xie et al. [45] developed an ECB dominance test (or dom-test for short) to capture these conditions. From the stochastic processes governing the input streams, we can compute a tuple s's ECB (expected cumulative benefit) with respect to the current time t0 as a function Bs(∆t), which returns the number of output tuples that s is expected to generate over the period (t0, t0 + ∆t]. As a concrete example, Figure 10.2 plots the ECBs of tuples x, y, and z from stream S2 in Figure 10.1. Intuitively, we prefer removing tuples with the "lowest" ECBs from the cache. The dom-test states that, if the ECB of tuple s1 dominates that of tuple s2 (i.e., Bs1(∆t) ≥ Bs2(∆t) for all ∆t > 0), then keeping s1 is better than or equally good as keeping s2. For example, from Figure 10.2, we see that tuple y is clearly the least preferable among the three. However, because the ECBs of x and z cross over, the dom-test is silent on the choice between x and z. To handle "incomparable" ECBs such as these, Xie et al. proposed a heuristic measure that combines the ECB with a heuristic "survival probability" function Ls(∆t) estimating the probability for tuple s to be still cached at time t0 + ∆t. Intuitively, if we estimate that x and z will be replaced before the time when their ECBs cross, then x is more preferable; otherwise, z is more preferable. Although the heuristic strategy cannot guarantee optimality in all cases, it always agrees with the decision of the dom-test whenever that test is applicable.
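
The dominance check itself is straightforward once ECBs are available as functions of ∆t. The sketch below evaluates the dom-test over a finite horizon, which is an assumption made to keep the check computable; the example ECB shapes loosely mimic tuples x, y, and z of Figure 10.2.

```python
def dominates(ecb_a, ecb_b, horizon):
    """ECB dominance test (sketch): a's ECB dominates b's if
    B_a(dt) >= B_b(dt) for every dt > 0 up to the evaluation horizon."""
    return all(ecb_a(dt) >= ecb_b(dt) for dt in range(1, horizon + 1))

def preferred_victim(candidates, horizon):
    """Return a cached tuple whose ECB is dominated by some other candidate's
    ECB (a safe tuple to discard); None if the dom-test is silent."""
    for name_a, ecb_a in candidates.items():
        for name_b, ecb_b in candidates.items():
            if name_a != name_b and dominates(ecb_b, ecb_a, horizon):
                return name_a
    return None

# Tuples x, y, z as in Figure 10.2: y's ECB is dominated; x and z cross over.
ecbs = {"x": lambda dt: min(5, dt * 1.0),
        "y": lambda dt: min(1, dt * 0.5),
        "z": lambda dt: min(8, dt * 0.4)}
print(preferred_victim(ecbs, horizon=50))   # -> "y"
```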

It is instructive to see how the general techniques above apply to specific scenarios. To begin, consider the simple case of unwindowed stream joins under the frequency-based model. The ECB of a tuple s is simply a linear function Bs(∆t) = b(s)∆t, where b(s) is the number of output tuples that s is expected to generate per unit time, consistent with the definition of "benefit" discussed earlier in the context of the frequency-based model. Obviously, for two tuples s1 and s2, s1's ECB dominates s2's ECB if and only if b(s1) ≥ b(s2). Therefore, the dom-test basically yields PROB, and provides a proof of its optimality. The case of sliding-window joins is considerably more complex, and as discussed earlier, the optimal state management strategy is not known. As illustrated by Figure 10.3, the ECB of a tuple s consists of two connected pieces: The first piece has slope b(s), while the second piece is flat and begins at the time l(s) when s will drop out of the sliding window. While the dom-test does not help in comparing s1 and s2 in Figure 10.3, some insight can still be gained from their ECBs. Suppose we decide to cache s1 and discard s2. If at time l(s1) when s1 exits the join state, a new tuple will be available to take its place and produce at least Bs2(l(s2)) − Bs1(l(s1)) output tuples during (l(s1), l(s2)], then our decision is justified. Still, the exact condition that guarantees the optimality of the decision is complex, and will be an interesting problem for further investigation.
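
For concreteness, the two-piece ECB described above can be written out explicitly. This is a formula inferred from the discussion rather than one given in the cited work, with t0 denoting the current time:

```latex
% Sliding-window ECB under the frequency-based model (sketch):
% b(s) is the per-unit-time benefit and l(s) the time when s exits the window.
\[
  B_s(\Delta t) \;=\; b(s)\cdot\min\bigl(\Delta t,\; l(s) - t_0\bigr),
  \qquad \Delta t > 0 .
\]
```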

Finally, let us try applying the ECB-based analysis to the age-based model, for which we know that AGE [37] is optimal. Under the age-based model, every new tuple has the same ECB B0(∆t) at the time of its arrival. As the tuple ages in the cache, its ECB "shifts": The ECB of a tuple at age t is Bt(∆t) = B0(t + ∆t) − B0(t). For some shapes of B0, it is possible to have ECBs that are not comparable by the dom-test. Figure 10.4 illustrates one such example; the marks on the ECB curves indicate when the respective tuples reach their optimal ages. Between the two tuples old and new in Figure 10.4, the correct decision (by AGE) is to ignore the new tuple and keep caching the old tuple (until it reaches its optimal age). Unfortunately, however, the dom-test is unable to come to any conclusion, for the following two reasons. First, the dom-test actually provides a stronger optimality guarantee than AGE: The dom-test guarantees the optimality of its decisions over any time period; in contrast, AGE is optimal when the period tends to infinity. Second, the dom-test examines only the two ECBs in question and does not make use of any global information.

Figure 10.3. ECBs for sliding-window joins under the frequency-based model. (Figure: expected cumulative benefit vs. time t; the ECB curves BS1 and BS2 flatten at l(S1) and l(S2) respectively, with the annotation "new tuple(s) available?".)

Figure 10.4. ECBs under the age-based model. (Figure: expected cumulative benefit vs. time t for an old tuple and a new tuple, with the annotation "new tuple available?".)


However, in order to realize that replacing the old tuple is not worthwhile in Figure 10.4, we need to be sure that when we do discard old when it reaches its optimal age, there will be a new tuple available at that time with high enough benefit to make up for the loss in discarding new earlier. Indeed, the age-based model allows us to make this conclusion from its assumption that every incoming tuple has the same ECB. It remains an open problem to develop better, general techniques to overcome the limitations of the dom-test without purposefully "special-casing" for specific scenarios.

An important practical problem is that we may not know in advance the parameter values of the stochastic processes modeling the input streams. One possibility is to use existing techniques to compute stream statistics online, or offline over the history of observations. Another approach proposed by [45] is to monitor certain statistics of the past behavior of the cache and input streams, and use them to estimate the expected benefit of caching. A notable feature of the proposed method is that it considers the form of the stochastic process in order to determine what statistics to monitor. This feature is crucial because, for time-dependent processes, the past is not always indicative of the future. For example, suppose that the join attribute values in a stream follow a distribution whose shape is stationary but whose mean is drifting over time. Simply tracking the frequency of each value is not meaningful as it changes all the time. Instead, we can subtract the current mean from each observed value, and track the frequency of these offset values, which will remain constant over time.
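
A small sketch of this offset-based statistic is given below; the running-mean estimator, smoothing constant, and binning are assumptions of the sketch rather than details from [45].

```python
class OffsetFrequencyTracker:
    """Sketch of the statistic suggested in the text for a distribution whose
    shape is stationary but whose mean drifts: subtract the current mean from
    each observed join attribute value and track frequencies of the offsets."""

    def __init__(self, smoothing=0.05, bin_width=1.0):
        self.mean = None
        self.smoothing = smoothing
        self.bin_width = bin_width
        self.offset_counts = {}

    def observe(self, value):
        # Exponentially weighted running estimate of the drifting mean.
        self.mean = value if self.mean is None else \
            (1 - self.smoothing) * self.mean + self.smoothing * value
        offset = round((value - self.mean) / self.bin_width)
        self.offset_counts[offset] = self.offset_counts.get(offset, 0) + 1

    def offset_frequency(self, value):
        # Estimated frequency of the offset corresponding to a value right now.
        if self.mean is None:
            return 0.0
        offset = round((value - self.mean) / self.bin_width)
        total = sum(self.offset_counts.values())
        return self.offset_counts.get(offset, 0) / total if total else 0.0
```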

One direction for future research is to investigate how statistical properties of the input stream propagate to the output of stream joins. This problem is important if we want to apply the techniques in this section to more complex stream queries where the output of a stream join may be the input to another. While there has been some investigation of the non-streaming version of this problem related to its application in query optimization, there are many statistical properties unique to streams (e.g., trends, orderedness, clusteredness) whose propagation through queries is not yet fully understood.

Relationship to Classic Caching. A natural question is how the stream join state management problem differs from the classic caching problem. Many cache replacement policies have been proposed in the past, e.g., LFD (longest-forward distance), LRU (least-recently used), LFU (least-frequently used), etc. All seem applicable to our problem. After all, our problem comes down to deciding what to retain in a cache to serve as many "reference requests" by the partner stream as possible. As pointed out by [45], there is a subtle but important difference between caching stream tuples and caching regular objects. When caching regular objects, we can recover from mistakes easily: The penalty of not caching an object is limited to a single cache miss, after which the object would be brought in and cached if needed. In contrast, in the case of stream join state management, a mistake can cost a lot more: If we discard a tuple completely from the join state, it is irrevocably gone, along with all output tuples that it could generate in the future. This difference explains why LFD, the optimal replacement policy for classic caching, turns out to be suboptimal for stream join state management, where references beyond the first one also matter.

Even if we relax the single-pass stream processing model and allow state to be spilled out to disk, join state management still differs from classic caching. The reason is that stream processing systems, e.g., those running the XJoin-family of algorithms, typically recover missing output tuples by processing flushed tuples later offline. In other words, "cache misses" are not processed online as in classic caching—random disk accesses may be too slow for stream applications, and just to be able to detect that there has indeed been a cache miss (as opposed to a new tuple that does not join with any previous arrivals) requires maintaining extra state.

Despite their differences, classic caching and stream join state management can be tackled under the same general analytical framework proposed by [45]. In fact, classic caching can be reduced to stream join state management, and can be analyzed using ECBs, in some cases yielding provably optimal results that agree with or extend classic ones. Such consistency is evidence of the strong link between the two problems, and a hint that some results on classic caching could be brought to bear on the state management problem for stream joins.

Joining with Database Relation. Interestingly, unlike the case of joining two streams, state management for joining a stream and a database relation under the non-retroactive relation semantics (Section 2) is practically identical to classic caching [45]. First, it is easy to see that there is no benefit at all in caching any stream tuples, because under the non-retroactive relation semantics they do not join with any future updates to the relation. On the other hand, for tuples in the database relation, their current version can be cached in fast memory to satisfy reference requests by stream tuples. Upon a cache miss, the disk-resident relation can be probed. It would be interesting to investigate whether it makes sense to defer handling of misses to XJoin-style "mop-up" phases. However, care must be taken to avoid joining old stream tuples with newer versions of the database relation.

The Sampling-Rate Measure. By design, join state management strategies optimized for max-subset favor input tuples that are more likely to join with the partner stream, causing such tuples to be overrepresented in the result. While this bias is not a problem in many contexts, it can be an issue if a statistically meaningful sample is desired, e.g., to obtain unbiased statistics of the join result. In this case, we should use the sampling-rate measure.


Getting an unbiased random sample of a join result has long been recognized as a difficult problem [34, 12]. The straightforward approach of sampling each input uniformly and then joining them does not work—the result may be arbitrarily skewed and small compared with the actual result. The hardness result from [37] states that for arbitrary input streams, if the available memory is insufficient to retain the entire sliding windows (or the entire history, if the join is unwindowed), then it is impossible to guarantee a uniform random sample for any nonzero sampling rate. The problem is that any tuple we choose to discard may turn out to be the one that will generate all subsequent output tuples.

Srivastava and Widom [37] developed a procedure for generating unbiased random samples of join results under the frequency-based and age-based models. The procedure requires knowledge of the model parameters, and uses them to determine the maximum sampling rate under the constraint that the probability of running out of memory at runtime is sufficiently small. The procedure keeps each stream tuple in the join state until the tuple will not contribute any more result tuples to the sample. Note that not all join result tuples that can be obtained from the join state will actually be output—many may need to be discarded in order to keep the sample unbiased. This inefficient use of resources is unavoidable because of the stringent requirement of a truly random sample. A statistically weaker form of sampling called cluster sampling, which uses resources more efficiently, was also considered by [37]. Cluster sampling is still unbiased, but is no longer independent; i.e., the inclusion of tuples is not independent of each other. Which type of sampling is appropriate depends on how the join result will be used.

4. Fundamental Algorithms for Stream Join Processing

Symmetric hash join (SHJ) is a simple hashing-based join algorithm, which has been used to support highly pipelined processing in parallel database systems [44]. It assumes that the entire join state can be kept in main memory; the join state for each input stream is stored in a hash table. For each incoming S tuple, SHJ inserts it into the hash table for S, and uses it to probe the hash table for the partner stream of S to identify joining tuples. SHJ can be extended to support the sliding-window semantics and the join state management strategies in Section 3, though SHJ is limited to the single-pass stream processing model. Golab et al. [22] developed main-memory data structures especially suited for storing sliding windows, with efficient support for removing tuples that have fallen out of the sliding windows.
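
A compact sketch of SHJ extended with a time-based sliding window is shown below. It is illustrative only: purging expired tuples by scanning hash buckets is a simplification of the purpose-built window structures of [22], and the class and parameter names are assumptions.

```python
from collections import defaultdict, deque

class SymmetricHashJoin:
    """Sketch of a symmetric hash join with a time-based sliding window.
    Each stream's join state is a hash table on the join attribute; a new
    tuple is inserted into its own table and probed against the partner's."""

    def __init__(self, window):
        self.window = window
        self.state = [defaultdict(deque), defaultdict(deque)]  # one table per stream

    def process(self, stream_id, key, payload, t):
        own, partner = self.state[stream_id], self.state[1 - stream_id]
        # Expire partner tuples that have fallen out of the window.
        for bucket in partner.values():
            while bucket and bucket[0][0] < t - self.window:
                bucket.popleft()
        # Insert the new tuple into its own join state.
        own[key].append((t, payload))
        # Probe the partner's join state and emit matching pairs.
        return [(payload, p) for (_, p) in partner.get(key, ())]

shj = SymmetricHashJoin(window=10)
shj.process(0, key="pkt42", payload="router1", t=1)
print(shj.process(1, key="pkt42", payload="router2", t=5))   # one join result
print(shj.process(1, key="pkt42", payload="router2", t=20))  # window expired
```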

Both XJoin [41] and DPHJ (double pipelined hash join) of Tukwila [29] extend SHJ by allowing parts of the hash tables to be spilled out to disk for later processing. This extension removes the assumption that the entire join state must be kept in memory, greatly enhancing the applicability of the algorithm. Tukwila's DPHJ processes disk-resident tuples only when both inputs are exhausted; XJoin schedules joins involving disk-resident tuples whenever the inputs are blocked, and therefore is better suited for stream joins with unbounded inputs. One complication is the possibility of producing duplicate output tuples. XJoin pioneers the use of timestamp marking for detecting duplicates. Timestamps record the period when a tuple was in memory, and the times when a memory-resident hash partition was used to join with the corresponding disk-resident partition of the partner stream. From these timestamps, XJoin is able to infer which pairs of tuples have been processed before.

XJoin is the basis for many stream join algorithms developed later, e.g., [33, 38]. RPJ [38] is the latest in the series. One of the main contributions of RPJ, discussed earlier in Section 3.2, is a statistics-based flushing strategy that tries to keep in memory those tuples that are more likely to join. In contrast, XJoin flushes the largest hash partition; HMJ (hash merge join) of [33] always flushes corresponding partitions together, and tries to balance memory allocation between incoming streams. Neither XJoin nor HMJ takes tuple join probabilities into consideration. Unlike HMJ, which joins all previously flushed data whenever entering a disk-join phase, RPJ breaks down the work into smaller units, which offer more scheduling possibilities. In particular, RPJ also uses statistics to prioritize disk-join tasks in order to maximize output rate.

There are a number of interesting open issues. First, can we exploit statistics better by allowing flushing of individual tuples instead of entire hash partitions? This extension would allow us to apply the fine-grained join state management techniques from Section 3.2 to the XJoin family of algorithms. However, the potential benefits must be weighed against the overhead in statistics collection and bookkeeping to avoid duplicates. Second, is it ever beneficial to reintroduce a tuple that has been previously flushed to disk back into memory? Again, what would be the bookkeeping overhead involved? Third, can we develop better statistics collection methods for RPJ? Currently, it maintains statistics on the partition level, but the hash function may map tuples with very different statistics to the same partition.

Sorting-based join algorithms, such as the sort-merge join, have traditionally been deemed inappropriate for stream joins, because sorting is a blocking operation that requires seeing the entire input before producing any output. To circumvent this problem, Dittrich et al. [20] developed an algorithm called PMJ (progressive merge join) that is sorting-based but non-blocking. In fact, both RPJ and HMJ use PMJ for joining disk-resident parts of the join state. The idea of PMJ is as follows. During the initial sorting phase that creates the initial runs, PMJ sorts portions of both input streams in parallel, and immediately produces join result tuples from the corresponding runs that are in memory at the same time. During the subsequent merge phases that merge shorter runs into longer ones, PMJ again processes both input streams in parallel, and joins them while their runs are in memory at the same time. To ensure that the output contains no duplicates, PMJ does not join tuples from corresponding shorter runs that have been joined in a previous phase; the duplicate avoidance logic is considerably simpler than that of XJoin. Of course, PMJ pays some price for its non-blocking feature: it incurs a moderate amount of overhead compared to the basic sort-merge join. On the other hand, PMJ also inherits the advantages of sorting-based algorithms over hashing-based algorithms, including in particular the ability to handle non-equality joins. A more thorough performance comparison between PMJ and XJoin for equijoins would be very useful.

5. Optimizing Stream Joins

Optimizing Response Time. Viglas and Naughton [42] introduced the notion of rate-based optimization and considered how to estimate the output rate of stream operators. An important observation is that standard cost analysis based on total processing cost is not applicable in the stream setting, because infinite costs resulting from unbounded inputs cannot be compared directly. Even if one can "hack" the analysis by assuming a large (yet bounded) input, classic analysis may produce an incorrect estimate of the output rate since it ignores the rate at which inputs (or intermediate result streams) are arriving. Specifically, classic analysis assumes that input is available at all times, but in practice operators could be blocked by the input. The optimization objectives considered by [42] are oriented towards response time: For a stream query, how can we produce the largest number of output tuples in a given amount of time, or produce a given number of output tuples in the shortest amount of time?

As an example of response-time optimization, Hammad et al. [28] studied shared processing of multiple sliding-window joins, focusing on developing scheduling strategies aimed at reducing response times across queries. More broadly speaking, work on non-blocking join algorithms, e.g., XJoin and PMJ discussed earlier, also incorporates response-time considerations.

Optimizing Unit-Time Processing Cost. Kang et al. [30] were among the first to focus specifically on optimization of stream joins. They made the same observation as in [42] that optimizing the total processing cost is no longer appropriate with unbounded input. However, instead of optimizing response time, they propose to optimize the processing cost per unit time, which is equivalent to the average processing cost per tuple weighted by the arrival rate. Another important observation made by [30] is that the best processing strategy may be asymmetric; i.e., different methods may be used for joining a new S1 tuple with S2's join state and for joining a new S2 tuple with S1's join state. For example, suppose that S1 is very fast and S2 is very slow. We may index S2's join state as a hash table while leaving S1's join state not indexed. The reason for not indexing S1 is that its join state is frequently updated (because S1 is fast) but rarely queried (because S2 is slow).
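A small sketch of such an asymmetric strategy is given below; it is an assumption-laden illustration (the names and data layout are invented for the example), not the algorithm of [30]: the slow stream's state is hash-indexed because it is probed often, while the fast stream's state is an unindexed list that the rare slow-stream arrivals scan linearly.

    from collections import defaultdict

    hash_state_slow = defaultdict(list)   # join state of the slow stream S2, indexed by join key
    list_state_fast = []                  # join state of the fast stream S1, left unindexed

    def on_fast_tuple(t, key):
        """Frequent S1 arrivals: a cheap append, then an O(1) probe of the indexed S2 state."""
        list_state_fast.append((key, t))
        return [(t, s) for s in hash_state_slow.get(key, [])]

    def on_slow_tuple(s, key):
        """Rare S2 arrivals: index maintenance plus a linear scan of the unindexed S1 state."""
        hash_state_slow[key].append(s)
        return [(t, s) for (k, t) in list_state_fast if k == key]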

Ayad and Naughton [4] provided more comprehensive discussions of optimization objectives for stream queries. An important observation is that, given enough processing resources, the steady-state output rate of a query is independent of the execution plan and therefore should not be the objective of query optimization; a cost-based objective should be used in this case instead. Another interesting point is that load shedding considerations should be incorporated into query optimization: If we simply shed load from a plan that was originally optimized assuming sufficient resources, the resulting plan may be suboptimal.

Optimizing Multi-Way Stream Joins. XJoin can be used to implement multi-way joins in a straightforward manner. For instance, a four-way join among S1, S2, S3, and S4 can be implemented as a series of XJoins, e.g., ((S1 XJoin S2) XJoin S3) XJoin S4. Since XJoin needs to store both of its inputs in hash tables, the example plan above in effect materializes the intermediate results S1 XJoin S2 and (S1 XJoin S2) XJoin S3. An obvious disadvantage of this plan is that these intermediate results can become quite large and costly to maintain. Another disadvantage is that this plan is static and fixes the join order. For example, a new S3 tuple must be joined with the materialized S1 XJoin S2 first, and then with S4; the option of joining the new S3 tuple first with S4 is simply not available.

Viglas et al. [43] proposed MJoin to combat the above problems. MJoin maintains a hash table for each input involved in the multi-way join. When a tuple arrives, it is inserted into the corresponding hash table, and then used to probe all other hash tables in some order. This order can be different for tuples from different input streams, and can be determined based on join selectivities. Similar to XJoin, MJoin can flush join state out to disk when low on memory. Flushing is random (because of the assumption of a simple statistical model), but for the special case of star joins (where all streams join on the same attribute), flushing is "coordinated": When flushing one tuple, joining tuples from other hash tables are also flushed, because no output tuples can be produced unless joining tuples are found in all other hash tables. Note that coordinated flushing does not bring the same benefit for binary joins, because in this case output tuples are produced by joining an incoming tuple with the (only) partner stream hash table, not by joining two old tuples from different hash tables.
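The probing pattern of an MJoin-style operator can be sketched as follows; for readability the example assumes a star join on a single attribute and hard-codes the per-stream probe orders, whereas MJoin derives them from join selectivities, and all windowing and flushing machinery is omitted.

    from collections import defaultdict

    streams = ["S1", "S2", "S3", "S4"]
    tables = {s: defaultdict(list) for s in streams}                      # one hash table per input
    probe_order = {s: [t for t in streams if t != s] for s in streams}    # fixed orders, for readability

    def on_arrival(stream, tup, key):
        """Insert the new tuple, then probe the other hash tables in this stream's probe order.
        If some table has no partner for the key, the partial result is abandoned early."""
        tables[stream][key].append(tup)
        partial = [(tup,)]
        for other in probe_order[stream]:
            matches = tables[other].get(key, [])
            if not matches:
                return []                      # early termination: no output can be produced
            partial = [p + (m,) for p in partial for m in matches]
        return partial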

Finding the optimal join order in MJoin is challenging. A simple heuristic that tracks selectivity for each hash table independently would have trouble with the following issues: (1) Selectivities can be correlated; e.g., a tuple that already joins with S1 will be more likely to join with S2. (2) Selectivities may vary among individual tuples; e.g., one tuple may join with many S1 tuples but few S2 tuples, while another tuple may behave in the exact opposite way. The first issue is tackled by [5], who provided a family of algorithms for adaptively finding the optimal order to apply a series of filters (joining a tuple with a stream can be regarded as subjecting the tuple to a filter) through runtime profiling. In particular, the A-Greedy algorithm is able to capture correlations among filter selectivities, and is guaranteed to converge to an ordering within a constant factor of the optimal. The theoretical guarantee extends to star joins; for general join graphs, though A-Greedy still can be used, the theoretical guarantee no longer holds. The second issue was recently addressed by an approach called CBR [9], or content-based routing, which makes the choice of query plan dependent on the values of the incoming tuple's "classifier attributes," whose values strongly correlate with operator selectivities. In effect, CBR is able to process each incoming tuple with a customized query plan.
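The following simplified sketch conveys the flavor of selectivity-driven ordering: it profiles each filter's observed pass rate and probes the most selective filter first. It does not implement A-Greedy itself (in particular, it ignores correlations among selectivities); the predicates and names are illustrative assumptions.

    class ProfiledFilter:
        """Wraps a predicate and tracks its observed pass rate on the tuples routed through it."""

        def __init__(self, name, predicate):
            self.name, self.predicate = name, predicate
            self.seen, self.passed = 0, 0

        def __call__(self, tup):
            self.seen += 1
            ok = self.predicate(tup)
            self.passed += ok
            return ok

        def pass_rate(self):
            return self.passed / self.seen if self.seen else 1.0

    def route(tup, filters):
        """Apply the filters in increasing order of observed pass rate (drop-fast-first heuristic)."""
        for f in sorted(filters, key=lambda f: f.pass_rate()):
            if not f(tup):
                return False
        return True

    # two hypothetical predicates standing in for "joins with S1" / "joins with S2"
    fs = [ProfiledFilter("f1", lambda t: t % 2 == 0), ProfiledFilter("f2", lambda t: t % 10 == 0)]
    survivors = [t for t in range(100) if route(t, fs)]   # only multiples of 10 survive both filters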

One problem with MJoin is that it may incur a significant amount of recomputation. Consider again the four-way join among S1, . . . , S4, now processed by a single MJoin operator. Whenever a new tuple s3 arrives in S3, MJoin in effect executes the query S1 ⊲⊳ S2 ⊲⊳ s3 ⊲⊳ S4; similarly, whenever a new tuple s4 arrives in S4, MJoin executes S1 ⊲⊳ S2 ⊲⊳ S3 ⊲⊳ s4. The common subquery S1 ⊲⊳ S2 is processed over and over again for these S3 and S4 tuples. In contrast, the XJoin plan ((S1 XJoin S2) XJoin S3) XJoin S4 materializes all its intermediate results in hash tables, including S1 ⊲⊳ S2; new tuples from S3 and S4 simply have to probe this hash table, thereby avoiding recomputation. The optimal solution may well lie between these two extremes, as pointed out by [6]. They proposed an adaptive caching strategy, A-Caching, which starts with MJoins and adds join subresult caches adaptively. A-Caching profiles cache benefit and cost online, selects caches dynamically, and allocates memory to caches dynamically. With this approach, the entire spectrum of caching options from MJoins to XJoins can be explored.

A number of other papers also consider multi-way stream joins. Golab and Özsu [23] studied processing and optimization of multi-way sliding-window joins. Traditionally, we eagerly remove (expire) tuples that are no longer part of the sliding window, and eagerly generate output tuples whenever input arrives. The authors proposed algorithms supporting lazy expiration and lazy evaluation as alternatives, which achieve higher efficiency at the expense of higher memory requirements and longer response times, respectively. Hammad et al. [27] considered multi-way stream joins where a time-based window constraint can be specified for each pair (or, in general, subset) of input streams. An interesting algorithm called FEW is proposed, which computes a forward point in time before which all arriving tuples can join, thereby avoiding repeated checking of window constraints.

Eddies [3] are a novel approach towards stream query processing and optimization that is markedly different from the standard plan-based approaches. Eddies eliminate query plans entirely by routing each input tuple adaptively across the operators that need to process it. Interestingly, in eddies, the behavior of SteM [36] mimics that of MJoin, while STAIRS [16] is able to emulate XJoin. Note that while eddies provide the mechanisms for adapting the processing strategy on an individual tuple basis, currently their policies typically do not result in plans that change for every incoming tuple. It would be nice to see how features of CBR can be supported in eddies.

6. Conclusion

In this chapter, we have presented an overview of research problems and recent advances in join processing for data streams. Stream processing is a young and exciting research area, yet it also has roots in and connections to well-established areas in databases as well as computer science in general. In Section 3.2, we have already discussed the relationship between stream join state management and classic caching. Now, let us briefly re-examine parts of this chapter in light of their relationship to materialized views [25].

The general connection between stream processing and materialized views has long been identified [8]. This connection is reflected in the way that we specify the semantics of stream joins: by regarding them as views and defining their output as the view update stream resulting from base relation updates (Section 2). Recall that the standard semantics requires the output sequence to reflect the exact sequence of states of the underlying view, which is analogous to the notion of complete and strong consistency of a data warehouse view with respect to its source relations [46]. The connection does not stop at the semantics. The problem of determining what needs to be retained in the state to compute a stream join is analogous to the problem of deriving auxiliary views to make a join view self-maintainable [35]. Just as constraints can be used to reduce stream join state (Section 3.1), they have also been used to help expire data from data warehouses without affecting the maintainability of warehouse views [21]. For a stream join S1 ⊲⊳ · · · ⊲⊳ Sn, processing an incoming tuple from stream Si is analogous to maintaining a join view incrementally by evaluating a maintenance query S1 ⊲⊳ · · · ⊲⊳ ∆Si ⊲⊳ · · · ⊲⊳ Sn. Since there are n different forms of maintenance queries (one for each i), it is natural to optimize each form differently, which echoes the intuition behind the asymmetric processing strategy of [30] and MJoin [43]. In fact, we can optimize the maintenance query for each instance of ∆Si, which would achieve the same goal of supporting a customized query plan for each tuple as CBR [9]. Finally, noticing that the maintenance queries run frequently and share many common subqueries, we may choose to materialize some subqueries as additional views to improve query performance, which is also what A-Caching [6] tries to accomplish.


Of course, despite high-level similarities, techniques from the two areas, data streams and materialized views, may still differ significantly in actual details. Nonetheless, it would be nice to develop a general framework that unifies both areas, or, less ambitiously, to apply ideas from one area to the other. Many such possibilities exist. For example, methods and insights from the well-studied problems of answering queries using views [26] and view selection [14] could be extended and applied to data streams: Given a set of stream queries running continuously in a system, what materialized views (over join states and database relations) and/or additional stream queries can we create to improve the performance of the system? Another area is distributed stream processing. Distributed stream processing can be regarded as view maintenance in a distributed setting, which has been studied extensively in the context of data warehousing. Potentially applicable in this setting are techniques for making warehouses self-maintainable [35], optimizing view maintenance queries across distributed sources [31], ensuring consistency of multi-source warehouse views [46], etc. Conversely, stream processing techniques can be applied to materialized views as well. In particular, view maintenance could benefit from optimization techniques that exploit update stream statistics (Section 3.2). Also, selection of materialized views for performance can be improved by adaptive caching techniques (Section 5).

Besides the future work directions mentioned above and throughout the chapter, another important direction worth exploring is the connection between data stream processing and distributed event-based systems [19] such as publish/subscribe systems. Such systems need to scale to thousands or even millions of subscriptions, which are essentially continuous queries over event streams. While efficient techniques for handling continuous selections already exist, scalable processing of continuous joins remains a challenging problem. Hammad et al. [28] considered shared processing of stream joins with identical join conditions but different sliding-window durations. We need to consider more general query forms, e.g., joins with different join conditions as well as additional selection conditions on input streams. NiagaraCQ [13] and CACQ [32] are able to group-process selections and share processing of identical join operations. However, there is no group or shared processing of joins with different join conditions, and processing selections separately from joins limits optimization potential. PSoup [11] treats queries as data, thereby allowing set-oriented processing of queries with arbitrary join and selection conditions. Still, new indexing and processing techniques must be developed for the system to be able to process each event in time sublinear in the number of subscriptions.


Acknowledgments

This work is supported by an NSF CAREER Award under grant IIS-0238386. We would also like to thank Shivnath Babu, Yuguo Chen, Kamesh Munagala, and members of the Duke Database Research Group for their discussions.

References

[1] Arasu, A., Babcock, B., Babu, S., McAlister, J., and Widom, J. (2002). Characterizing memory requirements for queries over continuous data streams. In Proceedings of the 2002 ACM Symposium on Principles of Database Systems, pages 221–232, Madison, Wisconsin, USA.

[2] Arasu, A., Babu, S., and Widom, J. (2003). The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, InfoLab, Stanford University.

[3] Avnur, R. and Hellerstein, J. M. (2000). Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 261–272, Dallas, Texas, USA.

[4] Ayad, A. and Naughton, J. F. (2004). Static optimization of conjunctive queries with sliding windows over infinite streams. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 419–430, Paris, France.

[5] Babu, S., Motwani, R., Munagala, K., Nishizawa, I., and Widom, J. (2004a). Adaptive ordering of pipelined stream filters. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 407–418, Paris, France.

[6] Babu, S., Munagala, K., Widom, J., and Motwani, R. (2005). Adaptive caching for continuous queries. In Proceedings of the 2005 International Conference on Data Engineering, Tokyo, Japan.

[7] Babu, S., Srivastava, U., and Widom, J. (2004b). Exploiting k-constraints to reduce memory overhead in continuous queries over data streams. ACM Transactions on Database Systems, 29(3):545–580.

[8] Babu, S. and Widom, J. (2001). Continuous queries over data streams. ACM SIGMOD Record.

[9] Bizarro, P., Babu, S., DeWitt, D., and Widom, J. (2005). Content-based routing: Different plans for different data. In Proceedings of the 2005 International Conference on Very Large Data Bases, Trondheim, Norway.

[10] Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Seidman, G., Stonebraker, M., Tatbul, N., and Zdonik, S. B. (2002). Monitoring streams - a new class of data management applications. In Proceedings of the 2002 International Conference on Very Large Data Bases, pages 215–226, Hong Kong, China.

[11] Chandrasekaran, S. and Franklin, M. J. (2003). PSoup: a system for streaming queries over streaming data. The VLDB Journal, 12(2):140–156.

[12] Chaudhuri, S., Motwani, R., and Narasayya, V. R. (1999). On random sampling over joins. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 263–274, Philadelphia, Pennsylvania, USA.

[13] Chen, J., DeWitt, D. J., Tian, F., and Wang, Y. (2000). NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 379–390, Dallas, Texas, USA.

[14] Chirkova, R., Halevy, A. Y., and Suciu, D. (2001). A formal perspective on the view selection problem. In Proceedings of the 2001 International Conference on Very Large Data Bases, pages 59–68, Roma, Italy.

[15] Das, A., Gehrke, J., and Riedewald, M. (2003). Approximate join processing over data streams. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 40–51, San Diego, California, USA.

[16] Deshpande, A. and Hellerstein, J. M. (2004). Lifting the burden of history from adaptive query processing. In Proceedings of the 2004 International Conference on Very Large Data Bases, pages 948–959, Toronto, Canada.

[17] Ding, L., Mehta, N., Rundensteiner, E., and Heineman, G. (2004). Joining punctuated streams. In Proceedings of the 2004 International Conference on Extending Database Technology, Heraklion, Crete, Greece.

[18] Ding, L. and Rundensteiner, E. A. (2004). Evaluating window joins over punctuated streams. In Proceedings of the 2004 International Conference on Information and Knowledge Management, pages 98–107, Washington DC, USA.

[19] Dingel, J. and Strom, R., editors (2005). Proceedings of the 2005 International Workshop on Distributed Event Based Systems, Columbus, Ohio, USA.

[20] Dittrich, J.-P., Seeger, B., Taylor, D. S., and Widmayer, P. (2002). Progressive merge join: A generic and non-blocking sort-based join algorithm. In Proceedings of the 2002 International Conference on Very Large Data Bases, pages 299–310, Hong Kong, China.

[21] Garcia-Molina, H., Labio, W., and Yang, J. (1998). Expiring data in a warehouse. In Proceedings of the 1998 International Conference on Very Large Data Bases, pages 500–511, New York City, New York, USA.

[22] Golab, L., Garg, S., and Özsu, T. (2004). On indexing sliding windows over on-line data streams. In Proceedings of the 2004 International Conference on Extending Database Technology, Heraklion, Crete, Greece.

[23] Golab, L. and Özsu, M. T. (2003). Processing sliding window multi-joins in continuous queries over data streams. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 500–511, Berlin, Germany.

[24] Golab, L. and Özsu, M. T. (2005). Update-pattern-aware modeling and processing of continuous queries. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 658–669, Baltimore, Maryland, USA.

[25] Gupta, A. and Mumick, I. S., editors (1999). Materialized Views: Techniques, Implementations, and Applications. MIT Press.

[26] Halevy, A. Y. (2001). Answering queries using views: A survey. The VLDB Journal, 10(4):270–294.

[27] Hammad, M. A., Aref, W. G., and Elmagarmid, A. K. (2003a). Stream window join: Tracking moving objects in sensor-network databases. In Proceedings of the 2003 International Conference on Scientific and Statistical Database Management, pages 75–84, Cambridge, Massachusetts, USA.

[28] Hammad, M. A., Franklin, M. J., Aref, W. G., and Elmagarmid, A. K. (2003b). Scheduling for shared window joins over data streams. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 297–308, Berlin, Germany.

[29] Ives, Z. G., Florescu, D., Friedman, M., Levy, A. Y., and Weld, D. S. (1999). An adaptive query execution system for data integration. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 299–310, Philadelphia, Pennsylvania, USA.

[30] Kang, J., Naughton, J. F., and Viglas, S. (2003). Evaluating window joins over unbounded streams. In Proceedings of the 2003 International Conference on Data Engineering, pages 341–352, Bangalore, India.

[31] Liu, B. and Rundensteiner, E. A. (2005). Cost-driven general join view maintenance over distributed data sources. In Proceedings of the 2005 International Conference on Data Engineering, pages 578–579, Tokyo, Japan.

[32] Madden, S., Shah, M. A., Hellerstein, J. M., and Raman, V. (2002). Continuously adaptive continuous queries over streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, Wisconsin, USA.

[33] Mokbel, M. F., Lu, M., and Aref, W. G. (2004). Hash-merge join: A non-blocking join algorithm for producing fast and early join results. In Proceedings of the 2004 International Conference on Data Engineering, pages 251–263, Boston, Massachusetts, USA.

[34] Olken, F. (1993). Random Sampling from Databases. PhD thesis, University of California at Berkeley.

[35] Quass, D., Gupta, A., Mumick, I. S., and Widom, J. (1996). Making views self-maintainable for data warehousing. In Proceedings of the 1996 International Conference on Parallel and Distributed Information Systems, pages 158–169, Miami Beach, Florida, USA.

[36] Raman, V., Deshpande, A., and Hellerstein, J. M. (2003). Using state modules for adaptive query processing. In Proceedings of the 2003 International Conference on Data Engineering, pages 353–364, Bangalore, India.

[37] Srivastava, U. and Widom, J. (2004). Memory-limited execution of windowed stream joins. In Proceedings of the 2004 International Conference on Very Large Data Bases, pages 324–335, Toronto, Canada.

[38] Tao, Y., Yiu, M. L., Papadias, D., Hadjieleftheriou, M., and Mamoulis, N. (2005). RPJ: Producing fast join results on streams through rate-based optimization. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 371–382, Baltimore, Maryland, USA.

[39] Tatbul, N., Cetintemel, U., Zdonik, S. B., Cherniack, M., and Stonebraker, M. (2003). Load shedding in a data stream manager. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 309–320, Berlin, Germany.

[40] Tucker, P. A., Maier, D., Sheard, T., and Fegaras, L. (2003). Exploiting punctuation semantics in continuous data streams. IEEE Transactions on Knowledge and Data Engineering, 15(3):555–568.

[41] Urhan, T. and Franklin, M. J. (2001). Dynamic pipeline scheduling for improving interactive query performance. In Proceedings of the 2001 International Conference on Very Large Data Bases, pages 501–510, Roma, Italy.

[42] Viglas, S. D. and Naughton, J. F. (2002). Rate-based query optimization for streaming information sources. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 37–48, Madison, Wisconsin, USA.

[43] Viglas, S. D., Naughton, J. F., and Burger, J. (2003). Maximizing the output rate of multi-way join queries over streaming information sources. In Proceedings of the 2003 International Conference on Very Large Data Bases, pages 285–296, Berlin, Germany.

[44] Wilschut, A. N. and Apers, P. M. G. (1991). Dataflow query execution in a parallel main-memory environment. In Proceedings of the 1991 International Conference on Parallel and Distributed Information Systems, pages 68–77, Miami Beach, Florida, USA.

[45] Xie, J., Yang, J., and Chen, Y. (2005). On joining and caching stochastic streams. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 359–370, Baltimore, Maryland, USA.

[46] Zhuge, Y., Garcia-Molina, H., and Wiener, J. L. (1998). Consistency algorithms for multi-source warehouse view maintenance. Distributed and Parallel Databases, 6(1):7–40.


Chapter 11

INDEXING AND QUERYING DATA STREAMS

Ahmet Bulut
Citrix Online Division
5385 Hollister Avenue
Santa Barbara, CA 93111

[email protected]

Ambuj K. Singh
Department of Computer Science
University of California Santa Barbara
Santa Barbara, CA 93106

[email protected]

Abstract Online monitoring of data streams poses a challenge in many data-centric applications, including network traffic management, trend analysis, web-click streams, intrusion detection, and sensor networks. Indexing techniques used in these applications have to be time and space efficient while providing a high quality of answers to user queries: (1) queries that monitor aggregates, such as finding surprising levels ("volatility" of a data stream) and detecting bursts, and (2) queries that monitor trends, such as detecting correlations and finding similar patterns. Data stream indexing becomes an even more challenging task when we take into account the dynamic nature of the underlying raw data. For example, bursts of events can occur at variable temporal modalities from hours to days to weeks. We focus on a multi-resolution indexing architecture. The architecture enables the discovery of "interesting" behavior online, provides flexibility in user query definitions, and interconnects registered queries for real-time and in-depth analysis.

Keywords: stream indexing, monitoring real-time systems, mining continuous data flows, multi-resolution index, synopsis maintenance, trend analysis, network traffic analysis.


1. Introduction

Raw stream data, such as faults and alarms generated by network traffic monitors and log records generated by web servers, are almost always at a low level and too large to maintain in main memory. One can instead summarize the data and compute synopsis structures at meaningful abstraction levels on the fly. The synopsis is a small-space data structure, and can be updated incrementally as new stream values arrive. Later in the operational cycle, it can be used to discover interesting behavior, which prompts in-depth analysis at lower levels of abstraction [10].

Consider the following application in astrophysics: the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. This yields an unusually high number of detectable events (high-energy photons) over a certain time period, which indicates the existence of a Gamma Ray Burst. If we know the duration of the shower, we can maintain a count on the total number of events over sliding windows of the known window size and raise an alarm when the moving sum is above a threshold. Unfortunately, in many cases, we cannot predict the duration of the burst period. The burst of high-energy photons might last for a few milliseconds, a few hours, or even a few days [31].

Finding similar patterns in a time series database is a well-studied problem [1, 13]. The features of a time series sequence are extracted using a sliding window, and inserted into an index structure for query efficiency. However, such an approach is not adequate for data stream applications, since it requires a time-consuming feature extraction step with each incoming data item. For this purpose, incremental feature extraction techniques that use the previous feature in computing the new feature have been proposed to accelerate per-item processing [30]. A batch technique can further decrease the per-item processing cost by computing a new feature periodically instead of every time unit [22]. A majority of these techniques assume a priori knowledge on query patterns. However, in a real-world situation, a user might want to know all time periods during which the movement of a particular stock follows a certain interesting trend, which itself can be generated automatically by a particular application [26]. In order to address this issue, a multi-resolution indexing scheme has been proposed [16]. This work addresses off-line time series databases, and does not consider how well the proposed scheme extends to a real-time streaming algorithm.

Continuous queries that run indefinitely, unless a query lifetime has been specified, fit naturally into the mold of data stream applications. Examples of these queries include monitoring a set of conditions or events to occur, detecting a certain trend in the underlying raw data, or, in general, discovering relations between various components of a large real-time system. The kinds of queries that are of interest from an application point of view can be listed as follows: (1) monitoring aggregates, (2) monitoring or finding patterns, and (3) detecting correlations. Each of these queries requires data management over some history of values, and not just over the most recently reported values [9]. For example, in the case of aggregate queries, the system monitors whether the current window aggregate deviates significantly from that aggregate in most time periods of the same size. In the case of correlation queries, the self-similar nature of sensor measurements may be reflected as temporal correlations at some resolution over the course of the stream [24]. Therefore, the system has to maintain historical data along with the current data in order to be able to answer these queries.

A key vision in developing stream management systems of practical value is to interconnect queries in a monitoring infrastructure. For example, an unusual volatility of a stream may trigger an in-depth trend analysis. Unified system solutions can lay the ground for tomorrow's information infrastructures by providing users with a rich set of interconnected querying capabilities [8].

2. Indexing Streams

In this section, we introduce a multi-resolution indexing architecture, and then later, in Section 3, show how it can be utilized to monitor user queries efficiently. The multi-resolution approach imposes an inherent restriction on what constitutes a meaningful query. The core part of the scheme is the feature extraction at multiple resolutions. A dynamic index structure is used to index features for query efficiency. The system architecture is shown in Figure 11.1. The key architecture aspects are:

The features at higher resolutions are computed using the features at lower resolutions; therefore, all features are computed in a single pass.

The system guarantees the accuracy provided to user queries by provable error bounds.

The index structure has tunable parameters to trade accuracy for speed and space. The per-item processing cost and the space overhead can be tuned according to the application requirements by varying the update rate and the number of coefficients maintained in the index structure.

2.1 Preliminaries and definitions

We adopt the use of x[i] to refer to the i-th entry of stream x, and x[i_1 : i_2] to refer to the subsequence of entries at positions i_1 through i_2.

Definition 2.1 A feature is the result of applying a characteristic function over a possibly normalized set of stream values in order to acquire higher-level information or a concept.



Figure 11.1. The system architecture for a multi-resolution index structure consisting of 3 levels and stream-specific auto-regressive (AR) models for capturing multi-resolution trends in the data.

The widely used characteristic functions are (1) aggregate functions, such as summation, maximum, minimum, and average, (2) orthogonal transformations, such as the discrete wavelet transform (DWT) and the discrete Fourier transform (DFT), and (3) piecewise linear approximations. Normalization is performed in the case of DWT, DFT, and linear approximations. The interested reader can refer to Sections 3.2 and 3.3 for more details.

2.2 Feature extraction

The features at a specific resolution are obtained with a sliding window of a fixed length w. The sliding window size doubles as we go up a resolution, i.e., a level. In the rest of the paper, we will use the terms "level" and "resolution" interchangeably. We denote a newly computed feature at resolution i as F_i. Figure 11.2 shows an example where we have three resolutions with corresponding sliding window sizes of 2, 4 and 8. With each arrival of a new stream value, features F_0, F_1, and F_2, i.e., one for each resolution, can be computed. However, this requires maintaining all the stream values within a time window equal to the size of the largest sliding window, i.e., 8 in our running example. The per-item processing cost and the space required are linear in the size of the largest window [16].


Figure 11.2. Exact feature extraction, update rate T = 1.

For a given window w of values y = x[t−w+1], . . . , x[t], an incremental transformation F(y) is used to compute features. The type of transformation F depends on the monitoring query. For example, F is SUM for burst detection, MAX−MIN for volatility detection, and DWT for detecting correlations and finding surprising patterns. For most real time series, the first f (f << w) DWT coefficients retain most of the energy of the signal. Therefore, we can safely disregard all but the very first few coefficients to retain the salient features (e.g., the overall trend) of the original signal.


Figure 11.3. Incremental feature extraction, update rate T = 1.

Using an incremental transformation leads to a more efficient way of computing features at all resolutions. Level-1 features are computed using level-0 features, and level-2 features are computed using level-1 features. In general, we can use lower-level features to compute higher-level features [3]. Figure 11.3 depicts this new way of computation. This new algorithm has a lower per-item processing cost, since we can compute F_1 and F_2 in constant time. The following lemma establishes this result.

Lemma 11.1 The new feature F_j at level j for the subsequence x[t−w+1 : t] can be computed "exactly" using the features F'_{j−1} and F_{j−1} at level j−1 for the subsequences x[t−w+1 : t−w/2] and x[t−w/2+1 : t] respectively.

Proof F_j is max(F'_{j−1}, F_{j−1}), min(F'_{j−1}, F_{j−1}), and F'_{j−1} + F_{j−1} for MAX, MIN, and SUM respectively. For DWT, see Lemma 11.4 in Section 2.4.
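For SUM (and, with a different combining operator, MAX and MIN), the bottom-up computation of Lemma 11.1 can be sketched in a few lines of Python; the window size and level count are assumed values, and for clarity the windows here tumble rather than slide.

    W, J = 2, 3                          # base window size and number of levels (assumed values)

    def level0_features(x):
        """SUM features over consecutive windows of size W at the lowest resolution."""
        return [sum(x[i:i + W]) for i in range(0, len(x) - W + 1, W)]

    def lift(features, combine=lambda a, b: a + b):
        """Lemma 11.1: a level-j feature is combined from two adjacent level-(j-1) features."""
        return [combine(features[i], features[i + 1]) for i in range(0, len(features) - 1, 2)]

    x = [3, 1, 4, 1, 5, 9, 2, 6]
    levels = [level0_features(x)]        # level 0: window size W
    for _ in range(J - 1):
        levels.append(lift(levels[-1]))  # each additional level doubles the window size
    print(levels)                        # [[4, 5, 14, 8], [9, 22], [31]]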

However, the space required for this scheme is also linear in the size of the largest window. The reason is that we need to maintain half of the features at the lower level to compute the feature at the upper level incrementally. If we can trade accuracy for space, then we can decrease the space overhead by computing features approximately. At each resolution level, every c of the feature vectors are combined in a box, or in other words, a minimum bounding rectangle (MBR). Figure 11.4 depicts this scheme for c = 2. Since each MBR B contains c features, it has an extent along each dimension. In the case of SUM, B[1] corresponds to the smallest sum, and B[2] corresponds to the largest sum among all c sums. In general, B[2i] denotes the low coordinate and B[2i+1] denotes the high coordinate along the i-th dimension. Note that for SUM, MAX and MIN, B has a single dimension. However, for DWT the number of dimensions f is application dependent.


Figure 11.4. Approximate feature extraction, update rate T = 1.


This new approach decreases the space overhead by a factor of c. Since the extent information of the MBRs is used in the computation, the newly computed feature will also be an extent. The following lemma proves this result.

Lemma 11.2 The new feature F_j at level j can be computed "approximately" using the MBRs B_1 and B_2 that contain the features F'_{j−1} and F_{j−1} at level j−1 respectively.

Proof For MAX, MIN, SUM, and DWT respectively:

$$\max(B_1[1], B_2[1]) \le F_j \le \max(B_1[2], B_2[2])$$

$$\min(B_1[1], B_2[1]) \le F_j \le \min(B_1[2], B_2[2])$$

$$B_1[1] + B_2[1] \le F_j \le B_1[2] + B_2[2]$$

and, for the DWT case, see Lemma 11.5 in Section 2.4.
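For the SUM case of Lemma 11.2, the interval computation reduces to the short sketch below, where each MBR is represented simply as a (low, high) pair along the single SUM dimension; the numeric values are illustrative.

    def combine_sum_mbrs(b1, b2):
        """Lemma 11.2 for SUM: bound the level-j feature by the extents of two level-(j-1) MBRs."""
        return (b1[0] + b2[0], b1[1] + b2[1])   # (smallest possible sum, largest possible sum)

    b1, b2 = (4.0, 5.0), (8.0, 14.0)            # each MBR summarizes c consecutive SUM features
    print(combine_sum_mbrs(b1, b2))             # (12.0, 19.0): the true level-j SUM lies in this range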

Using MBRs instead of individual features exploits the fact that there is a strong spatio-temporal correlation between consecutive features. Therefore, it is natural to extend the computation scheme to eliminate this redundancy. Instead of computing a new feature at each data arrival, one can employ a batch computation such that a new feature is computed periodically, at every T time units. This allows us to maintain features instead of MBRs. Figure 11.5 shows this scheme with T = 2. The new scheme has a clear advantage in terms of accuracy; however, it can dismiss potentially interesting events that may occur between the periods.


Figure 11.5. Incremental feature extraction, update rate T = 2.

Depending on the box capacity c and the update rate T_j at a given level j (the rate at which we compute a new feature), there are two general feature computation algorithms:


Online algorithm: The update rate T_j is equal to 1. The box capacity c is variable. It is used for aggregate monitoring queries.

Batch algorithm: The update rate T_j is greater than 1, i.e., T_j > 1. The box capacity is set to c = 1. The setting T_j = W can be used for finding surprising patterns and detecting correlations. SWAT is a batch algorithm with T_j = 2^j [7].

We establish the time and space complexity of a given algorithm in terms of c and T_j in the following theorem. Assume that W denotes the sliding window size at the lowest resolution j = 0.

Theorem 11.3 F_j at level j for a stream can be computed incrementally in constant time and in space Θ(2^{j−1}W / (c T_{j−1})).

Proof F_j at level j is computed using the features at level j−1 in constant time, as shown in Lemmas 11.1 and 11.2. The number of features that need to be maintained at level j−1 for incremental computation at level j is 2^{j−1}W. Therefore, depending on the box capacity and update rate, the space complexity at level j−1 is Θ(2^{j−1}W / (c T_{j−1})).

2.3 Index maintenance

As new values stream in, new features are computed and inserted into the corresponding index structures, while features that are out of the history of interest are deleted to save space. Coefficients are computed at multiple resolutions starting from level 0 up to a configurable level J: at each level a sliding window is used to extract the appropriate features. Computation of features at higher levels is accelerated using the MBRs at lower levels. The MBRs belonging to a specific stream are threaded together in order to provide sequential access to the summary information about the stream. This approach results in a constant retrieval time of the MBRs. The complete algorithm is shown in Algorithm 1.

Features at a given level are maintained in a high-dimensional index structure. The index combines information from all the streams, and provides a scalable access medium for answering queries over multiple data streams. However, each MBR inserted into the index is specific to a single stream. The R*-Tree family of index structures is used for indexing MBRs at each level [5]. An R*-Tree, a variant of the R-Tree [15], is a spatial access method which splits the feature space into hierarchically nested, possibly overlapping MBRs. In order to support frequent updates, the techniques for predicting MBR boundaries outlined in [20] can be used to decrease the cost of index maintenance.


Algorithm 1 ComputeCoefficients(Stream S)

Require: B^S_{j,i} denotes the i-th MBR at level j for stream S.

    begin procedure
      w := W (the window size at the lowest resolution);
      t_now denotes the current discrete time;
      for j := 0 to J do
        B^S_{j,i} := the current MBR at level j for stream S;
        if j = 0 then
          y := S[t_now − w + 1 : t_now];
          normalize y if F = DWT;
          F_j := F(y);
        else
          find the MBR B^S_{j−1,i_1} that contains the feature for the subsequence S[t_now − w + 1 : t_now − w/2];
          find the MBR B^S_{j−1,i_2} that contains the feature for the subsequence S[t_now − w/2 + 1 : t_now];
          F_j := F(B^S_{j−1,i_1}, B^S_{j−1,i_2});
        end if
        if the number of features in B^S_{j,i} < c (the box capacity) then
          insert F_j into B^S_{j,i};
        else
          insert B^S_{j,i} into the index at level j;
          start a new MBR B^S_{j,i+1};
          insert F_j into B^S_{j,i+1};
        end if
        adjust the sliding window size to w := w * 2;
      end for
    end procedure


2.4 Discrete Wavelet Transform

The approximation coefficients are defined through the inner product of the input signal with φ_{j,k}, the shifted and dilated versions of a low-pass scaling function φ_0. In the same vein, the detail coefficients are defined through the inner product of the input signal with ψ_{j,k}, the shifted and dilated versions of the wavelet basis function ψ_0.

$$\phi_{j,k}(t) = 2^{-j/2}\,\phi_0(2^{-j}t - k), \qquad j, k \in \mathbb{Z} \qquad (11.1)$$

$$\psi_{j,k}(t) = 2^{-j/2}\,\psi_0(2^{-j}t - k), \qquad j, k \in \mathbb{Z} \qquad (11.2)$$

We show how to compute the approximation coefficients. Detail coefficients at level j are computed using approximation coefficients at level j−1. Using Equation 11.1, the approximation signal at level j for the signal x is obtained by

$$A(x)_j = \sum_k \langle x, \phi_{j,k}\rangle\, \phi_{j,k}$$

In the same manner, the approximation signal at level j+1 for x is

$$A(x)_{j+1} = \sum_k \langle x, \phi_{j+1,k}\rangle\, \phi_{j+1,k}$$

To compute A(x)_{j+1}, we need to compute the coefficients ⟨x, φ_{j+1,n}⟩. Using the twin-scale relation for φ, we can compute ⟨x, φ_{j+1,n}⟩ from ⟨x, φ_{j,k}⟩ [21]. This can mathematically be expressed as

$$\langle x, \phi_{j+1,n}\rangle = \sum_k h_{k-2n}\, \langle x, \phi_{j,k}\rangle \qquad (11.3)$$

$$C_{j,n} = \sum_k \bar{h}_{n-k}\, \langle x, \phi_{j,k}\rangle \qquad (11.4)$$

$$\langle x, \phi_{j+1,n}\rangle = C_{j,2n} \qquad (11.5)$$

where h_k and \bar{h} are the low-pass reconstruction and decomposition filters respectively. Note that the terms "approximation signal" and "approximation coefficients" are used interchangeably.

Lemma 11.4 The approximation coefficients at level j, 1 ≤ j ≤ J, for a signal x[t−w+1 : t] can be computed exactly using the approximation coefficients at level j−1 for the signals x[t−w+1 : t−w/2] and x[t−w/2+1 : t].

Proof Let x, x_1 and x_2 denote the signals x[t−w+1 : t], x[t−w+1 : t−w/2], and x[t−w/2+1 : t] respectively. At a particular scale j−1, the shape of the wavelet scaling function φ_{j−1,0} is kept the same, while it is translated to obtain the wavelet family members at different positions k. Since x_1 and x_2 constitute the two halves of the whole signal x, the coefficient ⟨x, φ_{j−1,k}⟩ is computed using ⟨x_1, φ_{j−1,k}⟩ and ⟨x_2, φ_{j−1,k}⟩ as follows:

$$x'_1[n] = \begin{cases} x_1[n], & 1 \le n \le w/2 \\ 0, & w/2+1 \le n \le w \end{cases} \qquad x'_2[n] = \begin{cases} 0, & 1 \le n \le w/2 \\ x_2[n - w/2], & w/2+1 \le n \le w \end{cases}$$

$$\langle x, \phi_{j-1,k}\rangle = \langle x'_1 + x'_2, \phi_{j-1,k}\rangle = \langle x'_1, \phi_{j-1,k}\rangle + \langle x'_2, \phi_{j-1,k}\rangle$$

This result is due to the linearity of the wavelet transformation. Using Equations 11.4, 11.5, and the coefficients ⟨x, φ_{j−1,k}⟩, we can obtain the approximation signal A(x)_j at level j for x.
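For Haar wavelets the twin-scale relation is particularly simple, and the following sketch illustrates Lemma 11.4 in that special case only: the level-j approximation coefficients of a window are recovered from the level-(j−1) coefficients of its two halves. The filter choice and the numeric example are assumptions made purely for illustration.

    import math

    def haar_approx(x):
        """One level of Haar approximation coefficients: pairwise sums scaled by 1/sqrt(2)."""
        return [(x[i] + x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]

    x1, x2 = [3.0, 1.0, 4.0, 1.0], [5.0, 9.0, 2.0, 6.0]   # two halves of a window of size 8
    a1, a2 = haar_approx(x1), haar_approx(x2)             # level j-1 coefficients of each half

    # Lemma 11.4 (Haar case): the level-j coefficients of the whole window are obtained by
    # concatenating the halves' level-(j-1) coefficients and applying one more Haar step.
    a_whole = haar_approx(a1 + a2)
    reference = haar_approx(haar_approx(x1 + x2))         # direct two-level computation on the window
    assert all(abs(u - v) < 1e-9 for u, v in zip(a_whole, reference))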


Figure 11.6. Transforming an MBR using the discrete wavelet transform. The transformation corresponds to rotating the axes (the rotation angle is 45 degrees for Haar wavelets).

Lemma 11.5 One can compute approximation coefficients on a hyper-rectangle B ∈ ℜ^{f'} with low coordinates [x_{l_1}, . . . , x_{l_{f'}}] and high coordinates [x_{h_1}, . . . , x_{h_{f'}}].

Proof The most recent approximation coefficients for a sliding window of values x at a given resolution j can be computed on the extent information of the MBRs B_{j−1,i_1} and B_{j−1,i_2} in ℜ^f at level j−1 that contain the coefficients of the corresponding two halves of x. These MBRs are merged together using Lemma 11.4 to get an MBR B in ℜ^{f'}, where f' is larger than f (e.g., f' is 2f for Haar wavelets). The MBR B approximates the coefficients at level j−1 for x. First compute the coefficients for each one of the 2^{f'} corners of B, and find the tightest MBR A(B) in ℜ^f that encloses the resulting 2^{f'} coefficients in ℜ^f. The coefficients at level j for x, i.e., the feature A(x)_j, lie inside the MBR A(B).

This is true for any unitary transformation, such as the wavelet transformation, that rotates the axes as shown in Figure 11.6. This algorithm has a processing time of Θ(2^{f'} f), where f and f' are constant for a specific application.

Error bound. The wavelet transformation corresponds to a rotation of the axes in the original space. An input MBR B in the original space is transformed to a new shape S in the feature space (see Figure 11.6). The resulting shape S is projected on each dimension in the feature space, and the tightest MBR A(B) that encloses S is identified. The MBR A(B) contains the feature A(x). The volume of A(B) is a function of the projection along each dimension. Since the wavelet transformation is a distance-preserving transformation, the length along each dimension can be at most two times the original length.

3. Querying Streams

In this section, monitoring queries that are important from an application point of view, such as deviant aggregates, interesting patterns, and trend correlations, are presented.

3.1 Monitoring an aggregate query

In this class of queries, aggregates of data streams are monitored over a set of time intervals [31]: "Report all occurrences of Gamma Ray bursts from a timescale of minutes to a timescale of days". Formally, given a bounded window size w, an aggregate function F, and a threshold τ associated with the window, the goal is to report all those time instances such that the aggregate applied to the subsequence x[t−w+1 : t] exceeds the corresponding window threshold, i.e., check if

$$F(x[t-w+1 : t]) \ge \tau \qquad (11.6)$$

where t denotes the current time. The threshold values τ can either be specified as part of the input, or they can be determined using historical data.
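As a concrete illustration of Equation 11.6 with F = SUM, the sketch below maintains the sliding-window sum incrementally and reports the time instances at which it reaches the threshold; the function name and the sample values are assumptions made for illustration.

    from collections import deque

    def monitor_sum(stream, w, tau):
        """Report every time instance at which SUM over the last w values reaches the threshold tau."""
        window, total, alarms = deque(), 0.0, []
        for t, v in enumerate(stream):
            window.append(v)
            total += v
            if len(window) > w:
                total -= window.popleft()      # slide the window incrementally
            if len(window) == w and total >= tau:
                alarms.append(t)               # Equation 11.6 holds at time t
        return alarms

    print(monitor_sum([1, 0, 2, 5, 1, 0, 7, 1], w=3, tau=8))   # [4, 6, 7]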

The algorithm. Assume that the query window size is a multiple of W. An aggregate query with window size w and threshold τ is answered by first partitioning the window into multiple sub-windows w_1, w_2, . . . , w_n such that 0 ≤ j_1 < . . . < j_i < j_{i+1} < . . . < j_n ≤ J, and w_i = W 2^{j_i}. For a given window of length bW, the partitioning corresponds to the ones in the binary representation of b, such that $\sum_{i=1}^{n} 2^{j_i} = b$. The current aggregate over a window of size w is computed using the sub-aggregates for the sub-windows in the partitioning. Assume that W = 2 and c = 2. Consider a query window w = 26. The binary representation of b = 13 is 1101, and therefore the query is partitioned into three sub-windows w_0 = 2, w_2 = 8, and w_3 = 16. Figure 11.7 shows the decomposition of the query and the composition of the aggregate together. The current aggregate over a window of size w = 26 is approximated using the extents of MBRs that contain the corresponding sub-aggregates. The computation is approximate in the sense that the algorithm returns an interval F such that the upper coordinate F[2] is always greater than or equal to the true aggregate. If F[2] is larger than the threshold τ, the most recent subsequence of length w is retrieved, and the true aggregate is computed. If this value exceeds the query threshold, an alarm is raised. The complete algorithm is shown in Algorithm 2.
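The partitioning step itself follows directly from the binary representation of b, as the short sketch below shows; the function name and the default value of W are illustrative assumptions, and the w = 26 example from the text is reproduced.

    def decompose(w, W=2):
        """Split a query window w = b*W into sub-windows W * 2^j, one per set bit of b."""
        b, subs, j = w // W, [], 0
        while b:
            if b & 1:
                subs.append(W * (1 << j))   # this sub-window is answered by the level-j index
            b >>= 1
            j += 1
        return subs

    print(decompose(26))   # [2, 8, 16]: b = 13 = 0b1101, so levels 0, 2 and 3 are used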


Figure 11.7. Aggregate query decomposition and approximation composition for a query window of size w = 26.

Algorithm 2 AggregateQuery(Stream S, Window w, Threshold τ)

    begin procedure
      initialize t to t_now, the current discrete time;
      partition w into n parts as w_1, w_2, . . . , w_n;
      initialize the aggregate F;
      for i := 1 to n do
        find the resolution level j such that w_i = W 2^j;
        let MBR B contain the feature on S[t − w_i + 1 : t];
        merge the sub-aggregate B into F := F(B, F);
        adjust the offset to t := t − w_i for the next sub-window;
      end for
      if τ ≤ F[2] then
        retrieve S[t_now − w + 1 : t_now];
        if τ ≤ F(S[t_now − w + 1 : t_now]) then
          raise an alarm;
        end if
      end if
    end procedure

The accuracy. The false alarm rate of this approximation is quantified as follows: assume that bursts of events are monitored, i.e., F is SUM. Let X denote the sum within the sliding window w = bW. If the threshold τ is set to µ_X(1 − Φ(p)), the inequation

$$\Pr\left(\frac{X - \mu_X}{\mu_X} \ge \frac{\tau - \mu_X}{\mu_X}\right) \le p \qquad (11.7)$$

holds for a given sufficiently small p, where Φ denotes the normal cumulative distribution function. Monitor the burst based on windows of size Tw such that 1 ≤ T < 2, where 2^{j−1}W < w ≤ 2^j W. This approach corresponds to monitoring the burst via one of the levels in the index structure [31]. Let Z denote the sum within the sliding window Tw. We assume that

$$\frac{Z - \mu_Z}{\mu_Z} \sim \mathrm{Norm}(0, 1) \qquad (11.8)$$

Assuming µ_Z = Tµ(X), the false alarm rate is equal to Pr(Z > τ), which implies

$$\Pr\left(\frac{Z - T\mu(X)}{T\mu(X)} \ge \frac{\tau - T\mu(X)}{T\mu(X)}\right) = \Phi\left(1 - \frac{1 - \Phi^{-1}(p)}{T}\right) \qquad (11.9)$$

According to Equation 11.9, for a fixed value of p, the smaller T is, the smaller is the false alarm rate. If sub-aggregates for sub-windows w_1, w_2, . . . , w_n are used for computing the final aggregate on a given query window of size w and threshold τ, a smaller T can be achieved. The sub-aggregate for sub-window w_i is stored in an MBR at level j_i. An MBR at level j_i corresponds to a monitoring window of size 2^{j_i}W + c − 1. Then, effectively, a burst is monitored using a window of size bW + log b · (c − 1) such that:

$$T' = \frac{bW + \log b \cdot (c-1)}{bW} = 1 + \frac{\log b \cdot (c-1)}{bW} \qquad (11.10)$$

where T' decreases with increasing b. For example, for c = W = 64 and b = 12, we have T' = 1.2987 and T = 1.3333. This implies that the sub-window sum approximation reduces the false alarm rate to a minimal amount, with the optimal being at T' = 1. In fact, the optimal is reached with c = 1. However, the space consumption in this case is much larger.

3.2 Monitoring a pattern query

In this class of queries, a pattern database is continuously monitored over dynamic data streams: "Identify all temperature sensors in a weather monitoring sensornet that currently exhibit an interesting trend". Formally, given a query sequence Q and a threshold value r, find the set of streams that are within distance r to the query sequence Q. The distance measure we adopt is the Euclidean distance (L_2) between the corresponding normalized sequences. We normalize a window of values x[1], . . . , x[w] as follows:

$$\hat{x}[i] = \frac{x[i]}{\sqrt{w}\, R_{\max}}, \qquad i = 1, \ldots, w \qquad (11.11)$$

thereby mapping it to the unit hyper-sphere. We establish the notion of similarity between sequences as follows: a stream sequence x and a query sequence Q are considered to be r-similar if

$$L_2(\hat{x}, \hat{Q}) = \sqrt{\sum_{i=0}^{N-1} (\hat{x}_i - \hat{Q}_i)^2} \le r \qquad (11.12)$$
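A small sketch of this normalization and the r-similarity test follows, under the assumption that R_max bounds the absolute values in the window; the sample data are invented for illustration.

    import math

    def normalize(window, r_max):
        """Map a window of raw values into the unit hyper-sphere (Equation 11.11)."""
        scale = math.sqrt(len(window)) * r_max
        return [v / scale for v in window]

    def is_r_similar(x_hat, q_hat, r):
        """Equation 11.12: the Euclidean distance between the normalized sequences is at most r."""
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_hat, q_hat)))
        return dist <= r

    x, q, r_max = [1.0, 2.0, 3.0, 2.0], [1.5, 2.0, 2.5, 2.0], 3.0
    print(is_r_similar(normalize(x, r_max), normalize(q, r_max), r=0.2))   # True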

The online algorithm. Given a query sequence Q and a threshold value r, partition Q into multiple sub-queries Q_1, Q_2, . . . , Q_n such that 0 ≤ j_1 < . . . < j_i < j_{i+1} < . . . < j_n ≤ J, and |Q_i| = W 2^{j_i}. Assume that the first sub-query Q_1 has resolution j_1. A range query with radius r is performed on the index constructed at resolution j_1. The initial candidate box set R is refined using the hierarchical radius optimization proposed in [16]. Briefly, for each MBR B ∈ R, this technique is used to refine the original radius r to r' = √(r² − d_min(Q_1, B)²) for the next sub-query Q_2, where d_min(p, B) for a point p and an MBR B is defined as the minimum Euclidean distance of the query point p to the MBR B [25]. The same procedure is applied recursively until the last sub-query Q_n is processed, resulting in a final set of MBRs C to check for true matches. The complete algorithm is shown in Algorithm 3.

Algorithm 3 PatternQueryOnline(Query Q)
1: begin procedure
2:   partition Q into n parts as Q_1, Q_2, ..., Q_n;
3:   find the resolution level j_1 such that |Q_1| = W 2^{j_1};
4:   R := RangeQuery(Index_{j_1}, DWT(Q_1), Q.r);
5:   C := HierarchicalRadiusRefinement(R, Q);
6:   post-process C to discard false alarms;
7: end procedure

The batch algorithm. Let the update rate for each index level j be T_j = W. The stream is divided into W-step sliding windows of size w. Let |S| denote the size of the stream. Then, there are ⌊(|S| − w + 1)/W⌋ such windows. Given a query sequence Q, W-many prefixes of size w are extracted as Q[0 : w−1], Q[1 : w], ..., Q[W−1 : w+W−1]. Each prefix query is used to identify potential candidates. To clarify the ensuing development, note that a single prefix query would suffice if an online algorithm with T_j = 1 were used for index construction. This approach is similar to the technique proposed in [22], where a single-resolution index is constructed using a sliding window of maximum allowable size w that satisfies 1 ≤ ⌊(min(Q) − W + 1)/w⌋. Note that min(Q) is the a priori information regarding the minimum query length. However, in a multi-resolution index, a given query can be answered using any index at resolution j that satisfies 1 ≤ ⌊(|Q| − W + 1)/(2^j W)⌋. The accuracy of this multi-resolution search algorithm can be improved by extracting disjoint windows along with each prefix in order to refine the original query radius using a multi-piece search technique [13]. The number of such disjoint windows is at most p = ⌊(|Q| − W + 1)/w⌋. We illustrate these concepts on a query window of size |Q| = 9, as shown in Figure 11.8, where J = 1 and W = 2. The prefixes are shown as i = 0 and i = 1 along with the corresponding disjoint windows. Each feature extracted over Q is inserted into a query MBR B. The MBR B is extended in each dimension by a fraction of the query radius, i.e., r/√p. A range query is performed on the index at level j using B, and a set R of candidate features is retrieved. The set R is post-processed to discard false alarms. The complete algorithm is shown in Algorithm 4.

3.3 Monitoring a correlation query

In this class of queries, all stream pairs that are correlated within a user-specified threshold r at some level of abstraction are reported continuously. The correlation between two sequences x and y can be reduced to the Euclidean distance between their z-norms [30]. The z-norm of a sequence x[1], ..., x[w] is defined as follows:

x[i] = \frac{x[i] - \mu_x}{\sqrt{ \sum_{i=1}^{w} (x[i] - \mu_x)^2 }}, \qquad i = 1, \ldots, w    (11.13)

where µ_x is the arithmetic mean. The correlation coefficient between sequences x and y is then computed from the L_2 distance between x and y as 1 − L_2^2(x, y)/2.
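A short sketch of the z-norm and of the distance-to-correlation identity above (no assumptions beyond the equations; the final comment only notes the sanity check against numpy's own Pearson correlation):

```python
import numpy as np

def znorm(x):
    """z-normalise a window of values (Eq. 11.13)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return d / np.sqrt(np.sum(d ** 2))

x = np.random.rand(64)
y = np.random.rand(64)
l2 = np.linalg.norm(znorm(x) - znorm(y))
corr = 1.0 - l2 ** 2 / 2.0   # correlation from the L2 distance of the z-norms
# sanity check: corr agrees with np.corrcoef(x, y)[0, 1] up to floating-point error
```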


Figure 11.8. Subsequence query decomposition for a query window of size |Q| = 9. (The figure shows the incoming stream and the level-1 index, the two prefixes i = 0 and i = 1 with their disjoint sub-sequences of size w = 4, feature extraction at every 2nd time tick, and the resulting query MBR B.)

Algorithm 4 PatternQueryBatch(Query Q)
1: begin procedure
2:   find the largest level j such that 2^j W + W − 1 ≤ |Q|;
3:   initialize query MBR B to empty;
4:   let w be equal to 2^j W, the level-j sliding window size;
5:   for i := 0 to W − 1 do
6:     for k := 0 to ⌊(|Q| − i)/w⌋ do
7:       extract the k-th disjoint subsequence of the query
8:       sequence into y := Q[i + kw : i + (k + 1)w − 1];
9:       insert DWT(y) into MBR B;
10:    end for
11:  end for
12:  compute radius refinement factor p := ⌊(|Q| − W + 1)/w⌋;
13:  enlarge query MBR B by Q.r/√p;
14:  R := RangeQuery(Index_j, B);
15:  post-process R to discard false alarms;
16: end procedure

The algorithm. Whenever a new feature F_j of a stream S is computed at level j, a range query on the index at level j is performed with F_j as the query center and the radius set to the correlation threshold r. In a system with M synchronized streams, this involves the execution of O(M) range queries at every data arrival.

4. Related Work

Shifted-Wavelet Tree (SWT) is a wavelet-based summary structure proposed for detecting bursts of events on a stream of data [31]. For a given set of query windows w_1, w_2, ..., w_m such that 2^L W ≤ w_1 ≤ w_2 ≤ ... ≤ w_m ≤ 2^U W, SWT maintains U − L moving aggregates using a wavelet tree for incremental computation. A window w_i is monitored by the lowest level j, L ≤ j ≤ U, that satisfies w_i ≤ 2^j W. Therefore, associated with each level j, L ≤ j ≤ U, there is a threshold τ_j equal to the smallest of the thresholds of the windows w_{i_1}, ..., w_{i_j} monitored by that level. Whenever the moving sum at some level j exceeds the level threshold τ_j, all query windows associated with this level are checked using a brute-force approach.

MR-Index addresses variable-length queries over time series data [16]. Wavelets are used to extract features from a time series at multiple resolutions. At each resolution, a set of feature vectors is combined into an MBR and stored sequentially in the order they are computed. A given query is decomposed into multiple sub-queries such that each sub-query has a resolution corresponding to a resolution of the index. A given set of candidate MBRs is refined using each sub-query as a filter to prune out non-potential candidates.

Versions of piecewise constant approximation have been proposed for time-series similarity matching. Specifically, adaptive piecewise constant approximation (APCA) represents data regions with large fluctuations using several short segments, while data regions with fewer fluctuations are represented with fewer, longer segments [17]. An extension of this approximation allows error specification for each point in time [23]. The resulting approach can approximate data with fidelity proportional to its age. GeneralMatch, a refreshingly new idea in similarity matching, divides the data sequences into disjoint windows and the query sequence into sliding windows [22]. This approach is the dual of the conventional approaches, i.e., dividing the data sequence into sliding windows and the query sequence into disjoint windows. The overall framework is based on answering pattern queries using a single-resolution index built on a specific choice of window size. The allowed window size depends on the minimum query length, which has to be provided a priori, before the index construction.

Methods based on multivariate linear regression are considered for analyzing co-evolving time sequences [28]. For a given stream S, its current value (the dependent variable) is expressed as a linear combination of values of the same and other streams (the independent variables) under a sliding window model. Given v independent variables and a dependent variable y with N samples each, the model identifies the best b independent variables in order to compute the current value of the dependent variable in O(Nbv^2) time.

StatStream is a state-of-the-art system proposed for monitoring a large number of streams in real time [30]. It subdivides the history of a stream into a fixed number of basic windows and maintains DFT coefficients for each basic window. This allows a batch update of DFT coefficients over the entire history. It superimposes an orthogonal regular grid on the feature space, and partitions the space into cells of diameter r, the correlation threshold. Each stream is mapped to a number of cells in the feature space (exactly how many depends on the "lag time"), based on a subset of its DFT coefficients. It uses proximity in this feature space to report correlations [6].

5. Future Directions

We are witnessing the blurring of the traditional boundaries between Networks and Databases, especially in the emerging areas of sensor and peer-to-peer networks. Data stream processing in these application domains requires networked data management, solutions for which borrow ideas from both disciplines. We believe that researchers from these two communities should share their expertise, results, terminologies, and contributions. This exchange can promote ideas that will influence and foster continued research in the areas of sensor and peer-to-peer networks. The following three research avenues are promising future directions to pursue further: (1) distributed monitoring systems, (2) probabilistic modeling of sensor networks, and (3) publish-subscribe systems.

5.1 Distributed monitoring systems

In today's rapidly growing networks, data streams arrive at widely dispersed locations. Assume that a system administrator wants to analyze the local network traffic and requires access to data collections that are maintained at different locations. The rapid growth of such collections, fed by data streams, makes it virtually impossible to simply store a collection at every location where it is possibly queried. This prompts a need to design more scalable approaches for disseminating the information of a data stream. Each peer monitoring station characterizes its stream of data in terms of a model (signature) and transmits this information to a central site using an adaptive communication protocol. The abstraction levels of the signatures collected at the server can be quite different. A higher level corresponds to coarser statistics; therefore, it contains less representative information and incurs a smaller transmission cost. A lower level corresponds to finer statistics; therefore, it carries more characteristic information, but incurs a larger transmission cost. Naturally, there is an interplay of opposing factors, i.e., accuracy vs. overhead. At the server, tasks that involve information from multiple clients are executed. The question is to find an optimal data acquisition strategy that maximizes the conservation of limited system resources [11].

A more specific scenario arises in a sensor-net: anomalous event detection is a collaborative task, which involves the aggregation of measurements from a number of sensors. Only if a certain number of these sensors signal an alarm and a consensus is reached is a drill-down analysis performed to collect more information and take affirmative steps. After computing a local fingerprint on its stream of data, each sensor needs to diffuse this fingerprint into the net in order to reach a consensus on the alarming incidents [2]. This work can introduce "reactive monitoring" into sensor networks.

5.2 Probabilistic modeling of sensor networks

Embedded low-power sensing devices revolutionize the way we collect and process information for building emergency response systems. Miniature sensors are deployed to monitor ever-changing conditions in their surroundings. Statistical models, such as stochastic models and multivariate regression, enable capturing intra-sensor and inter-sensor dependencies in order to model sensor network data accurately. Such models can be used in backcasting missing sensor values, forecasting future data values, and guiding efficient data acquisition. Current mathematical models allow decomposing the main research problem into subproblems [14]. This in turn leads to a natural way of computing model components incrementally and in a distributed manner.

Regressive models have been proposed for computing inter-scale and intra-scale correlations among wavelet coefficients [24]. The magnitude of these coefficients is used for detecting interesting events such as seasonal components and bursts of events. The wavelet coefficients at a given level j in the multi-resolution index are expressed as a function of the k previous coefficients at the same level plus noise ε (optionally including coefficients from upper levels), i.e.,

A^{(x_0)}_j = \beta_1 A^{(x_1)}_j + \ldots + \beta_k A^{(x_k)}_j + \epsilon_{j,t}    (11.14)

where the symbol x_i, 0 ≤ i ≤ k, denotes x[t − (i + 1)·2^j W + 1 : t − i·2^j W], and the term ε_{j,t} denotes the added noise. Recursive Least Squares is used to update these regressive models incrementally [29]. Further research efforts are encouraged on exploring how to use these models for compressing and querying stream information about the past and, more importantly, in a feedback loop for setting the query window parameters automatically and in a semantically meaningful manner.

5.3 Content distribution networks

Publish-and-subscribe services provide the ability to create persistent queries or subscriptions to new content. In a typical content-based pub-sub system, content providers send structured content to instances of the pub-sub service, which are responsible for sending messages to the subscribers of each particular content [27]. The pub-sub system forms a semantic layer on top of a monitoring infrastructure by providing a query interface: events of interest are specified using an appropriate continuous query language [19]. Furthermore, it realizes the reactive part of the whole infrastructure by sending notifications about events of interest to users. Recent advances in application-layer multicast for content delivery address the scalability issues that usually arise in data stream applications with large receiver sets [4]. However, the problem of providing real-time guarantees for time-critical user tasks under stringent constraints still needs exploration.

6. Chapter Summary

In this chapter, we presented a space- and time-efficient architecture to extract features over streams and index these features for improving query performance. The maintenance cost of the index structure is kept low by computing transformation coefficients online: the coefficients at higher levels are computed over the index that stores the coefficients at lower levels. This approach decreases per-item processing time considerably and minimizes the space required for incremental computation. The index structure has an adaptive time-space complexity depending on the update rate and the number of coefficients maintained, and guarantees the approximation quality through provable error bounds.

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In FODO, pages 69–84, 1993.
[2] A. Akella, A. Bharambe, M. Reiter, and S. Seshan. Detecting DDoS attacks on ISP networks. In MPDS, 2003.
[3] A. Arasu and J. Widom. Resource sharing in continuous sliding-window aggregates. In VLDB, pages 336–347, 2004.
[4] S. Banerjee, B. Bhattacharjee, and C. Kommareddy. Scalable application layer multicast. In SIGCOMM, pages 205–217, 2002.
[5] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, pages 322–331, 1990.
[6] J. Bentley, B. Weide, and A. Yao. Optimal expected time algorithms for closest point problems. ACM Trans. on Math. Software, volume 6, pages 563–580, 1980.
[7] A. Bulut and A. Singh. SWAT: Hierarchical stream summarization in large networks. In ICDE, pages 303–314, 2003.
[8] A. Bulut and A. Singh. A unified framework for monitoring data streams in real time. In ICDE, pages 44–55, 2005.
[9] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, pages 215–226, 2002.
[10] Y. Chen, G. Dong, J. Han, J. Pei, B. Wah, and J. Wang. Online analytical processing stream data: Is it feasible? In DMKD, 2002.
[11] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, pages 588–599, 2004.
[12] P. Dinda. CMU, Aug 97 Load Trace. In Host Load Data Archive, http://www.cs.northwestern.edu/~pdinda/LoadTraces/.
[13] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, pages 419–429, 1994.
[14] C. Guestrin, P. Bodik, R. Thibaux, M. Paskin, and S. Madden. Distributed regression: an efficient framework for modeling sensor network data. In IPSN, pages 1–10, 2004.
[15] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, pages 47–57, 1984.
[16] T. Kahveci and A. Singh. Variable length queries for time series data. In ICDE, pages 273–282, 2001.
[17] E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In SIGMOD, pages 151–162, 2001.
[18] E. Keogh and T. Folias. Time Series Data Mining Archive. http://www.cs.ucr.edu/~eamonn/TSDMA, 2002.
[19] Y. Law, H. Wang, and C. Zaniolo. Query languages and data models for database sequences and data streams. In VLDB, pages 492–503, 2004.
[20] M. Lee, W. Hsu, C. Jensen, B. Cui, and K. Teo. Supporting frequent updates in R-Trees: A bottom-up approach. In VLDB, pages 608–619, 2003.
[21] S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.
[22] Y. Moon, K. Whang, and W. Han. General match: a subsequence matching method in time-series databases based on generalized windows. In SIGMOD, pages 382–393, 2002.
[23] T. Palpanas, M. Vlachos, E. Keogh, D. Gunopulos, and W. Truppel. Online amnesic approximation of streaming time series. In ICDE, pages 338–349, 2004.
[24] S. Papadimitriou, A. Brockwell, and C. Faloutsos. AWSOM: Adaptive, hands-off stream mining. In VLDB, pages 560–571, 2003.
[25] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, pages 71–79, 1995.
[26] H. Wu, B. Salzberg, and D. Zhang. Online event-driven subsequence matching over financial data streams. In SIGMOD, pages 23–34, 2004.
[27] B. Wyman and D. Werner. Content-based Publish-Subscribe over APEX. Internet-Draft, April 2002.
[28] B. Yi, N. Sidiropoulos, T. Johnson, H. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. In ICDE, 2000.
[29] P. Young. Recursive Estimation and Time-Series Analysis: An Introduction. Springer-Verlag, 1984.
[30] Y. Zhu and D. Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In VLDB, pages 358–369, 2002.
[31] Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In SIGKDD, pages 336–345, 2003.


Chapter 12

DIMENSIONALITY REDUCTION AND FORECASTING ON STREAMS

Spiros Papadimitriou,1 Jimeng Sun,2 and Christos Faloutsos2

1 IBM Watson Research Center, Hawthorne, NY, USA


2 Carnegie Mellon University, Pittsburgh, PA, USA


Abstract  We consider the problem of capturing correlations and finding hidden variables corresponding to trends on collections of time series streams. Our proposed method, SPIRIT, can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection. It can do this quickly, with no buffering of stream values and without comparing pairs of streams. Moreover, it is any-time, single pass, and it dynamically detects changes. The discovered trends can also be used to immediately spot potential anomalies, to do efficient forecasting and, more generally, to dramatically simplify further data processing.

Introduction

In this chapter, we consider the problem of capturing correlations and finding hidden variables corresponding to trends in collections of semi-infinite, time series data streams, where the data consist of tuples with n numbers, one for each time tick t.

Streams often are inherently correlated (e.g., temperatures in the same building, traffic in the same network, prices in the same market, etc.) and it is possible to reduce hundreds of numerical streams into just a handful of hidden variables that compactly describe the key trends and dramatically reduce the complexity of further data processing. We propose an approach to do this incrementally.


Figure 12.1. Illustration of the problem ((a) sensor measurements, (b) hidden variables). Sensors measure chlorine in drinking water and show a daily, near sinusoidal periodicity during phases 1 and 3. During phase 2, some of the sensors are "stuck" due to a major leak. The extra hidden variable introduced during phase 2 captures the presence of a new trend. SPIRIT can also tell us which sensors participate in the new, "abnormal" trend (e.g., close to a construction site). In phase 3, everything returns to normal.

We describe a motivating scenario to illustrate the problem we want to solve. Consider a large number of sensors measuring chlorine concentration in a drinkable water distribution network (see Figure 12.1, showing 15 days worth of data). Every five minutes, each sensor sends its measurement to a central node, which monitors and analyses the streams in real time.

The patterns in chlorine concentration levels normally arise from water demand. If water is not refreshed in the pipes, existing chlorine reacts with pipe walls and micro-organisms and its concentration drops. However, if fresh water flows in at a particular location due to demand, chlorine concentration rises again. The rise depends primarily on how much chlorine is originally mixed at the reservoirs (and also, to a small extent, on the distance to the closest reservoir; as the distance increases, the peak concentration drops slightly, due to reactions along the way). Thus, since demand typically follows a periodic pattern, chlorine concentration reflects that (see Figure 12.1a, bottom): it is high when demand is high and vice versa.

Assume that at some point in time, there is a major leak at some pipe in the network. Since fresh water flows in constantly (possibly mixed with debris from the leak), chlorine concentration at the nodes near the leak will be close to peak at all times.

Figure 12.1a shows measurements collected from two nodes, one away from the leak (bottom) and one close to the leak (top). At any time, a human operator would like to know how many trends (or hidden variables) are in the data and ask queries about them. Each hidden variable essentially corresponds to a group of correlated streams.

In this simple example, SPIRIT discovers the correct number of hidden variables. Under normal operation, only one hidden variable is needed, which corresponds to the periodic pattern (Figure 12.1b, top). Both observed variables follow this hidden variable (multiplied by a constant factor, which is the participation weight of each observed variable in the particular hidden variable). Mathematically, the hidden variables are the principal components of the observed variables and the participation weights are the entries of the principal direction vectors (more precisely, this is true under certain assumptions, which will be explained later).

However, during the leak, a second trend is detected and a new hidden variable is introduced (Figure 12.1b, bottom). As soon as the leak is fixed, the number of hidden variables returns to one. If we examine the hidden variables, the interpretation is straightforward: The first one still reflects the periodic demand pattern in the sections of the network under normal operation. All nodes in this section of the network have a participation weight of ≈ 1 to the "periodic trend" hidden variable and ≈ 0 to the new one. The second hidden variable represents the additive effect of the catastrophic event, which is to cancel out the normal pattern. The nodes close to the leak have participation weights of ≈ 0.5 to both hidden variables.

Summarising, SPIRIT can tell us the following (Figure 12.1): (i) Under normal operation (phases 1 and 3), there is one trend. The corresponding hidden variable follows a periodic pattern and all nodes participate in this trend. All is well. (ii) During the leak (phase 2), there is a second trend, trying to cancel the normal trend. The nodes with non-zero participation in the corresponding hidden variable can be immediately identified (e.g., they are close to a construction site). An abnormal event may have occurred in the vicinity of those nodes, which should be investigated.

Matters are further complicated when there are hundreds or thousands of nodes and more than one demand pattern. However, as we show later, SPIRIT is still able to extract the key trends from the stream collection, follow trend drifts and immediately detect outliers and abnormal events. Besides providing a concise summary of key trends/correlations among streams, SPIRIT can successfully deal with missing values and its discovered hidden variables can be used to do very efficient, resource-economic forecasting.

There are several other applications and domains to which SPIRIT can be applied. For example: (i) given more than 50,000 securities trading in the US on a second-by-second basis, detect patterns and correlations [27]; (ii) given traffic measurements [24], find routers that tend to go down together.

Contributions

The problem of pattern discovery in a large number of co-evolving streams has attracted much attention in many domains. We introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Time-series), a comprehensive approach to discover correlations that effectively and efficiently summarise large collections of streams. SPIRIT satisfies the following requirements:


(i) It is streaming, i.e., it is incremental, scalable, any-time. It requires very little memory and processing time per time tick. In fact, both are independent of the stream length t.

(ii) It scales linearly with the number of streams n, not quadratically. This may seem counter-intuitive, because the naïve method to spot correlations across n streams examines all O(n²) pairs.

(iii) It is adaptive, and fully automatic. It dynamically detects changes (both gradual and sudden) in the input streams, and automatically determines the number k of hidden variables.

The correlations and hidden variables we discover have multiple uses. They provide a succinct summary to the user, they can help to do fast forecasting and detect outliers, and they facilitate interpolations and handling of missing values, as we discuss later.

The rest of the chapter is organized as follows: Section 1 discusses related work on data streams and stream mining. Sections 2 and 3 overview some of the background. Section 5 describes our method and Section 6 shows how its output can be interpreted and immediately utilized, both by humans and for further data analysis. Section 7 discusses experimental case studies that demonstrate the effectiveness of our approach. In Section 8 we elaborate on the efficiency and accuracy of SPIRIT. Finally, in Section 9 we conclude.

1. Related work

Much of the work on stream mining has focused on finding interesting patterns in a single stream, but multiple streams have also attracted significant interest. Ganti et al. [8] propose a generic framework for stream mining. [10] propose a one-pass k-median clustering algorithm. [6] construct a decision tree online, by passing over the data only once. Recently, [12] and [22] address the problem of finding patterns over concept-drifting streams. [19] proposed a method to find patterns in a single stream, using wavelets. More recently, [18] consider approximation of time series with amnesic functions. They propose novel techniques suitable for streaming, and applicable to a wide range of user-specified approximating functions.

[15] propose parameter-free methods for classic data mining tasks (i.e., clustering, anomaly detection, classification), based on compression. [16] perform clustering on different levels of wavelet coefficients of multiple time series. Both approaches require having all the data in advance. Recently, [2] propose a framework for Phenomena Detection and Tracking (PDT) in sensor networks. They define a phenomenon on discrete-valued streams and develop query execution techniques based on multi-way hash join with PDT-specific optimizations.

CluStream (1) is a flexible clustering framework with online and offline components. The online component extends micro-cluster information (26) by incorporating exponentially-sized sliding windows while coalescing micro-cluster summaries. Actual clusters are found by the offline component. StatStream (27) uses the DFT to summarise streams within a finite window and then computes the highest pairwise correlations among all pairs of streams, at each timestamp. BRAID (20) addresses the problem of discovering lag correlations among multiple streams. The focus is on time- and space-efficient methods for finding the earliest and highest peak in the cross-correlation functions between all pairs of streams. Neither CluStream, StatStream, nor BRAID explicitly focuses on discovering hidden variables.

[9] improve on discovering correlations by first doing dimensionality reduction with random projections, and then periodically computing the SVD. However, the method incurs high overhead because of the SVD re-computation and it cannot easily handle missing values. Also related to these is the work of [4], which uses a different formulation of linear correlations and focuses on compressing historical data, mainly for power conservation in sensor networks. MUSCLES (24) is exactly designed to do forecasting (thus it could handle missing values). However, it cannot find hidden variables and it scales poorly for a large number of streams n, since it requires at least quadratic space and time, or expensive reorganisation (Selective MUSCLES).

Finally, a number of the above methods usually require choosing a sliding window size, which typically translates to buffer space requirements. Our approach does not require any sliding windows and does not need to buffer any of the stream data.

In conclusion, none of the above methods simultaneously satisfy the requirements in the introduction: "any-time" streaming operation, scalability in the number of streams, adaptivity, and full automation.

2. Principal component analysis (PCA)

Here we give a brief overview of PCA (13) and explain the intuition behind our approach. We use standard matrix algebra notation: vectors are lower-case bold, matrices are upper-case bold, and scalars are in plain font. The transpose of matrix X is denoted by X^T. In the following, x_t ≡ [x_{t,1} x_{t,2} ··· x_{t,n}]^T ∈ R^n is the column-vector of stream values at time t. We adhere to the common convention of using column vectors and writing them out in transposed form. The stream data can be viewed as a continuously growing t × n matrix X_t := [x_1 x_2 ··· x_t]^T ∈ R^{t×n}, where one new row is added at each time tick t. In the chlorine example, x_t is the measurements column-vector at t over all the sensors, where n is the number of chlorine sensors and t is the measurement timestamp.

Typically, in collections of n-dimensional points x_t ≡ [x_{t,1}, ..., x_{t,n}]^T, t = 1, 2, ..., there exist correlations between the n dimensions (which correspond to streams in our setting). These can be captured by principal components analysis (PCA). Consider for example the setting in Figure 12.2. There is a visible linear correlation. Thus, if we represent every point with its projection on the direction of w_1, the error of this approximation is very small. In fact, the first principal direction w_1 is optimal in the following sense.

Figure 12.2. Illustration of updating w_1 when a new point x_{t+1} arrives ((a) original w_1, (b) update process, (c) resulting w_1).

Definition 12.1 (First principal component) Given a collection of n-dimensional vectors x_τ ∈ R^n, τ = 1, 2, ..., t, the first principal direction w_1 ∈ R^n is the vector minimizing the sum of squared residuals, i.e.,

w_1 := \arg\min_{\|w\|=1} \sum_{\tau=1}^{t} \| x_\tau - (w w^T) x_\tau \|^2 .

The projection of x_τ on w_1 is the first principal component (PC) y_{τ,1} := w_1^T x_τ, τ = 1, ..., t.

Note that, since ‖w_1‖ = 1, we have (w_1 w_1^T) x_τ = (w_1^T x_τ) w_1 = y_{τ,1} w_1 =: x̃_τ, where x̃_τ is the projection of y_{τ,1} back into the original n-D space. That is, x̃_τ is the reconstruction of the original measurements from the first PC y_{τ,1}. More generally, PCA will produce k vectors w_1, w_2, ..., w_k such that, if we represent each n-D data point x_t := [x_{t,1} ··· x_{t,n}] with its k-D projection y_t := [w_1^T x_t ··· w_k^T x_t]^T, then this representation minimises the squared error Σ_τ ‖x_τ − x̃_τ‖². Furthermore, the principal directions are orthogonal, so the principal components y_{τ,i}, 1 ≤ i ≤ k, are by construction uncorrelated, i.e., if y^{(i)} := [y_{1,i}, ..., y_{t,i}, ...]^T is the stream of the i-th principal component, then (y^{(i)})^T y^{(j)} = 0 if i ≠ j.

Observation 2.1 (Dimensionality reduction) If we represent each n-dimensional point x_τ ∈ R^n using all n principal components, then the error ‖x_τ − x̃_τ‖ = 0. However, in typical datasets, we can achieve a very small error using only k principal components, where k ≪ n.

In the context of the chlorine example, each point in Figure 12.2 would correspond to the 2-D projection of x_τ (where 1 ≤ τ ≤ t) onto the first two principal directions, w_1 and w_2, which are the most important according to the distribution of {x_τ | 1 ≤ τ ≤ t}. The principal components y_{τ,1} and y_{τ,2} are the coordinates of these projections in the orthogonal coordinate system defined by w_1 and w_2.

Table 12.1. Description of notation.

Symbol      Description
x, ...      Column vectors (lowercase boldface).
A, ...      Matrices (uppercase boldface).
x_t         The n stream values x_t := [x_{t,1} ··· x_{t,n}]^T at time t.
n           Number of streams.
w_i         The i-th participation weight vector (i.e., principal direction).
k           Number of hidden variables.
y_t         Vector of hidden variables (i.e., principal components) for x_t, i.e., y_t ≡ [y_{t,1} ··· y_{t,k}]^T := [w_1^T x_t ··· w_k^T x_t]^T.
x̃_t         Reconstruction of x_t from the k hidden variable values, i.e., x̃_t := y_{t,1} w_1 + ··· + y_{t,k} w_k.
E_t         Total energy up to time t.
Ẽ_{t,i}     Total energy captured by the i-th hidden variable, up to time t.
f_E, F_E    Lower and upper bounds on the fraction of energy we wish to maintain via SPIRIT's approximation.

However, batch methods for estimating the principal components require time that depends on the duration t, which grows to infinity. In fact, the principal directions are the eigenvectors of X_t^T X_t, which are best computed through the singular value decomposition (SVD) of X_t. Space requirements also depend on t. Clearly, in a stream setting, it is impossible to perform this computation at every step, aside from the fact that we don't have the space to store all past values. In short, we want a method that does not need to store any past values.
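For contrast, a minimal batch computation is sketched below. It follows the definitions above directly (no mean-centering is applied) and requires the full t × n matrix in memory, which is exactly what the streaming setting forbids.

```python
import numpy as np

t, n, k = 1000, 5, 2
X = np.random.rand(t, n)          # t x n stream matrix X_t

U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k]                        # rows are w_1, ..., w_k (principal directions)
Y = X @ W.T                       # k-dimensional projections y_t (principal components)
X_rec = Y @ W                     # reconstructions from the k components
rel_err = np.sum((X - X_rec) ** 2) / np.sum(X ** 2)   # relative squared error
```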

3. Auto-regressive models and recursive least squares

In this section we review some of the background on forecasting.

Auto-regressive (AR) modeling

Auto-regressive models are the most widely known and used; more information can be found in, e.g., (3). The main idea is to express x_t as a function of its previous values, plus (filtered) noise ε_t:

x_t = φ_1 x_{t−1} + ... + φ_W x_{t−W} + ε_t,    (12.1)

where W is the forecasting window size. Seasonal variants (SAR, SAR(I)MA) also use window offsets that are multiples of a single, fixed period (i.e., besides terms of the form y_{t−i}, the equation contains terms of the form y_{t−Si}, where S is a constant).

If we have a collection of n time series x_{t,i}, 1 ≤ i ≤ n, then multivariate AR simply expresses x_{t,i} as a linear combination of previous values of all streams (plus noise), i.e.,

x_{t,i} = φ_{1,1} x_{t−1,1} + ... + φ_{1,W} x_{t−W,1} +
          ... +
          φ_{n,1} x_{t−1,n} + ... + φ_{n,W} x_{t−W,n} + ε_t.    (12.2)

Recursive Least Squares (RLS)

Recursive Least Squares (RLS) is a method that allows dynamic update of a least-squares fit. The least-squares solution to an overdetermined system of equations Xb = y, where X ∈ R^{m×k} (measurements), y ∈ R^m (output variables) and b ∈ R^k (regression coefficients to be estimated), is given by the solution of X^T X b = X^T y. Thus, all we need for the solution are the projections

P ≡ X^T X   and   q ≡ X^T y.

We need only space O(k² + k) = O(k²) to keep the model up to date. When a new row x_{m+1} ∈ R^k and output y_{m+1} arrive, we can update

P ← P + x_{m+1} x_{m+1}^T   and   q ← q + y_{m+1} x_{m+1}.

In fact, it is possible to update the regression coefficient vector b without explicitly inverting P to solve Pb = q. In particular (see, e.g., (25)) the update equations are

G ← G − (1 + x_{m+1}^T G x_{m+1})^{-1} G x_{m+1} x_{m+1}^T G    (12.3)

b ← b − G x_{m+1} (x_{m+1}^T b − y_{m+1}),    (12.4)

where the matrix G can be initialized to G ← εI, with ε a small positive number and I the k × k identity matrix.
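A small sketch of one RLS step following Eqs. 12.3 and 12.4. The only assumption made here is the update order: G is updated first and then used to correct b; G is initialised as stated in the text.

```python
import numpy as np

def rls_update(G, b, x_new, y_new):
    """One Recursive Least Squares step (Eqs. 12.3-12.4).
    G: k x k matrix, b: length-k coefficient vector (both float arrays,
    updated in place), x_new: new regressor row (length k), y_new: new output."""
    x = np.asarray(x_new, dtype=float).reshape(-1, 1)
    Gx = G @ x
    G -= (Gx @ Gx.T) / (1.0 + float(x.T @ Gx))               # Eq. 12.3
    b -= (G @ x).ravel() * (float(x.ravel() @ b) - y_new)    # Eq. 12.4
    return G, b
```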

RLS and AR  In the context of auto-regressive modeling (Eq. 12.1), we have one equation for each stream value x_{w+1}, ..., x_t, ..., i.e., the m-th row of the X matrix above is

X_m = [x_{m−1} x_{m−2} ··· x_{m−w}]^T ∈ R^w

and z_m = x_m, for m = w + 1, w + 2, ..., t (one row per m, i.e., t − w rows in total). In this case, the solution vector b consists precisely of the auto-regression coefficients in Eq. 12.1, i.e.,

b = [φ_1 φ_2 ··· φ_w]^T ∈ R^w.

RLS can be similarly used for multivariate AR model estimation.


4. MUSCLES

MUSCLES (MUlti-SequenCe LEast Squares) (24) tries to predict the value of one stream, x_{t,i}, based on the previous values from all streams, x_{t−l,j}, l ≥ 1, 1 ≤ j ≤ n, and current values from other streams, x_{t,j}, j ≠ i. It uses multivariate autoregression, thus the prediction x̂_{t,i} for a given stream i is, similar to Eq. 12.2,

x̂_{t,i} = φ_{1,0} x_{t,1} + φ_{1,1} x_{t−1,1} + ... + φ_{1,W} x_{t−W,1} +
          ... +
          φ_{i−1,0} x_{t,i−1} + φ_{i−1,1} x_{t−1,i−1} + ... + φ_{i−1,W} x_{t−W,i−1} +
          φ_{i,1} x_{t−1,i} + ... + φ_{i,W} x_{t−W,i} +
          φ_{i+1,0} x_{t,i+1} + φ_{i+1,1} x_{t−1,i+1} + ... + φ_{i+1,W} x_{t−W,i+1} +
          ... +
          φ_{n,0} x_{t,n} + φ_{n,1} x_{t−1,n} + ... + φ_{n,W} x_{t−W,n} + ε_t,

and employs RLS to continuously update the coefficients φ_{i,j} such that the prediction error

\sum_{\tau=1}^{t} (x_{\tau,i} - \hat{x}_{\tau,i})^2

is minimized. Note that the above equation has one dependent variable (the estimate x̂_{t,i}) and v = W·n + n − 1 independent variables (the past values of all streams plus the current values of all other streams except i).

Exponentially forgetting MUSCLES employs a forgetting factor 0 < λ ≤ 1 and minimizes instead

\sum_{\tau=1}^{t} \lambda^{t-\tau} (x_{\tau,i} - \hat{x}_{\tau,i})^2 .

For λ < 1, errors for old values are downplayed by a geometric factor, and hence it permits the estimate to adapt as sequence characteristics change.

Selective MUSCLES

In case we have too many time sequences (e.g., n = 100,000 nodes in a network, producing information about their load every minute), even the incremental version of MUSCLES will suffer. The solution we propose is based on the conjecture that we do not really need information from every sequence to make a good estimation of a missing value: much of the benefit of using multiple sequences may be captured by using only a small number of carefully selected other sequences. Thus, we propose to do some preprocessing of a training set, to find a promising subset of sequences, and to apply MUSCLES only to those promising ones (hence the name Selective MUSCLES).


Assume that sequence i is the one notoriously delayed and we need to estimate its "delayed" values x_{t,i}. For a given tracking window span W, among the v = W·n + n − 1 independent variables, we have to choose the ones that are most useful in estimating the delayed value of x_{t,i}. More generally, we want to solve the following

Problem 4.1 (Subset selection) Given v independent variables x_1, x_2, ..., x_v and a dependent variable y with N samples each, find the best b (< v) independent variables to minimize the mean-square error for y for the given samples.

We need a measure of goodness to decide which subset of b variables is the best we can choose. Ideally, we should choose the best subset that yields the smallest estimation error in the future. Since, however, we don't have future samples, we can only infer the expected estimation error (EEE for short) from the available samples as follows:

EEE(S) = \sum_{t=1}^{N} (y[t] - \hat{y}_S[t])^2

where S is the selected subset of variables and ŷ_S[t] is the estimation based on S for the t-th sample. Note that, thanks to Eq. 12.3, EEE(S) can be computed in O(N·|S|²) time. Let's say that we are allowed to keep only b = 1 independent variable. Which one should we choose? Intuitively, we could try the one that has the highest (in absolute value) correlation coefficient with y. It turns out that this is indeed optimal (to satisfy the unit-variance assumption, we normalize samples by the sample variance within the window).

Lemma 12.2 Given a dependent variable y, and v independent variables with unit variance, the best single variable to keep to minimize EEE(S) is the one with the highest absolute correlation coefficient with y.

Proof For a single variable, if a is the least-squares solution, we can express the error in matrix form as

EEE(x_i) = \|y\|^2 - 2a\,(y^T x_i) + a^2 \|x_i\|^2 .

Let d and p denote ‖x_i‖² and (x_i^T y), respectively. Since a = d^{−1}p, EEE(x_i) = ‖y‖² − p²d^{−1}. To minimize the error, we must choose the x_i which maximizes p² and minimizes d. Assuming unit variance (d = 1), such an x_i is the one with the biggest correlation coefficient to y. This concludes the proof.

The question is how we should handle the case when b > 1. Normally, we should consider all the possible groups of b independent variables, and try to pick the best. This approach explodes combinatorially; thus we propose to use a greedy algorithm. At each step s, we select the independent variable x_s that minimizes the EEE for the dependent variable y, in light of the s − 1 independent variables that we have already chosen in the previous steps.

The bottleneck of the algorithm is clearly the computation of EEE. Since it computes EEE approximately O(v · b) times and each computation of EEE requires O(N · b²) on average, the overall complexity mounts to O(N · v · b³). To reduce the overhead, we observe that intermediate results produced for EEE(S) can be re-used for EEE(S ∪ {x}).

Lemma 12.3 The complexity of the greedy selection algorithm is O(N · v · b²).

Proof Let S+ be S ∪ {x}. The core in computing EEE(S+) is the inverse of D_{S+} = (X_{S+}^T X_{S+}). Thanks to the block matrix inversion formula (14) (p. 656) and the availability of D_S^{−1} from the previous iteration step, it can be computed in O(N · |S| + |S|²). Hence, summing it up over the v − |S| remaining variables for each of the b iterations, we have O(N · v · b² + v · b³) complexity. Since N ≫ b, it reduces to O(N · v · b²).

We envision that the subset selection will be done infrequently and off-line, say every N = W time-ticks. The optimal choice of the reorganization window W is beyond the scope of this chapter. Potential solutions include (a) doing reorganization during off-peak hours, (b) triggering a reorganization whenever the estimation error increases above an application-dependent threshold, etc. Also, by normalizing the training set, the unit-variance assumption in Lemma 12.2 can be easily satisfied.
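A naive sketch of the greedy selection follows. For clarity it refits the least-squares solution from scratch with numpy for every candidate, rather than using the incremental block-inversion trick above, so its running time is worse than the bound in Lemma 12.3.

```python
import numpy as np

def greedy_subset(X, y, b):
    """Pick b of the v independent variables (columns of the N x v matrix X)
    greedily, adding at each step the column that most reduces EEE(S)."""
    N, v = X.shape
    selected, remaining = [], list(range(v))
    for _ in range(b):
        best_j, best_err = None, np.inf
        for j in remaining:
            cols = selected + [j]
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            err = float(np.sum((y - X[:, cols] @ coef) ** 2))   # EEE(S u {x_j})
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```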

5. Tracking correlations and hidden variables: SPIRIT

In this section we present our framework for discovering patterns in multiple streams. In the next section, we show how these can be used to perform effective, low-cost forecasting. We use auto-regression for its simplicity, but our framework allows any forecasting algorithm to take advantage of the compact representation of the stream collection.

Problem definition  Given a collection of n co-evolving, semi-infinite streams, producing a value x_{t,j} for every stream 1 ≤ j ≤ n and for every time tick t = 1, 2, ..., SPIRIT does the following: (i) Adapts the number k of hidden variables necessary to explain/summarise the main trends in the collection. (ii) Adapts the participation weights w_{i,j} of the j-th stream on the i-th hidden variable (1 ≤ j ≤ n and 1 ≤ i ≤ k), so as to produce an accurate summary of the stream collection. (iii) Monitors the hidden variables y_{t,i}, for 1 ≤ i ≤ k. (iv) Keeps updating all of the above efficiently.


More precisely, SPIRIT operates on the column-vectors of observed stream values x_t ≡ [x_{t,1}, ..., x_{t,n}]^T and continually updates the participation weights w_{i,j}. The participation weight vector w_i for the i-th principal direction is w_i := [w_{i,1} ··· w_{i,n}]^T. The hidden variables y_t ≡ [y_{t,1}, ..., y_{t,k}]^T are the projections of x_t onto each w_i, over time (see Table 12.1), i.e.,

y_{t,i} := w_{i,1} x_{t,1} + w_{i,2} x_{t,2} + ··· + w_{i,n} x_{t,n}.

SPIRIT also adapts the number k of hidden variables necessary to capture most of the information. The adaptation is performed so that the approximation achieves a desired mean-square error. In particular, let x̃_t = [x̃_{t,1} ··· x̃_{t,n}]^T be the reconstruction of x_t, based on the weights and hidden variables, defined by

x̃_{t,j} := w_{1,j} y_{t,1} + w_{2,j} y_{t,2} + ··· + w_{k,j} y_{t,k},

or more succinctly, x̃_t = \sum_{i=1}^{k} y_{t,i} w_i.

In the chlorine example, x_t is the n-dimensional column-vector of the original sensor measurements and y_t is the hidden variable column-vector, both at time t. The dimension of y_t is 1 before/after the leak (t < 1500 or t > 3000) and 2 during the leak (1500 ≤ t ≤ 3000), as shown in Figure 12.1.

Definition 12.4 (SPIRIT tracking) SPIRIT updates the participation weights w_{i,j} so as to guarantee that the reconstruction error ‖x̃_t − x_t‖² over time is predictably small.

This informal definition describes what SPIRIT does. The precise criteria regarding the reconstruction error will be explained later. If we assume that the x_t are drawn according to some distribution that does not change over time (i.e., under stationarity assumptions), then the weight vectors w_i converge to the principal directions. However, even if there are non-stationarities in the data (i.e., gradual drift), in practice we can deal with these very effectively, as we explain later.

An additional complication is that we often have missing values, for several reasons: either failure of the system, or delayed arrival of some measurements. For example, the sensor network may get overloaded and fail to report some of the chlorine measurements in time, or some sensor may temporarily black out. At the very least, we want to continue processing the rest of the measurements.

Tracking the hidden variables

The first step is, for a given k, to incrementally update the k participation weight vectors w_i, 1 ≤ i ≤ k, so as to summarise the original streams with only a few numbers (the hidden variables). In Section 5.0, we describe the complete method, which also adapts k.


For the moment, assume that the number of hidden variables k is given. Furthermore, our goal is to minimise the average reconstruction error \sum_t ‖x_t − x̃_t‖². In this case, the desired weight vectors w_i, 1 ≤ i ≤ k, are the principal directions, and it turns out that we can estimate them incrementally.

We use an algorithm based on adaptive filtering techniques (23, 11), which have been tried and tested in practice, performing well in a variety of settings and applications (e.g., image compression and signal tracking for antenna arrays). We experimented with several alternatives (17, 5) and found this particular method to have the best properties for our setting: it is very efficient in terms of computational and memory requirements, while converging quickly, with no special parameters to tune. The main idea behind the algorithm is to read in the new values x_{t+1} ≡ [x_{(t+1),1}, ..., x_{(t+1),n}]^T from the n streams at time t + 1, and perform three steps:

1. Compute the hidden variables y'_{t+1,i}, 1 ≤ i ≤ k, based on the current weights w_i, 1 ≤ i ≤ k, by projecting x_{t+1} onto these.

2. Estimate the reconstruction error (e_i below) and the energy, based on the y'_{t+1,i} values.

3. Update the estimates of w_i, 1 ≤ i ≤ k, and output the actual hidden variables y_{t+1,i} for time t + 1.

To illustrate this, Figure 12.2b shows e_1 and y_1 when the new data x_{t+1} enter the system. Intuitively, the goal is to adaptively update w_i so that it quickly converges to the "truth." In particular, we want to update w_i more when e_i is large. However, the magnitude of the update should also take into account the past data currently "captured" by w_i. For this reason, the update is inversely proportional to the current energy E_{t,i} of the i-th hidden variable, which is E_{t,i} := \frac{1}{t} \sum_{\tau=1}^{t} y_{\tau,i}^2. Figure 12.2c shows w_1 after the update for x_{t+1}.

Algorithm TrackW
0. Initialise the k participation weight vectors w_i to unit vectors w_1 = [1 0 ··· 0]^T, w_2 = [0 1 0 ··· 0]^T, etc. Initialise d_i (i = 1, ..., k) to a small positive value. Then:
1. As each point x_{t+1} arrives, initialise x́_1 := x_{t+1}.
2. For 1 ≤ i ≤ k, we perform the following assignments and updates, in order:

   y_i := w_i^T x́_i                 (y_{t+1,i} = projection onto w_i)
   d_i ← λ d_i + y_i²               (energy ∝ i-th eigenvalue of X_t^T X_t)
   e_i := x́_i − y_i w_i             (error, e_i ⊥ w_i)
   w_i ← w_i + (1/d_i) y_i e_i      (update PC estimate)
   x́_{i+1} := x́_i − y_i w_i         (repeat with remainder of x_{t+1}).


The forgetting factor λ will be discussed in Section 5.0 (for now, assume λ = 1). For each i, d_i = t E_{t,i}, and x́_i is the component of x_{t+1} in the orthogonal complement of the space spanned by the updated estimates w_{i'}, 1 ≤ i' < i, of the participation weights. The vectors w_i, 1 ≤ i ≤ k, are in order of importance (more precisely, in order of decreasing eigenvalue or energy). It can be shown that, under stationarity assumptions, the w_i in these equations converge to the true principal directions.

Complexity  We only need to keep the k weight vectors w_i (1 ≤ i ≤ k), each n-dimensional. Thus the total cost is O(nk), both in time and in space. The update cost does not depend on t. This is a tremendous gain, compared to the usual PCA computation cost of O(tn²).
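A direct transcription of TrackW into code is sketched below. The only choices made here are the data layout (a list of k weight vectors with their energies d_i) and the small initial value used for d_i.

```python
import numpy as np

def track_w_update(x, W, d, lam=1.0):
    """One TrackW step for a new point x (length n). W is a list of k weight
    vectors w_i, d the list of energies d_i; both are updated in place.
    Returns the hidden variables y_{t+1,i}."""
    y = np.zeros(len(W))
    xi = np.asarray(x, dtype=float).copy()      # x'_1 := x_{t+1}
    for i in range(len(W)):
        y[i] = W[i] @ xi                        # projection onto w_i
        d[i] = lam * d[i] + y[i] ** 2           # energy ~ i-th eigenvalue
        e = xi - y[i] * W[i]                    # error, e_i orthogonal to w_i
        W[i] = W[i] + (y[i] / d[i]) * e         # update the PC estimate
        xi = xi - y[i] * W[i]                   # remainder for the next component
    return y

# usage sketch: k = 2 hidden variables over n = 4 streams
W = [np.eye(4)[0].copy(), np.eye(4)[1].copy()]  # unit-vector initialisation
d = [1e-3, 1e-3]
for x in np.random.rand(200, 4):
    y = track_w_update(x, W, d)
```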

Detecting the number of hidden variables

In practice, we do not know the number k of hidden variables. We propose to estimate k on the fly, so that we maintain a high percentage f_E of the energy E_t. Energy thresholding is a common method to determine how many principal components are needed (13). Formally, the energy E_t (at time t) of the sequence of x_t is defined as

E_t := \frac{1}{t} \sum_{\tau=1}^{t} \|x_\tau\|^2 = \frac{1}{t} \sum_{\tau=1}^{t} \sum_{i=1}^{n} x_{\tau,i}^2 .

Similarly, the energy Ẽ_t of the reconstruction x̃ is defined as

Ẽ_t := \frac{1}{t} \sum_{\tau=1}^{t} \|\tilde{x}_\tau\|^2 .

Lemma 12.5 Assuming the w_i, 1 ≤ i ≤ k, are orthonormal, we have

Ẽ_t = \frac{1}{t} \sum_{\tau=1}^{t} \|y_\tau\|^2 = \frac{t-1}{t} Ẽ_{t-1} + \frac{1}{t} \|y_t\|^2 .

Proof If the w_i, 1 ≤ i ≤ k, are orthonormal, then it follows easily that ‖x̃_τ‖² = ‖y_{τ,1} w_1 + ··· + y_{τ,k} w_k‖² = y_{τ,1}²‖w_1‖² + ··· + y_{τ,k}²‖w_k‖² = y_{τ,1}² + ··· + y_{τ,k}² = ‖y_τ‖² (Pythagorean theorem and normality). The result follows by summing over τ.

It can be shown that algorithm TrackW maintains orthonormality without the need for any extra steps (otherwise, a simple re-orthonormalisation step at the end would suffice).

From the user's perspective, we have a low-energy and a high-energy threshold, f_E and F_E, respectively. We keep enough hidden variables k, so that the retained energy is within the range [f_E · E_t, F_E · E_t]. Whenever we get outside these bounds, we increase or decrease k. In more detail, the steps are:


1. Estimate the full energy E_{t+1}, incrementally, from the sum of squares of x_{τ,i}.

2. Estimate the energy Ẽ^{(k)} of the k hidden variables.

3. Possibly, adjust k. We introduce a new hidden variable (update k ← k + 1) if the current hidden variables maintain too little energy, i.e., Ẽ^{(k)} < f_E E. We drop a hidden variable (update k ← k − 1) if the maintained energy is too high, i.e., Ẽ^{(k)} > F_E E.

The energy thresholds f_E and F_E are chosen according to recommendations in the literature (13, 7). We use a lower energy threshold f_E = 0.95 and an upper threshold F_E = 0.98. Thus, the reconstruction x̃_t retains between 95% and 98% of the energy of x_t.

Algorithm SPIRIT
0. Initialise k ← 1 and the total energy estimates of x_t and x̃_t per time tick to E ← 0 and Ẽ_1 ← 0. Then,
1. As each new point arrives, update w_i, for 1 ≤ i ≤ k (step 1, TrackW).
2. Update the estimates (for 1 ≤ i ≤ k)

   E ← \frac{(t-1)E + \|x_t\|^2}{t}   and   Ẽ_i ← \frac{(t-1)Ẽ_i + y_{t,i}^2}{t}.

3. Let the estimate of retained energy be

   Ẽ^{(k)} := \sum_{i=1}^{k} Ẽ_i.

   If Ẽ^{(k)} < f_E E, then we start estimating w_{k+1} (initialising as in step 0 of TrackW), initialise Ẽ_{k+1} ← 0 and increase k ← k + 1. If Ẽ^{(k)} > F_E E, then we discard w_k and Ẽ_k and decrease k ← k − 1.
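The k-adaptation step in isolation, as a sketch. It assumes the same list-based layout as the TrackW sketch above: E and Eh hold the running energy estimates of x_t and of each hidden variable, and W and d are the TrackW state.

```python
import numpy as np

def adjust_k(E, Eh, W, d, n, fE=0.95, FE=0.98):
    """Grow or shrink the number of hidden variables so that the retained energy
    sum(Eh) stays within [fE * E, FE * E]. W, d, Eh are parallel lists of weight
    vectors, TrackW energies d_i and per-hidden-variable energy estimates."""
    k = len(W)
    Ek = sum(Eh)                               # estimate of retained energy E^(k)
    if Ek < fE * E and k < n:                  # too little energy: add w_{k+1}
        W.append(np.eye(n)[k].copy())          # initialise as in step 0 of TrackW
        d.append(1e-3)
        Eh.append(0.0)
    elif Ek > FE * E and k > 1:                # too much energy: drop w_k
        W.pop(); d.pop(); Eh.pop()
    return len(W)
```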

The following lemma proves that the above algorithm guarantees that the relative reconstruction error is within the specified interval [f_E, F_E].

Lemma 12.6 The relative squared error of the reconstruction satisfies

1 - F_E \le \frac{\sum_{\tau=1}^{t} \|x_\tau - \tilde{x}_\tau\|^2}{\sum_{\tau=1}^{t} \|x_\tau\|^2} \le 1 - f_E .

Proof From the orthogonality of x̃_τ and x_τ − x̃_τ we have ‖x_τ − x̃_τ‖² = ‖x_τ‖² − ‖x̃_τ‖² = ‖x_τ‖² − ‖y_τ‖² (by Lemma 12.5). The result follows by summing over τ and from the definitions of E and Ẽ.

In Section 8.0 we demonstrate that the incremental weight estimates are extremely close to the principal directions computed with offline PCA.


Exponential forgetting

We can adapt to more recent behaviour by using an exponential forgetting factor 0 < λ < 1. This allows us to follow trend drifts over time. We use the same λ for the estimation of both the w_i and the AR models (see Section 6.0). However, we also have to properly keep track of the energy, discounting it at the same rate, i.e., the update at each step is:

E ← \frac{\lambda(t-1)E + \|x_t\|^2}{t}   and   Ẽ_i ← \frac{\lambda(t-1)Ẽ_i + y_{t,i}^2}{t}.

Typical choices are 0.96 ≤ λ ≤ 0.98 (11). As long as the values of x_t do not vary wildly, the exact value of λ is not crucial. We use λ = 0.96 throughout. A value of λ = 1 makes sense when we know that the sequence is stationary (rarely true in practice, as most sequences gradually drift). Note that the value of λ does not affect the computation cost of our method. In this sense, an exponential forgetting factor is more appealing than a sliding window, as the latter has explicit buffering requirements.

6. Putting SPIRIT to work

We show how we can exploit the correlations and hidden variables discovered by SPIRIT to do (a) forecasting, (b) missing value estimation, (c) summarisation of the large number of streams into a small, manageable number of hidden variables, and (d) outlier detection.

Forecasting and missing values

The hidden variables y_t give us a much more compact representation of the "raw" variables x_t, with guarantees of high reconstruction accuracy (in terms of relative squared error, which is less than 1 − f_E). When our streams exhibit correlations, as we often expect to be the case, the number k of hidden variables is much smaller than the number n of streams. Therefore, we can apply any forecasting algorithm to the vector of hidden variables y_t, instead of the raw data vector x_t. This reduces the time and space complexity by orders of magnitude, because typical forecasting methods are quadratic or worse in the number of variables.

In particular, we fit the forecasting model on the y_t instead of x_t. The model provides an estimate ŷ_{t+1} = f(y_t) and we can use this to get an estimate for

x̂_{t+1} := ŷ_{t+1,1} w_1[t] + ··· + ŷ_{t+1,k} w_k[t],

using the weight estimates w_i[t] from the previous time tick t. We chose auto-regression for its intuitiveness and simplicity, but any online method can be used.
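A sketch of the idea follows. Assumptions: one AR(ell) model per hidden variable, fit here with a plain batch least-squares call for brevity instead of the RLS updates of Section 3; Y_hist is the t × k history of hidden variables and W_cur the current k × n matrix whose rows are the w_i[t].

```python
import numpy as np

def forecast_next(Y_hist, W_cur, ell=3):
    """Predict x_{t+1}: fit AR(ell) per hidden-variable series, predict y_{t+1},
    then map back through the current participation weights."""
    t, k = Y_hist.shape
    y_hat = np.zeros(k)
    for i in range(k):
        s = Y_hist[:, i]
        # design matrix: the row for target s[m] holds [s[m-1], ..., s[m-ell]]
        A = np.column_stack([s[ell - j - 1 : t - j - 1] for j in range(ell)])
        phi, *_ = np.linalg.lstsq(A, s[ell:], rcond=None)
        y_hat[i] = s[-1:-ell - 1:-1] @ phi      # newest ell values, newest first
    return y_hat @ W_cur                        # x_hat_{t+1} = sum_i y_hat_i * w_i
```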


Correlations Since the principal directions are orthogonal (wi ⊥ wj , i 6= j),the components ofyt areby construction uncorrelated—the correlations havealready been captured by thewi, 1 ≤ i ≤ k. We can take advantage of thisde-correlation reduce forecasting complexity. In particular for auto-regression,we found that one AR model per hidden variable provides results comparableto multivariate AR.

Auto-regression Space complexity for multivariate AR (e.g., MUSCLES (24))isO(n3ℓ2), whereℓ is the auto-regression window length. For AR per stream(ignoring correlations), it isO(nℓ2). However, for SPIRIT, we needO(kn)space for thewi and, with one AR model peryi, the total space complexity isO(kn + kℓ2). As published, MUSCLES requires space that grows cubicallywith respect to the number of streamsn. We believe it can be made to work withquadratic space, but this is still prohibitive. Both AR per stream and SPIRITrequire space that grows linearly with respect ton, but in SPIRITk is typicallyvery small (k ≪ n) and, in practice, SPIRIT requires less memory and timeper update than AR per stream. More importantly, a single, independent ARmodel per stream cannot captureanycorrelations, whereas SPIRIT indirectlyexploits the correlations presentwithin a time tick.

Missing values When we have a forecasting model, we can use the forecast based on xt−1 to estimate missing values in xt. We then use these estimated missing values to update the weight estimates, as well as the forecasting models. Forecast-based estimation of missing values is the most time-efficient choice and gives very good results.

Interpretation

At any given time t, SPIRIT readily provides two key pieces of information (aside from the forecasts, etc.): (i) The number of hidden variables k. (ii) The weights wi,j, 1 ≤ i ≤ k, 1 ≤ j ≤ n. Intuitively, the magnitude |wi,j| of each weight tells us how much the i-th hidden variable contributes to the reconstruction of the j-th stream.
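For example, a small helper like the following (an illustrative sketch of ours, not part of SPIRIT itself) ranks, for each hidden variable, the streams with the largest absolute participation weights:

```python
import numpy as np

def top_contributors(W, m=3):
    """For each hidden variable (column of W, shape (n, k)),
    return the indices of the m streams with largest |w_{i,j}|."""
    return [list(np.argsort(-np.abs(W[:, i]))[:m]) for i in range(W.shape[1])]
```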

In the chlorine example during phase 1 (see Figure 12.1), the dataset has only one hidden variable, because one sinusoidal-like pattern can reconstruct both streams (albeit with different weights for each). Thus, SPIRIT correctly identifies correlated streams. When the correlation was broken, SPIRIT introduces enough hidden variables to capture that. Finally, it also spots that, in phase 3, normal operation is reestablished and thus disposes of the unnecessary hidden variable. Section 7 has additional examples of how we can intuitively interpret this information.


Table 12.2. Description of datasets.

Dataset    n    k    Description
Chlorine   166  2    Chlorine concentrations from EPANET.
Critter    8    1–2  Temperature sensor measurements.
River      3    1    River gauge data from USACE.
Motes      54   2–4  Light sensor measurements.

7. Experimental case studies

In this section we present case studies on real and realistic datasets to demonstrate the effectiveness of our approach in discovering the underlying correlations among streams. In particular, we show that: (i) We capture the appropriate number of hidden variables. As the streams evolve, we capture these changes in real-time (21) and adapt the number of hidden variables k and the weights wi. (ii) We capture the essential behaviour with very few hidden variables and small reconstruction error. (iii) We successfully deal with missing values. (iv) We can use the discovered correlations to perform good forecasting, with much fewer resources. (v) We can easily spot outliers. (vi) Processing time per stream is constant. Section 8 elaborates on performance and accuracy.

Chlorine concentrations

Description The Chlorine dataset was generated by EPANET 2.0¹ that accurately simulates the hydraulic and chemical phenomena within drinking water distribution systems. Given a network as the input, EPANET tracks the flow of water in each pipe, the pressure at each node, the height of water in each tank, and the concentration of a chemical species throughout the network, during a simulation period comprised of multiple timestamps. We monitor the chlorine concentration level at all the 166 junctions of a water distribution network, for 4310 timestamps during 15 days (one time tick every five minutes). The data was generated by using the input network with the demand patterns, pressures, flows specified at each node.

Data characteristics The two key features are: (i) A clear global periodic pattern (daily cycle, dominating residential demand pattern). Chlorine concentrations reflect this, with few exceptions. (ii) A slight time shift across different junctions, which is due to the time it takes for fresh water to flow down the pipes from the reservoirs. Thus, most streams exhibit the same sinusoidal-like pattern, except with gradual phase shifts as we go further away from the reservoir.

¹ http://www.epa.gov/ORD/NRMRL/wswrd/epanet.html


Figure 12.3. Chlorine dataset: (a) actual measurements and reconstruction at four junctions. We plot only 500 consecutive timestamps (the patterns repeat after that). (b) SPIRIT's hidden variables.


Results of SPIRIT SPIRIT can successfully summarise the data using just two numbers (hidden variables) per time tick, as opposed to the original 166 numbers. Figure 12.3a shows the reconstruction for four of the sensors (out of 166). Only two hidden variables give very good reconstruction.

Interpretation The two hidden variables (Figure 12.3b) reflect the two key dataset characteristics. The first hidden variable captures the global, periodic pattern. The second one also follows a very similar periodic pattern, but with a slight "phase shift." It turns out that the two hidden variables together are sufficient to express (via a linear combination) any other time series with an arbitrary "phase shift."

Light measurements

Description The Motes dataset consists of light intensity measurements collected using Berkeley Mote sensors, at several different locations in a lab, over a period of a month.

Data characteristics The main characteristics are: (i) A clear global periodicpattern (daily cycle). (ii) Occasional big spikes from some sensors (outliers).

Results of SPIRIT SPIRIT detects four hidden variables (see Figure 12.4b). Two of these are intermittent and correspond to outliers, or changes in the correlated trends. We show the reconstructions for some of the observed variables in Figure 12.4a.

Interpretation In summary, the first two hidden variables (see Figure 12.4b) correspond to the global trend and the last two, which are intermittently present, correspond to outliers. In particular, the first hidden variable captures the global periodic pattern.


Figure 12.4. Mote dataset: (a) shows the measurements (bold) and reconstruction (thin) on nodes 31 and 32. (b) the third and fourth hidden variables are intermittent and indicate anomalous behaviour (axes limits are different in each plot).

The interpretation of the second one is again similar to the Chlorine dataset. The first two hidden variables together are sufficient to express arbitrary phase shifts. The third and fourth hidden variables indicate some of the potential outliers in the data. For example, there is a big spike in the 4th hidden variable at time t = 1033, as shown in Figure 12.4b. Examining the participation weights w4 at that timestamp, we can find the corresponding sensors "responsible" for this anomaly, i.e., those sensors whose participation weights have very high magnitude. Among these, the most prominent are sensors 31 and 32. Looking at the actual measurements from these sensors, we see that before time t = 1033 they are almost 0. Then, very large increases occur around t = 1033, which bring an additional hidden variable into the system.
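This kind of attribution is easy to automate. The sketch below (our own illustration, mimicking the sensor 31/32 example) flags time ticks where an intermittent hidden variable spikes and reports the sensors whose participation weights have the largest magnitude at that tick; the threshold and the "top-3" cutoff are assumptions for the example.

```python
import numpy as np

def attribute_spike(y_hidden, W_history, threshold):
    """Report, for each tick where |y_t| of the chosen hidden variable
    exceeds threshold, the sensors with the largest |participation weight|.

    y_hidden  : array (T,), values of the chosen hidden variable
    W_history : array (T, n), weight vector of that hidden variable at each tick
    """
    hits = np.nonzero(np.abs(y_hidden) > threshold)[0]
    report = {}
    for t in hits:
        w = np.abs(W_history[t])
        report[int(t)] = list(np.argsort(-w)[:3])  # top-3 "responsible" sensors
    return report
```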

Room temperatures

Description The Critter dataset consists of 8 streams (see Figure 12.5a). Each stream comes from a small sensor² (aka Critter) that connects to the joystick port and measures temperature. The sensors were placed in 5 neighbouring rooms. Each time tick represents the average temperature during one minute.

Furthermore, to demonstrate how the correlations capture information about missing values, we repeated the experiment after blanking 1.5% of the values (five blocks of consecutive timestamps; see Figure 12.6).

Data characteristics Overall, the dataset does not seem to exhibit a clear trend. Upon closer examination, all sensors fluctuate slightly around a constant temperature (which ranges from 22–27°C, or 72–81°F, depending on the sensor). Approximately half of the sensors exhibit a more similar "fluctuation pattern."

² http://www.ices.cmu.edu/sensornets/


Figure 12.5. Critter data and SPIRIT output, for each of the temperature sensors, in (a). SPIRIT can track the overall behaviour of the entire stream collection with only two hidden variables, shown in (c). For the comparison in (b), wall clock times are 1.5 minutes (repeated PCA) versus 7 seconds (SPIRIT).


Results of SPIRIT SPIRIT discovers one hidden variable, which is sufficient to capture the general behaviour. However, if we utilise prior knowledge (such as, e.g., that the pre-set temperature was 23°C), we can ask SPIRIT to detect trends with respect to that. In that case, SPIRIT comes up with two hidden variables, which we explain later.

SPIRIT is also able to deal successfully with missing values in the streams. Figure 12.6 shows the results on the blanked version (1.5% of the total values in five blocks of consecutive timestamps, starting at a different position for each stream) of Critter. The correlations captured by SPIRIT's hidden variable often provide useful information about the missing values. In particular, on sensor 8 (second row, Figure 12.6), the correlations picked by the single hidden variable successfully capture the missing values in that region (consisting of 270 ticks). On sensor 7 (first row, Figure 12.6; 300 blanked values), the upward trend in the blanked region is also picked up by the correlations. Even though the trend is slightly mis-estimated, as soon as the values are observed again, SPIRIT very quickly gets back to near-perfect tracking.


Figure 12.6. Detail of the forecasts on Critter with blanked values. The second row shows that the correlations picked by the single hidden variable successfully capture the missing values in that region (consisting of 270 consecutive ticks). In the first row (300 consecutive blanked values), the upward trend in the blanked region is also picked up by the correlations to other streams. Even though the trend is slightly mis-estimated, as soon as the values are observed again SPIRIT quickly gets back to near-perfect tracking.

Interpretation If we examine the participation weights in w1, the largest correspond primarily to streams 5 and 6, and then to stream 8. If we examine the data, sensors 5 and 6 consistently have the highest temperatures. Sensor 8 also has a similar temperature most of the time.

However, if the sensors are calibrated based on the fact that these are building temperature measurements, where we have set the thermostat to 23°C (73°F), then SPIRIT discovers two hidden variables (see Figure 12.5c). More specifically, if we reasonably assume that we have the prior knowledge of what the temperature should be (note that this has nothing to do with the average temperature in the observed data) and want to discover what happens around that temperature, we can subtract it from each observation and SPIRIT will discover patterns and anomalies based on this information. Actually, this is what a human operator would be interested in discovering: "Does the system work as I expect it to?" (based on my knowledge of how it should behave) and "If not, what is wrong?" and we indeed discover this kind of information.

The interpretation of the first hidden variable is similar to that of the original signal: sensors 5 and 6 (and, to a lesser extent, 8) deviate from that temperature the most, for most of the time. Maybe the thermostats are broken or set wrong?

For w2, the largest weights correspond to sensors 1 and 3, then to 2 and 4. If we examine the data, we notice that these streams follow a similar, fluctuating trend (close to the pre-set temperature), the first two varying more violently. The second hidden variable is added at time t = 2016. If we examine the plots, we see that, at the beginning, most streams exhibit a slow dip and then ascent (e.g., see 2, 4 and 5 and, to a lesser extent, 3, 7 and 8). However, a number of them start fluctuating more quickly and violently when the second hidden variable is added.


Figure 12.7. Actual River data (river gauges, in feet) and SPIRIT output, for each of the streams (no pun intended). The large portions with missing values across all streams are marked with dotted lines (there are also other missing values in some of the streams); about 26% of all values are missing, but this does not affect SPIRIT's tracking abilities.

River gauges

Description The dataset was collected from the USACE current river conditions website³. It consists of river stage (or, water level) data from three different measuring stations in the same river system (see Figure 12.7).

Data characteristics The data exhibit one common trend and have plenty of missing values (26% of all values, for all three streams).

Results and interpretation Examining the three hidden-variable weights found by SPIRIT, these have ratios 1.5 : 1.1 : 1. Indeed, if we look at all 20,000 time ticks, this is what we see; all streams are very similar (since they are from the same river), with the "amplitude" of the fluctuations having roughly these proportions. Hence, one hidden variable is sufficient, the three weights compactly describe the key information and the interpretation is intuitive.

Besides recovering missing values from underlying correlations captured by the few hidden variables, SPIRIT's tracking abilities are not affected even in extreme cases.

8. Performance and accuracy

In this section we discuss performance issues. First, we show that SPIRIT requires very limited space and time. Next, we elaborate on the accuracy of SPIRIT's incremental estimates.

³ http://www.lrp.usace.army.mil/current/


Figure 12.8. Wall-clock times (including time to update forecasting models). Times for AR and MUSCLES are not shown, since they are off the charts from the start (13.2 seconds in (a) and 209 in (b)). The starting values are: (a) 1000 time ticks, (b) 50 streams, and (c) 2 hidden variables (the other two held constant for each graph). It is clear that SPIRIT scales linearly.

Time and space requirements

Figure 12.8 shows that SPIRIT scales linearly with respect to the number of streams n and the number of hidden variables k. AR per stream and MUSCLES are essentially off the charts from the very beginning. Furthermore, SPIRIT scales linearly with stream size (i.e., requires constant processing time per tuple).

The plots were generated using a synthetic dataset that allows us to control the relevant parameters independently. First, we choose the number k of trends and generate sine waves with different frequencies, say y_{t,i} = sin(2π(i/k)t), 1 ≤ i ≤ k. Thus, all trends are pairwise linearly independent. Next, we generate each of the n streams as random linear combinations of these k trend signals. This scheme allows us to vary k, n and the length of the streams at will. For each experiment shown, one of these parameters is varied and the other two are held fixed. The numbers in Figure 12.8 are wall-clock times of our Matlab implementation. Both AR per stream as well as MUSCLES (also in Matlab) are several orders of magnitude slower and thus omitted.
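A minimal version of this generation scheme, under the stated assumptions (k sinusoidal trends, n streams formed as random linear combinations), might look like the sketch below; the particular frequencies, the Gaussian mixing weights, and the function name are illustrative choices of ours, not the exact generator used for Figure 12.8.

```python
import numpy as np

def make_synthetic(n=50, k=2, T=1000, seed=0):
    """Generate n streams of length T as random linear combinations
    of k sine-wave trends with pairwise different frequencies."""
    rng = np.random.default_rng(seed)
    t = np.arange(T)
    # distinct, slowly oscillating frequencies (cycles per tick), chosen for illustration
    freqs = np.arange(1, k + 1) / (10.0 * k)
    trends = np.column_stack([np.sin(2 * np.pi * f * t) for f in freqs])  # (T, k)
    mix = rng.standard_normal((k, n))                                     # random mixing weights
    return trends @ mix                                                   # (T, n) stream matrix
```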

We have also implemented the SPIRIT algorithms in a real system (21), which can obtain measurements from sensor devices and display hidden variables and trends in real-time.

Accuracy

In terms of accuracy, everything boils down to the quality of the summary provided by the hidden variables. To this end, we show the reconstruction x̃t of xt, from the hidden variables yt, in Figure 12.5(b). One line uses the true principal directions, the other the SPIRIT estimates (i.e., weight vectors). SPIRIT comes very close to repeated PCA.

We should note that this is an unfair comparison for SPIRIT, since repeated PCA requires (i) storing all stream values, and (ii) performing a very expensive SVD computation for each time tick. However, the tracking is still very good.


Table 12.3. Reconstruction accuracy (mean squared error rate).

Dataset                   Chlorine   Critter   Motes
MSE rate (SPIRIT)         0.0359     0.0827    0.0669
MSE rate (repeated PCA)   0.0401     0.0822    0.0448

Figure 12.9. Hidden variable tracking accuracy: |cos(v1, w1)| and |cos(v2, w2)| over time, for (a) λ = 1 and (b) λ = 0.96.


Reconstruction error Table 12.3 shows the reconstruction error, ∑_t ‖x̃_t − x_t‖^2 / ∑_t ‖x_t‖^2, achieved by SPIRIT. In every experiment, we set the energy thresholds to [fE, FE] = [0.95, 0.98]. Also, as pointed out before, we set λ = 0.96 as a reasonable default value to deal with non-stationarities that may be present in the data, according to recommendations in the literature (11). Since we want a metric of overall quality, the MSE rate weighs each observation equally and does not take into account the forgetting factor λ.

Still, the MSE rate is very close to the bounds we set. In Table 12.3 we also show the MSE rate achieved by repeated PCA. As pointed out before, this is already an unfair comparison. In this case, we set the number of principal components k to the maximum that SPIRIT uses at any point in time. This choice favours repeated PCA even further. Despite this, the reconstruction errors of SPIRIT are close to the ideal, while using orders of magnitude less time and space.
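For reference, the error metric used here is simply the relative squared reconstruction error over all ticks; a direct computation (our sketch, with X and X_rec as time-by-stream matrices) is:

```python
import numpy as np

def mse_rate(X, X_rec):
    """Relative squared reconstruction error:
    sum_t ||x_t - x~_t||^2 / sum_t ||x_t||^2."""
    return np.sum((X - X_rec) ** 2) / np.sum(X ** 2)
```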

Finally, Figure 12.9 illustrates the convergence to the "true" principal component directions on a synthetic dataset. First, we compare against the PCA of the entire data X, with λ = 1. We see convergence is almost immediate. This is always the case, provided the corresponding eigenvalue is large enough and fairly well-separated from the others. However, if the eigenvalue is small, then the corresponding hidden variable is of no importance and we do not track it anyway. When λ < 1, the problem is harder, because the wi gradually shift over time.


For λ < 1, we compare against repeated PCA (using the first t rows of X, appropriately weighted). We see that we can still track the shifting principal component directions well.

9. Conclusion

We focus on finding patterns, correlations and hidden variables in a large number of streams. SPIRIT has the following desirable characteristics:

(i) It discovers underlying correlations among multiple streams, incrementally and in real-time (21), and provides a very compact representation of the stream collection, via a few hidden variables.

(ii) It automatically estimates the number k of hidden variables to track, and it can automatically adapt, if k changes (e.g., an air-conditioner switching on, in a temperature sensor scenario).

(iii) It scales up extremely well, both on database size (i.e., number of time ticks t), and on the number n of streams. Therefore it is suitable for a large number of sensors / data sources.

(iv) Its computation demands are low: it only needs O(nk) floating point operations, with no matrix inversions nor SVD (both infeasible in online, any-time settings). Its space demands are similarly limited.

(v) It can naturally hook up with any forecasting method, and thus easily do prediction, as well as handle missing values.

We showed that the output of SPIRIT has a natural interpretation. We evaluated our method on several datasets, where indeed it discovered the hidden variables. Moreover, SPIRIT-based forecasting was several times faster than other methods.

Acknowledgments

We wish to thank Michael Bigrigg for the temperature sensor data and Orna Raz for collecting the river gauge data.


References

[1] Aggarwal, Charu C., Han, Jiawei, and Yu, Philip S. (2003). A framework for clustering evolving data streams. In VLDB.

[2] Ali, M. H., Mokbel, Mohamed F., Aref, Walid, and Kamel, Ibrahim (2005). Detection and tracking of discrete phenomena in sensor network databases. In SSDBM.

[3] Brockwell, Peter J. and Davis, Richard A. (1991). Time Series: Theory and Methods. Springer Series in Statistics. Springer-Verlag, 2nd edition.

[4] Deligiannakis, Antonios, Kotidis, Yiannis, and Roussopoulos, Nick (2004). Compressing historical information in sensor networks. In SIGMOD.

[5] Diamantaras, Kostas I. and Kung, Sun-Yuan (1996). Principal Component Neural Networks: Theory and Applications. John Wiley.

[6] Domingos, Pedro and Hulten, Geoff (2000). Mining high-speed data streams. In KDD.

[7] Fukunaga, Keinosuke (1990). Introduction to Statistical Pattern Recognition. Academic Press.

[8] Ganti, Venkatesh, Gehrke, Johannes, and Ramakrishnan, Raghu (2002). Mining data streams under block evolution. SIGKDD Explorations, 3(2):1–10.

[9] Guha, Sudipto, Gunopulos, Dimitrios, and Koudas, Nick (2003a). Correlating synchronous and asynchronous data streams. In KDD.

[10] Guha, Sudipto, Meyerson, Adam, Mishra, Nina, Motwani, Rajeev, and O'Callaghan, Liadan (2003b). Clustering data streams: Theory and practice. IEEE TKDE, 15(3):515–528.

[11] Haykin, Simon (1992). Adaptive Filter Theory. Prentice Hall.


[12] Hulten, Geoff, Spencer, Laurie, and Domingos, Pedro (2001). Mining time-changing data streams. In KDD.

[13] Jolliffe, I. T. (2002). Principal Component Analysis. Springer.

[14] Kailath, Thomas (1980). Linear Systems. Prentice Hall.

[15] Keogh, Eamonn, Lonardi, Stefano, and Ratanamahatana, Chotirat Ann (2004). Towards parameter-free data mining. In KDD.

[16] Lin, Jessica, Vlachos, Michail, Keogh, Eamonn, and Gunopulos, Dimitrios (2004). Iterative incremental clustering of time series. In EDBT.

[17] Oja, Erkki (1989). Neural networks, principal components, and subspaces. Intl. J. Neural Syst., 1:61–68.

[18] Palpanas, Themistoklis, Vlachos, Michail, Keogh, Eamonn, Gunopulos, Dimitrios, and Truppel, Wagner (2004). Online amnesic approximation of streaming time series. In ICDE.

[19] Papadimitriou, Spiros, Brockwell, Anthony, and Faloutsos, Christos (2003). Adaptive, hands-off stream mining. In VLDB.

[20] Sakurai, Yasushi, Papadimitriou, Spiros, and Faloutsos, Christos (2005). BRAID: Stream mining through group lag correlations. In SIGMOD.

[21] Sun, Jimeng, Papadimitriou, Spiros, and Faloutsos, Christos (2005). Online latent variable detection in sensor networks. In ICDE. (Demo).

[22] Wang, Haixun, Fan, Wei, Yu, Philip S., and Han, Jiawei (2003). Mining concept-drifting data streams using ensemble classifiers. In KDD.

[23] Yang, Bin (1995). Projection approximation subspace tracking. IEEE Trans. Sig. Proc., 43(1):95–107.

[24] Yi, Byoung-Kee, Sidiropoulos, N. D., Johnson, Theodore, Jagadish, H. V., Faloutsos, Christos, and Biliris, Alexandros (2000). Online data mining for co-evolving time sequences. In ICDE.

[25] Young, Peter (1984). Recursive Estimation and Time-Series Analysis: An Introduction. Springer-Verlag.

[26] Zhang, Tian, Ramakrishnan, Raghu, and Livny, Miron (1996). BIRCH: An efficient data clustering method for very large databases. In SIGMOD.

[27] Zhu, Yunyue and Shasha, Dennis (2002). StatStream: Statistical monitoring of thousands of data streams in real time. In VLDB.


Chapter 13

A SURVEY OF DISTRIBUTED MINING OF DATA STREAMS

Srinivasan Parthasarathy
Department of Computer Science and Engineering
The Ohio State University

[email protected]

Amol Ghoting
Department of Computer Science and Engineering
The Ohio State University

[email protected]

Matthew Eric Otey
Department of Computer Science and Engineering
The Ohio State University

[email protected]

Abstract With advances in data collection and generation technologies, organizations and researchers are faced with the ever growing problem of how to manage and analyze large dynamic datasets. Environments that produce streaming sources of data are becoming commonplace. Examples include stock market, sensor, web click stream, and network data. In many instances, these environments are also equipped with multiple distributed computing nodes that are often located near the data sources. Analyzing and monitoring data in such environments requires data mining technology that is cognizant of the mining task, the distributed nature of the data, and the data influx rate. In this chapter, we survey the current state of the field and identify potential directions of future research.

1. Introduction

Advances in technology have enabled us to collect vast amounts of data from various sources, whether they be from experimental observations, simulations, sensors, credit card transactions, or from networked systems.


To benefit from these enhanced data collecting capabilities, it is clear that semi-automated interactive techniques such as data mining should be employed to process and analyze the data. It is also desirable to have interactive response times to client queries, as the process is often iterative in nature (with a human in the loop). The challenges to meet these criteria are often daunting, as detailed next.

Although inexpensive storage space makes it possible to maintain vast volumes of data, accessing and managing the data becomes a performance issue. Often one finds that a single node is incapable of housing such large datasets. Efficient and adaptive techniques for data access, data storage and communication (if the data sources are distributed) are thus necessary. Moreover, data mining becomes more complicated in the context of dynamic databases, where there is a constant influx of data. Changes in the data can invalidate existing patterns or introduce new ones. Re-executing the algorithms from scratch leads to large computational and I/O overheads. These two factors have led to the development of distributed algorithms for analyzing streaming data, which is the focus of this survey article.

Many systems use a centralized model for mining multiple data streams [2]. Under this model the distributed data streams are directed to one central location before they are mined. A schematic diagram of a centralized data stream mining system is presented in Figure 13.1. Such a model of computation is limited in several respects. First, centralized mining of data streams can result in long response time. While distributed computing resources may be available, they are not fully utilized. Second, central collection of data can result in heavy traffic over critical communication links. If these communication links have limited network bandwidth, network I/O may become a performance bottleneck. Furthermore, in power constrained domains such as sensor networks, this can result in excessive power consumption due to excessive data communication.

To alleviate the aforementioned problems, several researchers have proposed a model that is aware of the distributed sources of data, computational resources, and communication links. A schematic diagram of such a distributed stream mining system is presented in Figure 13.1 and can be contrasted with the centralized model. In the model of distributed stream mining, instead of offloading the data to one central location, the distributed computing nodes perform parts of the computation close to the data, while communicating the local models to a central site as and when needed. Such an architecture provides several benefits. First, by using distributed computing nodes, it allows the derivation of a greater degree of parallelism, thus reducing response time. Second, as only local models need to be communicated, communication can potentially be reduced, improving scalability, and reducing power consumption in power constrained domains.


Figure 13.1. Centralized Stream Processing Architecture (left); Distributed Stream Processing Architecture (right).

This chapter presents a brief overview of distributed stream mining algorithms, systems support, and applications, together with emerging research directions. We attempt to characterize and classify these approaches as to whether they belong in the centralized model or the distributed model. The rest of this chapter is organized as follows. First, we present distributed stream mining algorithms for various mining tasks such as outlier detection, clustering, frequent itemset mining, classification, and summarization. Second, we present an overview of distributed stream mining in resource constrained domains. Third, we summarize research efforts on building systems support for facilitating distributed stream mining. Finally, we conclude with emerging research directions in distributed stream mining.

2. Outlier and Anomaly Detection

The goal in outlier or anomaly detection is to find data points that are most different from the remaining points in the data set [4]. Most outlier detection algorithms are schemes in which the distance between every pair of points is calculated, and the points most distant from all other points are marked as outliers [29]. This is an O(n²) algorithm that assumes a static data set. Such approaches are difficult to extend to distributed streaming data sets. Points in these data sets arrive at multiple distributed end-points, which may or may not be compute nodes, and must be processed incrementally. Such constraints lead us away from purely distance-based approaches, and towards more heuristic techniques.


Note that the central issue in many anomaly detection systems is to identify anomalies in real-time, or as close to real time as possible, thus making it a natural candidate for many streaming applications. Moreover, oftentimes the data is produced at disparate sites, making distributed stream mining a natural fit for this domain. In this section we review the work in outlier or anomaly detection most germane to distributed stream mining.

Various application-specific approaches to outlier/anomaly detection have been proposed in the literature. An approach [39] has been presented for distributed deviation detection in sensor networks. This approach is tailored to the sensor network domain and targets misbehaving sensors. The approach maintains density estimates of values seen by a sensor, and flags a sensor to be a misbehaving sensor if its value deviates significantly from the previously observed values. This computation is handled close to the sensors in a distributed fashion, with only results being reported to the central server as and when needed.

One of the most popular applications of distributed outlier detection is that of network intrusion detection. Recent trends have demanded a distributed approach to intrusion detection on the Internet. The first of these trends is a move towards distributed intrusions and attacks, that is to say, intrusions and attacks originating from a diverse set of hosts on the Internet. Another trend is the increasingly heterogeneous nature of the Internet, where different hosts, perhaps residing in the same subnetwork, have differing security requirements. For example, there have been proposals for distributed firewalls [20] for fulfilling diverse security requirements. Also, the appearance of mobile and wireless computing has created dynamic network topologies that are difficult, if not impossible, to protect from a centralized location. Efficient detection and prevention of these attacks requires distributed nodes to collaborate. By itself, a node can only collect information about the state of the network immediately surrounding it, which may be insufficient to detect distributed attacks. If the nodes collaborate by sharing network audit data, host watch lists, and models of known network attacks, each can construct a better global model of the network.

Otey et al [36] present a distributed outlier detection algorithm targeted at distributed online streams, specifically to process network data collected at distributed sites. Their approach finds outliers based on the number of attribute dependencies violated by a data point in continuous, categorical, and mixed attribute spaces. They maintain an in-memory structure that succinctly summarizes the required dependency information. In order to find exact outliers in a distributed streaming setting, the in-memory summaries would need to be exchanged frequently. These summaries can be large, and consequently, in a distributed setting, each distributed computing node only exchanges local outliers with the other computing nodes. A point is deemed to be a global outlier if every distributed node believes it to be an outlier based on its local model of normalcy.


While such an approach will only find approximate outliers, the authors show that this heuristic works well in practice. While the authors report that to find exact outliers they need to exchange a large summary, which leads to excessive communication, it could be possible to exchange only decisive parts of the summary, instead of the entire summary, in order to more accurately detect the true outliers. Furthermore, their in-memory summaries are large, as they summarize a large amount of dependency information. Reducing this memory requirement could potentially allow the use of this algorithm in resource-constrained domains.
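A hedged sketch of this combination rule (the names and structure are our own, not Otey et al's code): each site reports the identifiers of its local outliers, and a point is reported as a global outlier only if every site flags it.

```python
def global_outliers(local_outlier_sets):
    """Combine per-site outlier reports: a point is a global outlier
    only if every distributed node flags it locally.

    local_outlier_sets : list of sets of point identifiers, one per site
    """
    if not local_outlier_sets:
        return set()
    result = set(local_outlier_sets[0])
    for s in local_outlier_sets[1:]:
        result &= set(s)
    return result
```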

Emerald is an approach for collaborative intrusion detection for large networks within an enterprise [42]. This approach allows for distributed protection of the network through a hierarchy of surveillance systems that analyze network data at the service, domain, and enterprise-wide levels. However, Emerald does not provide mechanisms for allowing different organizations to collaborate. Locasto et al [33] examine techniques that allow different organizations to do such collaboration for enhanced network intrusion detection. If organizations can collaborate, then each can build a better model of global network activity, and more precise models of attacks (since they have more data from which to estimate the model parameters). This allows for better characterization and prediction of attacks. Collaboration is achieved through the exchange of Bloom filters, each of which encodes a list of IP addresses of suspicious hosts that a particular organization's Intrusion Detection System (IDS) has detected, as well as the ports which these suspicious hosts have accessed. The use of Bloom filters helps both to keep each collaborating organization's information confidential and to reduce the amount of data that must be exchanged.

A major limitation of this approach is that the information exchanged may not be sufficient to identify distributed attacks. For example, it is possible that an attack may originate from a number of hosts, none of which are suspicious enough to be included on any organization's watch list. However, the combined audit data collected by each organization's IDS may be sufficient to detect that attack. To implement such a system, two problems must be addressed. The first is that each organization may collect disjoint sets of features. Collaborating organizations must agree beforehand on a set of common features to use. Some ideas for common standards for intrusion detection have been realized with the Common Intrusion Detection Framework (CIDF) [31]. The second problem is that of the privacy of each organization's data. It may not be practical to use Bloom filters to encode a large set of features. However, techniques do exist for privacy-preserving data mining [28, 23, 32] that will allow organizations to collaborate without compromising the privacy of their data.

There have been other approaches for detecting distributed denial-of-service attacks. Lee et al have proposed a technique for detecting novel and distributed intrusions based on the aforementioned CIDF [31]. The approach not only allows nodes to share information with which they can detect distributed attacks, but also allows them to distribute models of novel attacks.


Yu et al propose a middleware-based approach to prevent distributed denial of service attacks [45]. Their approach makes use of Virtual Private Operation Environments (VPOE) to allow devices running the middleware to collaborate. These devices can act as firewalls or network monitors, and their roles can change as is necessary. Each device contains several modules, including an attack detection module, a signaling module for cooperating with other devices, and policy processing modules.

Some work in network intrusion detection has been done in the domain of mobile ad hoc networks (MANETs) [47, 18], where nodes communicate over a wireless medium. In MANETs, the topology is dynamic, and nodes must cooperate in order to route messages to their proper destinations. Because of the open communication medium, dynamic topology, and cooperative nature, MANETs are especially prone to network intrusions, and present difficulties for distributed intrusion detection.

To protect against intrusions, Zhang et al have proposed several intrusion detection techniques [46, 47]. In their proposed architecture, each node in the network participates in detection and response, and each is equipped with a local detection engine and a cooperative detection engine. The local detection engine is responsible for detecting intrusions from the local audit data. If a node has strong evidence that an intrusion is taking place, it can initiate a response to the intrusion. However, if the evidence is not sufficiently strong, it can initiate a global intrusion detection procedure through the cooperative detection engine. The nodes only cooperate by sharing their detection states, not their audit data, and so it is difficult for each node to build an accurate global model of the network with which to detect intrusions. In this case, intrusions detectable only at the global level (e.g., IP sweeps) will be missed. However, the authors do point out that they only use local data since the remote nodes may be compromised and their data may not be trustworthy.

In another paper [18], Huang and Lee present an alternative approach to intrusion detection in MANETs. In this work, the intrusions to be detected are attacks against the structure of the network itself. Such intrusions are those that corrupt routing tables and protocols, intercept packets, or launch network-level denial-of-service attacks. Since MANETs typically operate on battery power, it may not be cost effective for each node to constantly run its own intrusion detection system, especially when there is a low threat level. The authors propose that a more effective approach would be for a cluster of nodes in a MANET to elect one node as a monitor (the clusterhead) for the entire cluster. Using the assumption that each node can overhear network traffic in its transmission range, and that the other cluster members can provide (some of) the features (since the transmission ranges of the clusterhead and the other cluster members may not overlap, the other cluster members may have statistics on portions of the cluster not accessible to the clusterhead), the clusterhead is responsible for analyzing the flow of packets in its cluster in order to detect intrusions and initiate a response.


In order for this intrusion detection approach to be effective, the election of the clusterhead must be fair, and each clusterhead must serve an equal amount of time. The first requirement ensures that the election of the clusterhead is unbiased (i.e., a compromised node cannot tilt the election in its favor), and the second requirement ensures that a compromised node cannot force out the current clusterhead nor remain as clusterhead for an unlimited period of time. There is a good division of labor, as the clusterhead is the only member of the cluster that must run the intrusion detection system; the other nodes need only collect data and send it to the clusterhead. However, a limitation of this approach is that not all intrusions are visible at the global level, especially given the feature set the detection system uses (statistics on the network topology, routes, and traffic). Such local intrusions include exploits of services running on a node, which may only be discernible using the content of the traffic.

3. Clustering

The goal in clustering is to partition a set of points into groups such that points within a group are similar in some sense and points in different groups are dissimilar in the same sense. In the context of distributed streams, one would want to process the data streams in a distributed fashion, while communicating the summaries, and to arrive at a global clustering of the data points. Guha et al [17] present an approach for clustering data streams. Their approach produces a clustering of the points seen using small amounts of memory and time. The summarized data consists of the cluster centers together with the number of points assigned to that cluster. The k-median algorithm is used as the underlying clustering mechanism. The resulting clustering is a constant factor approximation of the true clustering. As has been shown in [16], this algorithm can be easily extended to operate in a distributed setting. Essentially, clusterings from each distributed site can be combined and clustered to find the global clustering with the same approximation factor. From a qualitative standpoint, in many situations, k-median clusters are known to be less desirable than those formed by other clustering techniques. It would be interesting to see if other clustering algorithms that produce more desirable clusterings can be extended with the above methodology to operate over distributed streams.
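The two-level scheme can be illustrated with a small sketch (ours, using weighted k-means in place of the k-median subroutine for brevity, so it does not carry the same approximation guarantee): each site sends its cluster centers together with the number of points assigned to each, and the central site clusters these weighted centers to obtain a global clustering.

```python
import numpy as np

def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Plain weighted k-means (Lloyd's algorithm) on summary points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = np.average(points[mask], axis=0, weights=weights[mask])
    return centers

def merge_local_clusterings(local_summaries, k):
    """local_summaries: list of (centers, counts) pairs, one per site."""
    all_centers = np.vstack([c for c, _ in local_summaries]).astype(float)
    all_counts = np.concatenate([w for _, w in local_summaries]).astype(float)
    return weighted_kmeans(all_centers, all_counts, k)
```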

Januzaj et al [21] present a distributed version of the density-based clustering algorithm, DBSCAN. Essentially, each site builds a local density-based clustering, and then communicates a summary of the clustering to a central site. The central site performs a density-based clustering on the summaries obtained from all sites to find a global clustering.


This clustering is relayed back to the distributed sites that update their local clusterings based on the discovered global clustering. While this approach is not capable of processing dynamic data, in [13] the authors have shown that density-based clustering can be performed incrementally. Therefore, a distributed and incremental version of DBSCAN can potentially be devised. However, like the distributed version presented by Januzaj et al, we cannot provide a guarantee on the quality of the result.

Beringer and Hullermeir consider the problem of clustering parallel data streams [5]. Their goal is to find correlated streams as they arrive synchronously. The authors represent the data streams using exponentially weighted sliding windows. The discrete Fourier transform is computed incrementally, and k-Means clustering is performed in this transformed space at regular intervals of time. Data streams belonging to the same cluster are considered to be correlated. While the processing is centralized, the approach can be tailored to correlate distributed data streams. Furthermore, the approach is suitable for online streams. It is possible that this approach can be extended to a distributed computing environment. The Fourier coefficients can be exchanged incrementally and aggregated locally to summarize remote information. Furthermore, one can potentially produce approximate results by only exchanging the significant coefficients.
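A rough sketch of the transformed representation (ours; the window length, the number of retained coefficients, and the batch FFT are illustrative choices rather than the incremental computation described above): map each stream's recent window to a few DFT coefficients, which can then be fed to any k-means implementation.

```python
import numpy as np

def stream_features(windows, n_coeffs=8):
    """Map each stream's recent window to its first few DFT coefficients.

    windows : array (n_streams, w), the current sliding window of each stream
    Returns an array (n_streams, 2*n_coeffs) of real/imaginary parts,
    suitable as input to any k-means routine.
    """
    spectra = np.fft.rfft(windows, axis=1)[:, :n_coeffs]
    return np.hstack([spectra.real, spectra.imag])
```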

4. Frequent itemset mining

The goal in frequent itemset mining is to find groups of items or values that co-occur frequently in a transactional data set. For instance, in the context of market data analysis, a frequent two-itemset could be {beer, chips}, which means that people frequently buy beer and chips together. The goal in frequent itemset mining is to find all itemsets in a data set that occur at least x number of times, where x is the minimum support parameter provided by the user.

Frequent itemset mining is both CPU and I/O intensive, making it very costly to completely re-mine a dynamic data set any time one or more transactions are added or deleted. To address the problem of mining frequent itemsets from dynamic data sets, several researchers have proposed incremental techniques [10, 11, 14, 30, 43, 44]. Incremental algorithms essentially re-use previously mined information and try to combine this information with the fresh data to efficiently compute the new set of frequent itemsets. However, it can be the case that the database may be distributed over multiple sites, and is being updated at different rates at each site, which requires the use of distributed asynchronous frequent itemset mining techniques.

Otey et al [38] present a distributed incremental algorithm for frequent itemset mining. The approach is capable of incrementally finding maximal frequent itemsets in dynamic data.


Maximal frequent itemsets are those that do not have any frequent supersets, and the set of maximal frequent itemsets determines the complete set of frequent itemsets. Furthermore, it is capable of mining frequent itemsets in a distributed setting. Distributed sites can exchange their local maximal frequent itemsets to obtain a superset of the global maximal frequent itemsets. This superset is then exchanged between all nodes so that their local counts may be obtained. In the final round of communication, a reduction operation is performed to find the exact set of global maximal frequent itemsets.

Manku and Motwani [35] present an algorithm for mining frequent itemsets over data streams. In order to mine all frequent itemsets in constant space, they employ a down-counting approach. Essentially, they update the support counts for the discovered itemsets as the data set is processed. Furthermore, for all the discovered itemsets, they decrement the support count by a specific value. As a result, itemsets that occur rarely will have their count set to zero and will eventually be eliminated from the list. If they reappear later, their count is approximated. While this approach is tailored to data streams, it is not distributed. The methodology proposed in [38] can potentially be applied to this algorithm to process distributed data streams.
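The down-counting idea can be illustrated for single items with a simplified sketch (ours, in the spirit of the approach rather than a faithful implementation of it; the decrement period is an illustrative parameter): counts are maintained incrementally, periodically decremented, and entries that reach zero are dropped, so memory stays bounded at the cost of approximate counts.

```python
def process_stream(stream, decrement_every=1000):
    """Approximate frequent-item counting with periodic down-counting.

    stream : iterable of items
    Returns a dict of surviving items and their (approximate) counts.
    """
    counts = {}
    for t, item in enumerate(stream, start=1):
        counts[item] = counts.get(item, 0) + 1
        if t % decrement_every == 0:
            # decrement every tracked item and evict those that reach zero
            for key in list(counts):
                counts[key] -= 1
                if counts[key] <= 0:
                    del counts[key]
    return counts
```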

Manjhi et al [34] extend Manku and Motwani's approach to find frequent items in the union of multiple distributed streams. The central issue is how to best manage the degree of approximation performed as partial synopses from multiple nodes are combined. They characterize this process for hierarchical communication topologies in terms of a precision gradient followed by synopses as they are passed from leaves to the root and combined incrementally. They studied the problem of finding the optimal precision gradient under two alternative and incompatible optimization objectives: (1) minimizing load on the central node to which answers are delivered, and (2) minimizing worst-case load on any communication link. While this approach targets frequent items only, it would be interesting to see if it can be extended to find frequent itemsets.

5. Classification

Hulten and Domingos [19] present a one-pass decision tree construction algorithm for streaming data. They build a tree incrementally by observing data as it streams in and splitting a node in the tree when a sufficient number of samples have been seen. Their approach uses the Hoeffding inequality to converge to a sample size. Jin and Agrawal revisit this problem and present solutions that speed up split point calculation as well as reduce the desired sample size to achieve the same level of accuracy [22]. Both of these approaches are not capable of processing distributed streams.
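The core of the Hoeffding-based stopping rule can be stated in a few lines (our sketch, not the authors' code): with probability 1 − δ, the true mean of a random variable with range R lies within ε of the sample mean after n observations, so a split attribute can be committed to once the observed gain gap between the best and second-best attribute exceeds ε.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that the sample mean of n observations is within
    epsilon of the true mean with probability 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def can_split(best_gain, second_gain, value_range, delta, n):
    """Decide whether enough samples have been seen to commit to a split."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```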


Kargupta and Park present an approach for aggregating decision trees constructed at distributed sites [26]. As each decision tree can be represented as a numeric function, the authors propose to transmit and aggregate these trees by using their Fourier representations. They also show that the Fourier-based representation is suitable for approximating a decision tree, and thus, suitable for transmission in bandwidth-limited mobile environments. Coupled with a streaming decision tree construction algorithm, this approach should be capable of processing distributed data streams.

Chen et al [8] present a collective approach to mine Bayesian networks from distributed heterogeneous web-log data streams. In their approach, they learn a local Bayesian network at each site using the local data. Then each site identifies the observations that are most likely to be evidence of coupling between local and non-local variables and transmits a subset of these observations to a central site. Another Bayesian network is learned at the central site using the data transmitted from the local sites. The local and central Bayesian networks are combined to obtain a collective Bayesian network that models the entire data. This technique is then suitably adapted to an online Bayesian learning technique, where the network parameters are updated sequentially based on new data from multiple streams. This approach is particularly suitable for mining applications with distributed sources of data streams in an environment with non-zero communication cost (e.g., wireless networks).

6. Summarization

Bulut and Singh [6] propose a novel technique to summarize a data stream incrementally. The summaries over the stream are computed at multiple resolutions, and together they induce a unique wavelet-based approximation tree. The resolution of approximations increases as we move from the root of the approximation tree down to its leaf nodes. The tree has space complexity O(log N), where N denotes the current size of the stream. The amortized processing cost for each new data value is O(1). These bounds are currently the best known for algorithms that work under a biased query model, where the most recent values are of greater interest. They also consider the scenario in which a central source site summarizes a data stream at multiple resolutions. The clients are distributed across the network and pose queries. The summaries computed at the central site are cached adaptively at the clients. The access pattern, i.e., reads and writes, over the stream results in multiple replication schemes at different resolutions. Each replication scheme expands as the corresponding read rate increases, and contracts as the corresponding write rate increases. This adaptive scheme minimizes the total communication cost and the number of inter-site messages. While the summarization process is centralized, it can potentially be used to summarize distributed streams at distributed sites by aggregating wavelet coefficients.



The problem of pattern discovery in a large number of co-evolving streams has attracted much attention in many domains. Papadimitriou et al introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Time-series) [40], a comprehensive approach to discover correlations that effectively and efficiently summarize large collections of streams. The approach uses very little memory, and both its memory requirements and processing time are independent of the stream length. It scales linearly with the number of streams and is adaptive and fully automatic. It dynamically detects changes (both gradual and sudden) in the input streams, and automatically determines the number of hidden variables. The correlations and hidden variables discovered have multiple uses. They provide a succinct summary to the user, they can help to do fast forecasting and detect outliers, and they facilitate interpolations and handling of missing values. While the algorithm is centralized, it targets multiple distributed streams. The approach can potentially be used to summarize streams arriving at distributed sites.

Babcock and Olston [3] study a useful class of queries that continuously report the k largest values obtained from distributed data streams ("top-k monitoring queries"), which are of particular interest because they can be used to reduce the overhead incurred while running other types of monitoring queries. They show that transmitting entire data streams is unnecessary to support these queries. They present an alternative approach that significantly reduces communication. In their approach, arithmetic constraints are maintained at remote stream sources to ensure that the most recently provided top-k answer remains valid to within a user-specified error tolerance. Distributed communication is only necessary on the occasion when constraints are violated.

7. Mining Distributed Data Streams in Resource Constrained Environments

Recently, there has been a lot of interest in environments that demand distributed stream mining where resources are constrained. For instance, in the sensor network domain, due to energy consumption constraints, excessive communication is undesirable. One can potentially perform more computation and less communication to perform the same task with reduced energy consumption. Consequently, in such scenarios, data mining algorithms (specifically clustering and classification) with tunable computation and communication requirements are needed [24, 39].

A similar set of problems has recently been looked at in the network intrusion detection community. Here, researchers have proposed to offload computation related to monitoring and intrusion detection onto the network interface card (NIC) [37] with the idea of enhancing reliability and reducing the constraints imposed on the host processing environment. Initial results in this domain convey the promise of this area, but there are several limiting criteria in current generation NICs (e.g., programming model, lack of floating point operations) that may be alleviated in next generation NICs.

Kargupta et al. present MobiMine [27], a system for intelligent analysis of time-critical data using a Personal Digital Assistant (PDA). The system monitors stock market data and signals interesting stock behavior to the user. Stocks are interesting if they may positively or negatively affect the stock portfolio of the user. Furthermore, to assist in the user's analysis, they transmit classification trees to the user's PDA using the Fourier spectrum-based approach presented earlier. As discussed previously, this Fourier spectrum-based representation is well suited to environments that have limited communication bandwidth.

The Vehicle Data Stream Mining System (VEDAS) [25] is a mobile and distributed data stream mining/monitoring application that taps into the continuous stream of data generated by most modern vehicles. It allows continuous on-board monitoring of the data streams generated by the moving vehicles, identifying the emerging patterns, and reporting them to a remote control center over a low-bandwidth wireless network connection. The system offers many possibilities such as real-time on-board health monitoring, drunk-driving detection, driver characterization, and security related applications for commercial fleet management. While there has been initial work in such constrained environments, we believe that there is still a lot to be done in this area.

8. Systems Support

A distributed stream mining system can be complex. It typically consists of several sub-components such as the mining algorithms, the communication sub-system, the resource manager, the scheduler, etc. A successful stream mining system must adapt to the dynamics of the data and best use the available set of resources and components. In this section, we will briefly summarize efforts that target the building of system support for resource-aware distributed processing of streams.

When processing continuous data streams, data arrival can be bursty, and the data rate may fluctuate over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. Babcock et al. [1] show that the choice of an operator scheduling strategy can have a significant impact on run-time system memory usage. When data streams are bursty, a poor choice of operator scheduling strategy can result in significantly higher run-time memory usage and poor performance. To minimize memory utilization at peak load, they present Chain scheduling, an adaptive, load-aware scheduling of query operators that minimizes resource consumption during times of peak load. This operator scheduling strategy for data stream systems is near-optimal in minimizing run-time memory usage for single-stream queries involving selections, projections, and foreign-key joins with stored relations. At peak load, the scheduling strategy selects an operator path (a set of consecutive operators) that is capable of processing and freeing the maximum amount of memory per unit time. This in effect results in the scheduling of operators that together are both selective and have a high aggregate tuple processing rate.
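The selection criterion at peak load can be illustrated with a small sketch. The operator paths, selectivities and processing times below are made-up examples; the snippet only shows the greedy rule of picking the path that frees the most memory per unit of processing time, not the full Chain scheduler of [1].

# Greedy illustration of the Chain idea: among candidate operator paths, pick
# the one with the largest memory reduction per unit processing time. Each path
# is described by the fraction of input it retains (selectivity) and the time it
# needs per tuple; both numbers are hypothetical.

def memory_release_rate(selectivity, time_per_tuple):
    # memory freed per tuple = (1 - selectivity); divide by the processing time
    return (1.0 - selectivity) / time_per_tuple

def pick_path(paths):
    """paths: dict name -> (selectivity, time_per_tuple). Returns the best path name."""
    return max(paths, key=lambda p: memory_release_rate(*paths[p]))

if __name__ == "__main__":
    candidate_paths = {
        "select->project":  (0.2, 0.004),   # very selective and cheap
        "join_with_stored": (0.6, 0.010),
        "project_only":     (0.9, 0.002),
    }
    print("schedule at peak load:", pick_path(candidate_paths))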

The aforementioned scheduling strategy is not targeted at the processing of distributed streams. Furthermore, using the Chain operator scheduling strategy has an adverse effect on response time and is not suitable for data mining applications that need to provide interactive performance even under peak load. In order to mine data streams, we need a scheduling strategy that supports both response time and memory-aware scheduling of operators. Furthermore, when scheduling a data stream mining application with dependent operators in a distributed setting, the scheduling scheme should not need to communicate a significant amount of state information. Ghoting and Parthasarathy [16] propose an adaptive operator scheduling technique for mining distributed data streams with response time guarantees and bounded memory utilization. The user can tune the application to the desired level of interactivity, thus facilitating the data mining process. They achieve this through a step-wise degradation in response time beginning from a schedule that is optimal in terms of response time. This sacrifice in response time is used towards optimal memory utilization. After an initial scheduling decision is made, changes in system state may force a reconsideration of operator schedules. The authors show that a decision as to whether a local state change will affect the global operator schedule can be made locally. Consequently, each local site can proceed independently, even under minor state changes, and a global assignment is triggered only when it is actually needed.

Plale considers the problem of efficient temporal-join processing in a distributed setting [41]. In this work, the author's goal is to optimize the join processing of event streams to efficiently determine sets of events that occur together. The size of the join window cannot be determined a priori, as this may lead to missed events. The author proposes to vary the size of the join window depending on the rate of the incoming stream. The rate of the incoming stream gives a good indication of how many previous events on the stream can be dropped. Reducing the window size also helps reduce memory utilization. Furthermore, instead of forwarding events into the query processing engine on a first-come first-served basis, the author proposes to forward the earliest event first to further improve performance, as this facilitates the earlier determination of events that are a part of the join result.
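Both ideas can be sketched together. The code below is an illustrative approximation (the scaling rule, constants and event format are assumptions, not the design of [41]): the join window shrinks as the stream rate grows, and buffered events are forwarded earliest first.

# Sketch of the two ideas attributed to [41]: (1) shrink the temporal-join window
# when the stream rate rises, and (2) forward the earliest buffered event first.
# The window-sizing rule and the event format are illustrative assumptions.

import heapq
import itertools

class RateAdaptiveJoinBuffer:
    def __init__(self, target_events=100):
        self.target_events = target_events    # rough memory budget per stream
        self.heap = []                         # (timestamp, seq, payload), earliest first
        self._seq = itertools.count()          # tie-breaker for equal timestamps

    def window_size(self, events_per_sec):
        # keep roughly target_events in the window: higher rate -> smaller window
        return self.target_events / max(events_per_sec, 1e-9)

    def add(self, timestamp, payload, events_per_sec):
        heapq.heappush(self.heap, (timestamp, next(self._seq), payload))
        horizon = timestamp - self.window_size(events_per_sec)
        while self.heap and self.heap[0][0] < horizon:
            heapq.heappop(self.heap)           # drop events outside the window

    def pop_earliest(self):
        if not self.heap:
            return None
        t, _, payload = heapq.heappop(self.heap)
        return (t, payload)

if __name__ == "__main__":
    buf = RateAdaptiveJoinBuffer(target_events=4)
    for t in range(10):
        buf.add(float(t), {"event": t}, events_per_sec=2.0)  # implies a 2-second window
    print("earliest event forwarded first:", buf.pop_earliest())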


Chen et al. present GATES [7], a middleware for processing distributed data streams. This middleware targets data stream processing in a grid setting and is built on top of the Open Grid Services Architecture. It provides a high level interface that allows one to specify a stream processing algorithm as a set of pipelined stages. One of the key design goals of GATES is to support self-adaptation under changing conditions. To support self-adaptation, the middleware changes one or more of the sampling rate, the summary structure size, or the algorithm used, based on changing conditions of the data stream. For instance, if the stream rate increases, the system reduces the sampling rate accordingly to maintain a real-time response. If we do not adapt the sampling rate, we could potentially face increasing queue sizes, resulting in poor performance. To support self-adaptation, the programmer needs to provide the middleware with parameters that allow it to tune the application at runtime. The middleware builds a simple performance model that allows it to predict how parameter changes help in performance adaptation in a distributed setting.
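The self-adaptation loop can be sketched generically. The controller below is a made-up stand-in for the performance model mentioned above (it is not the GATES API): it simply lowers the sampling rate when the input queue grows past a high-water mark and raises it again when the queue drains.

# Illustrative sketch of rate adaptation in a pipelined stage: the sampling rate
# is the tunable parameter, and a trivial controller adjusts it from the queue
# occupancy so that processing keeps up with the arriving stream.

import random

def adapt_sampling_rate(rate, queue_len, high_water=1000, low_water=100,
                        step=0.1, min_rate=0.05):
    """Return an adjusted sampling rate based on the current queue occupancy."""
    if queue_len > high_water:
        rate = max(min_rate, rate - step)    # stream is too fast: sample less
    elif queue_len < low_water:
        rate = min(1.0, rate + step)         # headroom available: sample more
    return rate

def process(queue, rate):
    """Consume one item, keeping it with probability `rate`."""
    item = queue.pop(0)
    return item if random.random() < rate else None

if __name__ == "__main__":
    random.seed(0)
    queue, rate, kept = list(range(2500)), 1.0, 0
    while queue:
        rate = adapt_sampling_rate(rate, len(queue))
        if process(queue, rate) is not None:
            kept += 1
    print("final sampling rate:", round(rate, 2), "items kept:", kept)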

Chi et al. [12] present a load shedding scheme for mining multiple data streams, although the computation is not distributed. They assume that the task of reading data from the stream and building feature values is computationally expensive and is the bottleneck. Their strategies decide on how to expend limited computation for building feature values for data on multiple streams. They decide on whether to drop a data item on the stream based on the historic utility of the items produced by the stream. If they choose not to build feature values for a data item, they simply predict feature values based on historical data. They use finite memory Markov chains to make such predictions. While the approach presented by the authors is centralized, load shedding decisions can be trivially distributed.
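A heavily simplified sketch of this decision follows. The utility update and the predictor (a running mean of recent values rather than the finite-memory Markov chains used in [12]) are illustrative stand-ins; only the overall shape of the decision is shown.

# Toy sketch of utility-driven load shedding: spend feature extraction on the
# streams whose items have historically been most useful; for the rest, predict
# the feature value from recent history instead of computing it.

from collections import deque

class StreamState:
    def __init__(self):
        self.utility = 1.0                 # exponentially decayed historic utility
        self.history = deque(maxlen=5)     # recent feature values used for prediction

def choose_streams_to_process(states, budget):
    """Pick the `budget` streams with the highest historic utility."""
    ranked = sorted(states, key=lambda s: states[s].utility, reverse=True)
    return set(ranked[:budget])

def handle_item(name, raw_item, states, selected, extract, decay=0.9):
    st = states[name]
    if name in selected:
        value = extract(raw_item)                                          # expensive feature build
        observed_useful = 1.0
    else:
        value = sum(st.history) / len(st.history) if st.history else 0.0   # predicted feature value
        observed_useful = 0.0
    st.history.append(value)
    st.utility = decay * st.utility + (1 - decay) * observed_useful
    return value

if __name__ == "__main__":
    states = {"a": StreamState(), "b": StreamState(), "c": StreamState()}
    selected = choose_streams_to_process(states, budget=2)
    for name, item in [("a", 3), ("b", 7), ("c", 5)]:
        print(name, "->", handle_item(name, item, states, selected, extract=lambda x: x * 2))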

9. Conclusions and Future Research Directions

In this chapter, we presented a summary of the current state-of-the-art in distributed data stream mining. Specifically, algorithms for outlier detection, clustering, frequent itemset mining, classification, and summarization were presented. Furthermore, we briefly described related applications and systems support for distributed stream mining. We conclude by outlining a few directions for future research.

First, the distributed sources of data that need to be mined are likely to span multiple organizations. Each of these organizations may have heterogeneous computing resources. Furthermore, the distributed data will be accessed by multiple analysts, each potentially desiring the execution of a different mining task. The various distributed stream mining systems that have been proposed to date do not take the variability in the tasks and computing resources into account. To facilitate execution and deployment in such settings, a plug and play system design that is cognizant of each organization's privacy is necessary. A framework in which services are built on top of each other will facilitate rapid application development for data mining. Furthermore, these systems will need to be integrated with existing data grid and knowledge grid infrastructures [9] and researchers will need to design middleware to support this integration.

Second, next generation computing systems for data mining are likely to be built using off-the-shelf CPUs connected using a high bandwidth interconnect. In order to derive high performance on such systems, stream mining algorithms may need to be redesigned. For instance, next generation processors are likely to have multiple cores on chip. As has been shown previously [15], data mining algorithms are adversely affected by the memory-wall problem. This problem will likely be exacerbated on future multi-core architectures. Therefore, stream mining algorithms at each local site will need to be redesigned to derive high performance on next generation architectures. Similarly, with innovations in networking technologies, designs that are cognizant of high performance interconnects (like Infiniband) will need to be investigated.

Third, as noted earlier, in many instances, environments that demand distributed stream mining are resource constrained. This in turn requires the development of data mining technology that is tailored to the specific execution environment. Various tradeoffs, e.g., energy vs. communication, communication vs. redundant computation, etc., must be evaluated on a scenario-by-scenario basis. Consequently, in such scenarios, data mining algorithms with tunable computation and communication requirements will need to be devised. While initial forays in this domain have been made, a systematic evaluation of the various design tradeoffs even for a single application domain has not been done. Looking further into the future, it will be interesting to evaluate whether, based on specific solutions, a more abstract set of interfaces can be developed for a host of application domains.

Fourth, new applications for distributed data stream mining are on the horizon. For example, RFID (radio frequency identification) technology is expected to significantly improve the efficiency of business processes by allowing automatic capture and identification. RFID chips are expected to be embedded in a variety of devices, and the captured data will likely be ubiquitous in the near future. New applications for these distributed streaming data sets will arise and application specific data mining technology will need to be designed.

Finally, over the past few years, several stream mining algorithms have been proposed in the literature. While they are capable of operating in a centralized setting, many are not capable of operating in a distributed setting and cannot be trivially extended to do so. In order to obtain exact or approximate (bounded) results in a distributed setting, the amount of state information that needs to be exchanged is usually excessive. To facilitate distributed stream mining algorithm design, instead of starting from a centralized solution, one needs to start with a distributed mind-set right from the beginning. Statistics or summaries that can be efficiently maintained in a distributed and incremental setting should be designed and then specific solutions that use these statistics should be devised. Such a design strategy will facilitate distributed stream mining algorithm design.

References

[1] B. Babcock, S. Babu, M. Datar, and R. Motwani. Chain: Operator scheduling for memory minimization in data stream systems. In Proceedings of the International Conference on Management of Data (SIGMOD), 2003.

[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the Symposium on Principles of Database Systems (PODS), 2002.

[3] B. Babcock and C. Olston. Distributed top-k monitoring. In Proceedings of the International Conference on Management of Data (SIGMOD), 2003.

[4] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley and Sons, 1994.

[5] J. Beringer and E. Hullermeier. Online clustering of parallel data streams. Data and Knowledge Engineering, 2005.

[6] A. Bulut and A. Singh. SWAT: Hierarchical stream summarization in large networks. In Proceedings of the International Conference on Data Engineering (ICDE), 2003.

[7] L. Chen, K. Reddy, and G. Agrawal. GATES: A grid-based middleware for processing distributed data streams. In Proceedings of the International Symposium on High Performance Distributed Computing (HPDC), 2004.

[8] R. Chen, D. Sivakumar, and H. Kargupta. An approach to online Bayesian network learning from multiple data streams. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery, 2001.

[9] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke. The data grid: Towards an architecture for the distributed management and analysis of large scientific data sets, 2001.

[10] D. Cheung, J. Han, V. Ng, and C. Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In Proceedings of the International Conference on Data Engineering (ICDE), 1996.

[11] D. Cheung, S. Lee, and B. Kao. A general incremental technique for maintaining discovered association rules. In Proceedings of the International Conference on Database Systems for Advanced Applications, 1997.


[12] Y. Chi, P. Yu, H. Wang, and R. Muntz. Loadstar: A load shedding scheme for classifying data streams. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2005.

[13] M. Ester, H. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 1998.

[14] V. Ganti, J. Gehrke, and Raghu Ramakrishnan. Demon: Data evolution and monitoring. In Proceedings of the International Conference on Data Engineering (ICDE), 2000.

[15] A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y. Chen, and P. Dubey. A characterization of data mining algorithms on a modern processor. In Proceedings of the ACM SIGMOD Workshop on Data Management on New Hardware, 2005.

[16] A. Ghoting and S. Parthasarathy. Facilitating interactive distributed data stream processing and mining. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.

[17] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the Symposium on Foundations of Computer Science (FOCS), 2000.

[18] Yi-An Huang and Wenke Lee. A cooperative intrusion detection system for ad hoc networks. In Proceedings of the 1st ACM Workshop on Security of Ad Hoc and Sensor Networks, 2003.

[19] G. Hulten and P. Domingos. Mining high speed data streams. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2000.

[20] Sotiris Ioannidis, Angelos D. Keromytis, Steven M. Bellovin, and Jonathan M. Smith. Implementing a distributed firewall. In ACM Conference on Computer and Communications Security, 2000.

[21] E. Januzaj, H. Kriegel, and M. Pfeifle. DBDC: Density based distributed clustering. In Proceedings of the International Conference on Extending Data Base Technology (EDBT), 2004.

[22] R. Jin and G. Agrawal. Efficient decision tree construction on streaming data. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2003.

[23] M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2002.

[24] H. Kargupta. Distributed data mining for sensor networks. In Tutorial presented at ECML/PKDD, 2004.


[25] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa, and D. Handy. VEDAS: A mobile and distributed data stream mining system for real-time vehicle monitoring. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2004.

[26] H. Kargupta and B. Park. A Fourier spectrum based approach to represent decision trees for mining data streams in mobile environments. IEEE Transactions on Knowledge and Data Engineering, 2004.

[27] H. Kargupta, B. Park, S. Pittie, L. Liu, D. Kushraj, and K. Sarkar. MobiMine: Monitoring the stock market from a PDA. SIGKDD Explorations, 2002.

[28] Hillol Kargupta, Souptik Datta, Qi Wang, and Krishnamoorthy Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Proceedings of the International Conference on Data Mining (ICDM), 2003.

[29] E. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 1998.

[30] S. Lee and D. Cheung. Maintenance of discovered association rules: When to update? In Proceedings of the Workshop on Research Issues in Data Mining and Knowledge Discovery, 1997.

[31] Wenke Lee, Rahul A. Nimbalkar, Kam K. Yee, Sunil B. Patil, Pragneshkumar H. Desai, Thuan T. Tran, and Salvatore J. Stolfo. A data mining and CIDF based approach for detecting novel and distributed intrusions. Lecture Notes in Computer Science, 2000.

[32] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. Lecture Notes in Computer Science, 1880:36, 2000.

[33] M. Locasto, J. Parekh, S. Stolfo, A. Keromytis, T. Malkin, and V. Misra. Collaborative distributed intrusion detection. Technical report, Columbia University, 2004.

[34] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (recently) frequent items in distributed data streams. In Proceedings of the International Conference on Data Engineering (ICDE), 2005.

[35] G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2002.

[36] M. Otey, A. Ghoting, and S. Parthasarathy. Fast distributed outlier detection in mixed attribute data sets. Data Mining and Knowledge Discovery Journal, 2006.


[37] M. Otey, S. Parthasarathy, A. Ghoting, G. Li, S. Narravula, and D. Panda. Towards NIC-based intrusion detection. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2003.

[38] M. Otey, C. Wang, S. Parthasarathy, A. Veloso, and W. Meira. Mining frequent itemsets in distributed and dynamic databases. In Proceedings of the International Conference on Data Mining (ICDM), 2003.

[39] T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. Distributed deviation detection in sensor networks. SIGMOD Record, 2003.

[40] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time series. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005.

[41] B. Plale. Learning run time knowledge about event rates to improve memory utilization in wide area stream filtering. In Proceedings of the International Symposium on High Performance Distributed Computing (HPDC), 2002.

[42] P. Porras and P. Neumann. EMERALD: Event monitoring enabling responses to anomalous live disturbances. In Proceedings of the National Information Systems Security Conference, 1997.

[43] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka. An efficient algorithm for the incremental updation of association rules in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD), 1997.

[44] A. Veloso, W. Meira Jr., M. B. De Carvalho, B. Possas, S. Parthasarathy, and M. J. Zaki. Mining frequent itemsets in evolving databases. In Proceedings of the SIAM International Conference on Data Mining, 2002.

[45] Wei Yu, Dong Xuan, and Wei Zhao. Middleware-based approach for preventing distributed denial of service attacks. In Proceedings of the IEEE Military Communications Conference, 2002.

[46] Yongguang Zhang and Wenke Lee. Intrusion detection in wireless ad-hoc networks. In Mobile Computing and Networking, 2000.

[47] Yongguang Zhang, Wenke Lee, and Yi-An Huang. Intrusion detection techniques for mobile wireless networks. Wireless Networking, 9(5):545–556, 2003.


Chapter 14

ALGORITHMS FOR DISTRIBUTED DATA STREAM MINING

Kanishka Bhaduri
Dept of CSEE
University of Maryland, Baltimore County

[email protected]

Kamalika Das
Dept of CSEE
University of Maryland, Baltimore County

[email protected]

Krishnamoorthy Sivakumar
School of EECS
Washington State University

[email protected]

Hillol Kargupta
Dept of CSEE
University of Maryland, Baltimore County

[email protected]

Ran Wolff
Dept of CSEE
University of Maryland, Baltimore County

[email protected]


Rong Chen
School of EECS
Washington State University

[email protected]

Abstract

The field of Distributed Data Mining (DDM) deals with the problem of analyzing data by paying careful attention to the distributed computing, storage, communication, and human-factor related resources. Unlike the traditional centralized systems, DDM offers a fundamentally distributed solution to analyze data without necessarily demanding collection of the data to a single central site. This chapter presents an introduction to distributed data mining for continuous streams. It focuses on the situations where the data observed at different locations change with time. The chapter provides an exposure to the literature and illustrates the behavior of this class of algorithms by exploring two very different types of techniques: one for the peer-to-peer and another for the hierarchical distributed environment. The chapter also briefly discusses several different applications of these algorithms.

1. Introduction

A data stream is often viewed as a single-source time-varying signal observed at a single receiver [13]. A data stream can be viewed as an unbounded sequence (x_1, x_2, ..., x_n) that is indexed on the basis of the time of its arrival at the receiver. Babcock et al. [4] point out some of the fundamental properties of a data stream system, such as:

the data elements arrive continuously,

there is no limit on the total number of points in the data stream,

and the system has no control over the order in which the data elements arrive.

In some applications, the data also arrive in bursts. In other words, the source occasionally generates the data at a very high rate compared to the rate used for the rest of the time. Since data arrives continuously, fast one-pass algorithms are imperative for real-time query processing and data mining on streams. Many data stream algorithms have been developed over the last decade for processing and mining data streams that arrive at a single location or at multiple locations whereby they are sent to one location for processing needs. We refer to this scenario as the centralized data stream mining scenario. Examples of such algorithms include query processing [24], change detection [1][14][6], classification [3][15] and clustering [11][2]. These algorithms, however, are not applicable in settings where the data, computing, and other resources are distributed and cannot or should not be centralized for a variety of reasons, e.g., low bandwidth, security, privacy issues, and load balancing [5][17][23]. In many cases the cost of centralizing the data can be prohibitive and the owners may have privacy constraints on their data. In order to meet the challenges imposed by these constraints, a new area of data mining has emerged in the last ten years known as Distributed Data Mining (DDM) [16]. Distributed Data Mining (DDM) deals with the problem of analyzing data by paying careful attention to the distributed computing, storage, communication, and human-factor related resources. Unlike the traditional centralized systems, DDM offers a fundamentally distributed solution to analyze data without necessarily demanding collection of the data to a single central site. In this chapter we will primarily consider loosely-coupled distributed environments where each site has a private memory; the sites can operate independently and communicate by message passing over an asynchronous network. These algorithms focus on distributed computation with due attention to minimizing resource usage (e.g., communication cost) and satisfying application-level constraints (e.g., privacy protection). We focus on a subset of problems of DDM, namely Distributed Data Stream Mining, where not only is the data distributed, but it is also non-stationary and arriving in the form of multiple streams. These algorithms pose unique challenges of their own: they need to be efficient in computing the task, work with local data, compute the data mining model incrementally, and possibly communicate with a subset of their peer nodes to compute the result.

This chapter offers two things: (1) an overview of the existing techniques for addressing some of the stream mining problems in the distributed scenario, and (2) a more detailed discussion of two very different distributed data stream mining algorithms in order to illustrate how these algorithms work.

The chapter is organized as follows. In Section 2 we point out the rationale for the importance of the main topic of our discussion, Distributed Data Stream Mining. Section 3 discusses some of the related papers in the area. In Section 4 we present a local algorithm for data mining in a dynamic and peer-to-peer environment. Section 5 discusses a distributed Bayesian network learning algorithm. We conclude the chapter in Section 6.

2. Motivation: Why Distributed Data Stream Mining?

This section presents a few examples for illustrating the need for distributed data stream mining algorithms.

Example 1: Consider a distributed sensor network scenario where there are a number of sensors deployed in a field. Each of these sensors measures different entities, e.g., temperature, pressure, vibration, etc. Each of these sensors has limited battery power, and so developing algorithms with low communication overhead is a must in such a setting. If the task is to monitor a global state of the system, one way would be to centralize all the data at each predetermined time instance and build a model. With this approach there are two major problems:

due to low battery power of the sensors and low bandwidth, this is prohibitive in a typical sensor network setting, and

the model represented in such a periodic update scheme can be wrong for a large portion of time before a new model is built (for example, this can happen if the data distribution changes immediately after each periodic update).

Distributed algorithms can avoid these problems.

Example 2: Another example comes from the cross-domain network intrusion monitoring paradigm. In the network traffic monitoring or intrusion detection problem, the goal is to identify inappropriate, incorrect, or anomalous activities that can potentially become a threat to the network. Existing methods of network traffic monitoring/intrusion detection often require centralizing the sample network packets from diverse locations and analyzing them. This is prohibitive in many real-world situations such as in cross-domain intrusion detection schemes where different parties or companies collaborate to find the anomalous patterns. Hence, the system must be equipped with privacy preserving data mining algorithms so that the patterns can be computed and shared across the sites without sharing the privacy-sensitive data. Also, since the data arrives in the form of continuous streams (e.g., TCP/IP packets), transferring the data to a single (trusted) location at each time instance and analyzing them is also not feasible. This calls for the development of distributed privacy preserving data mining algorithms capable of being deployed in streaming environments. The Department of Homeland Security (DHS) is currently funding a number of projects in this related area (see "http://www.agnik.com/DHSSBIR.html" for such a project).

In the next section we present a survey of some of the existing distributed data stream mining algorithms.

3. Existing Distributed Data Stream Mining Algorithms

There exists a plethora of work in the area of distributed data stream mining. The existing literature provides an excellent starting point for our main topic of discussion in this chapter. Not only have the distributed data mining and databases communities contributed to the literature, but a bulk of the work also comes from the wireless and sensor networks community. In this section we discuss some of the related papers with pointers for further reading.

Computation of complex functions over the union of multiple streams has been studied widely in the stream mining literature. Gibbons et al. [10] present the idea of doing coordinated sampling in order to compute simple functions such as the total number of ones in the union of two binary streams. They have developed a new sampling strategy to sample from the two streams and have shown that their sampling strategy can reduce the space requirement for such a computation from Ω(n) to log(n), where n is the size of the stream. Their technique can easily be extended to the scenario where there are more than two streams. The authors also point out that this method would work even if the stream is non-binary (with no change in space complexity).

Much work has been done in the area of query processing on distributed data streams. Chen et al. [7] have developed a system, NiagaraCQ, which allows answering continuous queries in large scale systems such as the Internet. In such systems many of the queries are similar, so a lot of computation, communication and I/O resources can be saved by properly grouping the similar queries; NiagaraCQ achieves this goal. Their grouping scheme is incremental, and they use an adaptive regrouping scheme in order to find the optimal match between a new query and the group in which the query should be placed. If none of these match, then a new query group is formed with this query. The paper does not talk about reassignment of the existing queries into the newly formed groups, but rather leaves it as future work.

A different approach has been described in [23]. The distributed model described there has nodes sending streaming data to a central node which is responsible for answering the queries. The network links near the central node become a bottleneck as soon as the arrival rate of data becomes too high. In order to avoid that, the authors propose installing filters which restrict the data transfer rate from the individual nodes. Node O installs a filter of width W_O and of range [L_O, H_O]. The range is centered around the most recent value V of the object (L_O = V − W_O/2 and H_O = V + W_O/2). The node does not send updates as long as V stays inside the range L_O ≤ V ≤ H_O; otherwise it sends an update to the central node and recenters the bounds L_O and H_O. This technique provides answers to queries approximately and works in circumstances where we do not require the exact answer to the queries. Since in many cases the user can provide the query precision that is necessary, the filters can be made to work after setting the bounds based on this user input.
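The filter behavior at a source node can be sketched directly from the description above; the width and the example values are arbitrary.

# Minimal sketch of the filter idea from [23]: a node suppresses updates while the
# observed value stays within its filter bounds, and recenters the bounds around
# the new value when it reports. Variable names follow the text (W, L, H, V).

class SourceFilter:
    def __init__(self, width, initial_value=0.0):
        self.W = width
        self.recenter(initial_value)

    def recenter(self, v):
        self.L, self.H = v - self.W / 2.0, v + self.W / 2.0

    def observe(self, v):
        """Return the update to send to the central node, or None to stay silent."""
        if self.L <= v <= self.H:
            return None                 # inside [L, H]: no communication
        self.recenter(v)                # constraint violated: report and recenter
        return v

if __name__ == "__main__":
    f = SourceFilter(width=10.0, initial_value=50.0)
    for v in [52.0, 54.9, 56.0, 57.0, 63.0]:
        msg = f.observe(v)
        print(v, "->", "send" if msg is not None else "suppress")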

The sensor network community provides a rich literature on streaming algorithms. Since the sensors are deployed in hostile terrains, one of the most fundamental tasks aims at developing a general framework for monitoring the network itself. A similar idea has been presented in [27]. This paper presents a general framework and shows how decomposable functions like min, max, average, count and sum can be computed over such an architecture. The architecture is highlighted by three tools that the authors call digests, scans and dumps. Digests are the network parameters (e.g., count of the number of nodes) that are computed either continuously, periodically or in the event of a trigger. Scans are invoked when the digests report a problem (e.g., a sudden drop in the number of nodes) to find out the energy level throughout the network. These two steps can guide a network administrator towards the location of the fault, which can be debugged using the dumps (dumping all the data of a single or a few of the sensors). The other thing that this paper talks about is the distributed computation of some aggregate functions (mean, max, count, etc.). Since all these functions are decomposable, the advantage is in-network aggregation of partial results up a tree. The leaf does not need to send all its data to the root, and in this way vital savings can be achieved in terms of communication. The major concern, though, is maintaining this tree structure in such a hostile and dynamic environment. Also, this technique would fail for numerous non-decomposable functions, e.g., median, quantile, etc.

The above algorithm describes a way of monitoring the status of the sensor network itself. There are many data mining problems that need to be addressed in the sensor network scenario. Such an algorithm for multi-target classification in sensor networks has been developed by Kotecha et al. [17]. Each node makes local decisions and these decisions are forwarded to a single node which acts as the manager node. The maximum number of targets is known a priori, although the exact number of targets is not known in advance. Nodes that are sufficiently far apart are expected to provide independent feature vectors for the same target, which can strengthen the global decision making. Moreover, for an optimal classifier, the number of decisions increases exponentially with the number of targets. Hence the authors propose the use of sub-optimal linear classifiers. Through real life experiments they show that their sub-optimal classifiers perform as well as the optimal classifier under mild assumptions. This makes such a scheme attractive for low power, low bandwidth environments.

Frequent items mining in distributed streams is an active area of research [22]. There are many variants of the problem that have been proposed in the literature (refer to [22] for a description). Generally speaking, there are m streams S_1, S_2, ..., S_m. Each stream consists of items with time stamps <d_i1, t_i1>, <d_i2, t_i2>, etc. Let S be the sequence preserving union of all the streams. An item i ∈ S has a count count(i) (the count may be evaluated by an exponential decay weighting scheme). The task is to output an estimate of count(i) for every item i whose frequency exceeds a certain threshold. Each node maintains a precision threshold and outputs only those items exceeding the precision threshold. As two extreme cases, the threshold can be set very low (≈ 0) or very high (≈ 1). In the first case, all the intermediate nodes will send everything without pruning, resulting in a message explosion at the root. In the second case, the intermediate nodes will send a low number of items and hence no further pruning would be possible at the intermediate nodes. So the precision selection problem is crucial for such an algorithm to produce meaningful results with low communication overhead. The paper presents a number of ways to select the precision values (they call them precision gradients) for different scenarios of load minimization.
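A toy sketch of the per-node behavior follows; the exponential decay factor and the way the precision threshold is chosen are illustrative assumptions, not the precision-gradient selection analyzed in [22].

# Toy sketch of the precision-threshold idea: a node keeps exponentially decayed
# counts of the items it sees and forwards up the aggregation tree only the items
# whose decayed count exceeds its local precision threshold.

class NodeCounter:
    def __init__(self, precision, decay=0.99):
        self.precision = precision      # local threshold below which items are pruned
        self.decay = decay
        self.counts = {}

    def observe(self, item):
        # decay all counts, then credit the new item
        for k in list(self.counts):
            self.counts[k] *= self.decay
            if self.counts[k] < 1e-6:
                del self.counts[k]
        self.counts[item] = self.counts.get(item, 0.0) + 1.0

    def forward(self):
        """Items (with estimated counts) sent to the parent node."""
        return {k: c for k, c in self.counts.items() if c > self.precision}

if __name__ == "__main__":
    node = NodeCounter(precision=2.0)
    for item in ["x", "y", "x", "x", "z", "x", "y"]:
        node.observe(item)
    print(node.forward())   # only items whose decayed count exceeds the threshold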

In the next two sections we focus our attention on two particular distributed data stream mining algorithms. The first algorithm that we present in Section 4 works in large scale distributed (p2p) systems. The Bayesian network learning algorithm (Section 5) can be used to learn a Bayesian network model when the data is distributed and arriving in the form of data streams.

4. A local algorithm for distributed data stream mining

Having presented the general paradigm for distributed data stream mining and some existing algorithms, we now shift our attention to two specific algorithms that have been designed for the distributed streaming environment. The algorithm presented in this section works in a large-scale and dynamic environment, e.g., a peer-to-peer environment, sensor networks and the like. While this algorithm is truly asynchronous and guarantees an eventually correct result, it is restricted to the set of problems where we can define a threshold within which we want our result. As an example, let us assume that there are N peers in a network and each peer has a bit b_i (0 or 1). The task is to find out if ∑_{i=1}^{N} b_i > ǫ, where ǫ is a global parameter. There exists an algorithm [26] in the literature that can do this in the streaming scenario whereby each peer needs to contact only a subset of the nodes in its neighborhood. The algorithm that we discuss in this section shares the same philosophy, although it can perform a different set of data mining tasks. Going back to our example, if the problem is modified a little bit and we want to find out the value of ∑_{i=1}^{N} b_i exactly, to the best of the authors' knowledge there exists no algorithm that can do this without collecting all the data. Hence the second algorithm that we have selected for our discussion builds a model incrementally. It first fits a model to the local data and determines the fitness of the model. If the model is not good enough, a sample of its data is taken and, along with its neighbors' data, used to build a more complex model (details in Section 5).

4.1 Local Algorithms: definition

Before we formally start the discussion of our algorithm, let us define what we mean by local algorithms, since this is the term that will be used throughout the rest of this section.

Local algorithms are ones whose resource consumption is sometimes independent of system size. That is, an algorithm for a given problem is local if, assuming a static network and data, there exists a constant c such that for any size of the system N there are instances of the problem such that the time, CPU, memory, and communication requirements per peer are smaller than c. Therefore, the most appealing property of local algorithms is their extreme scalability. Local algorithms have been presented in the context of graph algorithms [21][19], and recently for data mining in peer-to-peer networks [26][18][25]. Local algorithms guarantee eventual correctness: when the computation terminates, each peer computes the same result it would have computed given the entire data. The main advantage of this definition is that there already exist algorithms that follow this definition of locality. On the other hand, this definition is disadvantageous in the sense that we do not specify precisely how many instances of the problem we need (that can be solved locally) in order to deem an algorithm local.

There are certain characteristics that typify local algorithms. These include, but are not limited to, the ability of each node to compute the result using information gathered from just a few nearby neighbors, the ability to calculate the result in-network rather than collect all of the data at a central processor (which would quickly exhaust bandwidth), and the ability to locally prune redundant or duplicate computations. Needless to say, these characteristics are extremely important in most large-scale distributed applications.

We are now in a position to present a local Distributed Data Stream Mining algorithm.

4.2 Algorithm details

The algorithm that we are going to discuss in this section can be used for monitoring a data model in the distributed and streaming environment. As already mentioned, this algorithm requires a threshold predicate. It also assumes that a tree topology has been laid over the network structure. Before we present the details of the algorithm, let us present an overview. Let X be a dataset which is the union of several data streams. Let f[X] be a function that we want to compute on X. Since it is not always feasible to store and process the entire stream, we select a subset of the stream in a given time frame. We denote this by X_t. Our main task is to monitor whether f[X_t] > ǫ, where ǫ is a user defined threshold. For example, if f is the average, we can bound the average of a set of vectors within this threshold. The details of this algorithm can be found in [25].

In the actual algorithm description, let P_1, P_2, ..., P_n be n peers connected by an arbitrary communication tree such that the set of P_i's neighbors, N_i, is known to P_i. Each peer is supplied a stream of points from R^d, where d is the dimension of the problem. The local average of the points at time t is denoted by S_i^t. Each peer maintains the following vectors (each of these vectors is weighted; we omit this in our discussion here for clarity and simplicity):

X_i: the current estimate of the global average X_N (known as the knowledge of P_i)

X_i,j: the last vector sent by peer P_i to P_j


X_j,i: the last vector received by peer P_i from P_j

X_i∩j: the agreement of P_i and P_j (calculated from X_i,j and X_j,i)

X_i\j: the kept knowledge of P_i and P_j (calculated from X_i and X_i∩j)

Initially X_i is initialized to S_i^t. The first time data is exchanged between P_i and P_j, it sets the vectors X_i,j and X_j,i. Whenever new points arrive, a neighbor is removed or a new neighbor comes in, S_i^t changes. This triggers changes in the local vectors X_i, X_i\j and/or X_i∩j. The global data mining problem is to find out if ||∑_{i=1}^{n} X_i / n|| = ||X_N|| < ǫ. We will now present different conditions on X_i, X_i∩j and X_i\j that will allow us to decide if ||X_N|| < ǫ or ||X_N|| > ǫ. There are two main lemmas that can be used in order to determine this. The first lemma states that if for every peer P_i and each neighbor P_j of P_i it holds that both ||X_i∩j|| < ǫ and ||X_i\j|| < ǫ, then ||X_N|| < ǫ. The second lemma states that given d unit vectors u_1 ... u_d, if for a specific one of them, u, every peer P_i and each of its neighbors P_j have u · X_i∩j ≥ ǫ and either u · X_i\j ≥ ǫ or u · X_i = u · X_i∩j, then ||X_N|| ≥ ǫ.

Given these two lemmas, we can now describe the algorithm that decides whether the L2 norm of the average vector X_N is within a threshold or not. For simplicity we will describe the conditions assuming R^2; the extension to higher dimensions is obvious. For a vector X_i = (x_a, y_a) ∈ R^2, the L2-norm is given by √(x_a^2 + y_a^2). Now if we want to test whether the L2-norm is greater than ǫ or not, we want to check if the following condition is true: √(x_a^2 + y_a^2) > ǫ, which is, in essence, checking whether ||X_i|| lies inside or outside a circle of radius ǫ. Consider Figure 14.1. The circle shown in the figure is a circle of radius ǫ.

The algorithm needs to check if ||X_i|| is inside or outside the circle. In order to do this, it approximates the circle with a set of tangent lines as shown in the figure. The problem of checking inside the circle is relatively simpler: if a peer determines that its local ||X_i|| < ǫ and for every neighbor ||X_i\j|| < ǫ and ||X_i∩j|| < ǫ, then by the first lemma the peer knows that ||X_N|| < ǫ as well. Hence the local estimate ||X_i|| of ||X_N|| is indeed the correct one and thus no message needs to be exchanged. If, on the other hand, a peer violates this condition, it needs to send a message to each violating neighbor.

Figure 14.1. (A) The area inside an ǫ circle. (B) Seven evenly spaced vectors u_1 ... u_7. (C) The borders of the seven halfspaces u_i · x ≥ ǫ define a polygon in which the circle is circumscribed. (D) The area between the circle and the union of half-spaces.

If, on the other hand, a peer P_i determines that ||X_i|| is outside the polygon, it needs to check the conditions of the second lemma. Note that even if two peers simply say that ||X_i|| > ǫ, the average can still be less than ǫ (when the two X_i's are in opposite directions). So we need to check the conditions based on the tangent lines. P_i needs to find the first tangent plane such that a neighbor P_j claims that the point is outside the polygon. More precisely, it needs to check if there exists some u* such that u* · X_i∩j ≥ ǫ. Now if P_i enters a state whereby ||X_i|| > ǫ, it does not need to send a message to its neighbors if they support its claim. Formally speaking, P_i needs to send a message to P_j whenever u* · X_i∩j < ǫ or u* · X_i\j < ǫ, and not otherwise. The last case is when X_i lies in the region between the circle and the polygon (the peer would find u* to be nil). In that case the peer has to resort to flooding. Also note that this area can be made arbitrarily smaller using more tangent lines; the trade-off is an increased computation for each peer.
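The per-peer test can be sketched compactly, assuming the agreement and kept-knowledge vectors for each neighbor are already maintained by the message-exchange protocol; the sketch mirrors only the conditions of the two lemmas in R^2 and omits the messaging itself.

# Sketch of the per-peer decision in R^2: check the "inside the circle" condition
# of the first lemma; otherwise look for a tangent direction u (one of d evenly
# spaced unit vectors) along which all agreement/kept-knowledge vectors lie beyond
# the threshold, as required by the second lemma. `agreements[j]` and `kept[j]`
# are assumed to correspond to the same neighbor j.

import math

def norm(v):
    return math.sqrt(v[0] ** 2 + v[1] ** 2)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def unit_vectors(d):
    return [(math.cos(2 * math.pi * k / d), math.sin(2 * math.pi * k / d))
            for k in range(d)]

def peer_decision(X_i, agreements, kept, eps, d=7):
    """Return 'below', 'above', or 'undecided' (the peer must then communicate)."""
    # First lemma: everything inside the eps-circle -> ||X_N|| < eps is locally provable.
    if norm(X_i) < eps and all(norm(a) < eps for a in agreements) \
            and all(norm(k) < eps for k in kept):
        return "below"
    # Second lemma: some direction u puts every vector in the half-space u . x >= eps.
    for u in unit_vectors(d):
        if all(dot(u, a) >= eps for a in agreements) and \
           all(dot(u, k) >= eps or abs(dot(u, X_i) - dot(u, a)) < 1e-12
               for k, a in zip(kept, agreements)):
            return "above"
    return "undecided"   # X_i falls between the circle and the polygon: communicate

if __name__ == "__main__":
    eps = 1.0
    print(peer_decision((0.3, 0.2), [(0.4, 0.1)], [(0.2, 0.3)], eps))   # below
    print(peer_decision((2.0, 0.1), [(1.8, 0.2)], [(1.9, 0.0)], eps))   # above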

4.3 Experimental results

The above algorithm exhibits excellent accuracy and scalability. For each experiment we introduce a new data distribution at every predefined number of simulator ticks. We start with a particular data distribution and choose ǫ such that the average vector is within the ǫ-range. As shown in Figure 14.2, the lower set of bars shows the number of peers that report that the average vector is less than ǫ. After a fixed number of simulator ticks we change the data distribution, and now the ||X_i||'s are no longer inside the circle, so we have the upper set of bars reporting the case when ||X_i|| is outside the circle. As we see from the figure, the accuracy remains almost constant. Similarly, Figure 14.3 shows the cost of the simple L2 algorithm. In order to eliminate a message explosion, we have used a leaky bucket mechanism and a peer is not allowed to send more than one message per leaky bucket duration. Our unit of measurement is messages per peer per unit leaky bucket time (L). It can be easily seen that the number of messages per unit time remains constant; this typifies local algorithms, as they are highly scalable.

Figure 14.2. Quality of the algorithm with increasing number of nodes (x-axis: number of peers; y-axis: percentage of correct peers, shown separately for ||X_N|| < ǫ and ||X_N|| > ǫ).

Figure 14.3. Cost of the algorithm with increasing number of nodes (x-axis: number of peers; y-axis: messages per peer per L, for the stationary period and for the transitional and stationary period).


4.4 Modifications and extensions

We call the above operation mode of the algorithm the 'open loop mode'. It is useful for bounding the average vector within a threshold. If we 'close the loop' we can use the above simple algorithm to monitor data mining results (e.g., eigenstates of the data, k-means of the data, etc.). Generally, we need to define our X_i appropriately (we provide examples later) and the open loop algorithm would raise a flag if ||X_i|| > ǫ. We can potentially use this flag to collect statistics of the data and build a new model. This model can then be shipped to the peers and this process can go on. If the data stops changing, the guarantee of the L2 algorithm is that ||X_i|| > ǫ for every peer iff ||X_N|| > ǫ. Hence for long durations of stationary period we would expect to see local behavior of the algorithm. Only when the data distribution changes would the alert flag be raised. We can then collect a sample of the data from the network and use standard aggregation techniques such as convergecast to propagate the data up the tree to build a new model. As a simple example we show how this algorithm can be used to monitor the eigenstates of the data. We define X_i as follows:

X_i = A × S_i^t − θ × S_i^t

where A and θ are the principal eigenvector and the eigenvalue of the data. A change in the data distribution (S_i^t) would change X_i and set off an alert flag. This change might trigger a series of events:

P_i will communicate with its neighbors and try to solve the problem using the open loop algorithm

If the alert flag has been there for a long time, P_i sends its data to its parent in the tree. If the alert flag is not there any more, it is considered to be a false alarm

If a peer receives data from all of its children, it sends the data to its parent. The root, after receiving the data from everybody, computes the new eigenstates and notifies all the peers about this new model

The above algorithm is what we call the 'closed loop' algorithm. Since the model is built solely on a best-effort basis, it may be the case that the model is no longer good enough once it reaches all the peers. All that the algorithm will do is to restart the process once again and build a more up-to-date model. A similar algorithm for monitoring the k-means can be described by simply changing X_i to S_i^t − c_i, where c_i is the current centroid of the global data.
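The closed-loop idea can be sketched generically for the k-means variant just mentioned; the alert-persistence rule and the model rebuild step are placeholders for the convergecast-based rebuild described above.

# Generic closed-loop sketch: the deployed model defines the monitoring vector
# (here the k-means variant X_i = S_i^t - c_i); the open-loop test raises an alert
# when ||X_i|| exceeds eps, and a persistent alert triggers a placeholder model
# rebuild from fresh data. The patience constant is illustrative.

import math

def monitoring_vector(local_avg, centroid):
    return [a - c for a, c in zip(local_avg, centroid)]

def alert(x_i, eps):
    return math.sqrt(sum(v * v for v in x_i)) > eps

def closed_loop(stream_of_local_avgs, centroid, eps, patience=3):
    alert_ticks = 0
    for local_avg in stream_of_local_avgs:
        if alert(monitoring_vector(local_avg, centroid), eps):
            alert_ticks += 1
            if alert_ticks >= patience:              # not a false alarm: rebuild
                centroid = list(local_avg)           # placeholder for convergecast + recompute
                alert_ticks = 0
                print("model rebuilt, new centroid:", centroid)
        else:
            alert_ticks = 0
    return centroid

if __name__ == "__main__":
    avgs = [[1.0, 1.1], [1.0, 0.9], [3.0, 3.2], [3.1, 3.0], [3.0, 2.9], [3.1, 3.1]]
    closed_loop(avgs, centroid=[1.0, 1.0], eps=0.5)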

In the next section we present an algorithm that can be used for learning a model from the data in the streaming scenario.


5. Bayesian Network Learning from Distributed Data Streams

This section discusses an algorithm for Bayesian model learning. In many applications the goal is to build a model that represents the data. In the previous section we saw how such a model can be built when the system is provided with a threshold predicate. If, however, we want to build an exact global model, development of local algorithms sometimes becomes very difficult, if not impossible. In this section we draw the attention of the reader to a class of problems which need global information to build a data model (e.g., k-means, Bayesian networks, etc.). The crux of these types of algorithms lies in building a local model, identifying the goodness of the model, and then coordinating with a central site to update the model based on global information. We describe here a technique to learn a Bayesian network in a distributed setting.

A Bayesian network is an important tool to model probabilistic or imperfect relationships among problem variables. It gives useful information about the mutual dependencies among the features in the application domain. Such information can be used for gaining better understanding about the dynamics of the process under observation. It is thus a promising tool to model customer usage patterns in web data mining applications, where specific user preferences can be modeled in terms of conditional probabilities associated with the different features. Since we will shortly show how this model can be built on streaming data, it can potentially be applied to learn Bayesian classifiers in distributed settings. But before we delve into the details of the algorithm, we present what a Bayesian network (or Bayes' net, or BN in short) is, and the distributed Bayesian learning algorithm assuming a static data distribution.

A Bayesian network (BN) is a probabilistic graph model. It can be defined as a pair (G, p), where G = (V, E) is a directed acyclic graph (DAG). Here, V is the node set which represents variables in the problem domain and E is the edge set which denotes probabilistic relationships among the variables. For a variable X ∈ V, a parent of X is a node from which there exists a directed link to X. Figure 14.4 is a BN called the ASIA model (adapted from [20]). The variables are Dyspnoea, Tuberculosis, Lung cancer, Bronchitis, Asia, X-ray, Either, and Smoking. They are all binary variables. The joint probability distribution of the set of variables in V can be written as a product of conditional probabilities as follows:

P(V) = ∏_{X ∈ V} P(X | pa(X)).    (14.1)

In Equation (14.1), pa(X) denotes the set of parents of node X. The set of conditional distributions P(X | pa(X)), X ∈ V, are called the parameters of a Bayesian network. If variable X has no parents, then P(X | pa(X)) = P(X) is the marginal distribution of X.
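Equation (14.1) translates directly into code. The two-variable network below is a made-up toy (not the ASIA model); each node stores its conditional probability table, and the joint probability of a complete assignment is the product over all nodes.

# Evaluate Equation (14.1) on a toy two-variable network: Smoking -> LungCancer.
# Each node stores P(X | pa(X)) as a table keyed by (value, parent values); the
# joint probability of a full assignment is the product over all nodes.
# The structure and numbers are made up for illustration only.

network = {
    "Smoking":    {"parents": [], "cpt": {(True, ()): 0.3, (False, ()): 0.7}},
    "LungCancer": {"parents": ["Smoking"],
                   "cpt": {(True,  (True,)):  0.1,  (False, (True,)):  0.9,
                           (True,  (False,)): 0.01, (False, (False,)): 0.99}},
}

def joint_probability(assignment, net):
    """P(V) = prod over X of P(X | pa(X)) for a complete assignment of all variables."""
    p = 1.0
    for var, spec in net.items():
        parent_vals = tuple(assignment[q] for q in spec["parents"])
        p *= spec["cpt"][(assignment[var], parent_vals)]
    return p

if __name__ == "__main__":
    print(joint_probability({"Smoking": True, "LungCancer": True}, network))   # 0.3 * 0.1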


Figure 14.4. ASIA Model

Two important issues in using a Bayesian network are: (a) learning a Bayesian network and (b) probabilistic inference. Learning a BN involves learning the structure of the network (the directed graph), and obtaining the conditional probabilities (parameters) associated with the network. Once a Bayesian network is constructed, we usually need to determine various probabilities of interest from the model. This process is referred to as probabilistic inference.

In the following, we discuss a collective approach to learning a Bayesian network that is specifically designed for a distributed data scenario.

5.1 Distributed Bayesian Network Learning Algorithm

The primary steps in our approach are:
(a) Learn local BNs (local models) involving the variables observed at each site based on the local data set.
(b) At each site, based on the local BN, identify the observations that are most likely to be evidence of coupling between local and non-local variables. Transmit a subset of these observations to a central site.
(c) At the central site, a limited number of observations of all the variables are now available. Use these to learn a non-local BN consisting of links between variables across two or more sites.
(d) Combine the local models with the links discovered at the central site to obtain a collective BN.

The non-local BN thus constructed would be effective in identifying associations between variables across sites, whereas the local BNs would detect associations among local variables at each site. The conditional probabilities can also be estimated in a similar manner. Those probabilities that involve only variables from a single site can be estimated locally, whereas the ones that involve variables from different sites can be estimated at the central site. The same methodology can be used to update the network based on new data. First, the new data is tested for how well it fits with the local model. If there is an acceptable statistical fit, the observation is used to update the local conditional probability estimates. Otherwise, it is also transmitted to the central site to update the appropriate conditional probabilities (of cross terms). Finally, a collective BN can be obtained by taking the union of nodes and edges of the local BNs and the non-local BN and using the conditional probabilities from the appropriate BNs. Probabilistic inference can now be performed based on this collective BN. Note that transmitting the local BNs to the central site would involve significantly lower communication as compared to transmitting the local data.

It is quite evident that learning probabilistic relationships between variables that belong to a single local site is straightforward and does not pose any additional difficulty as compared to a centralized approach (this may not be true for an arbitrary Bayesian network structure; a detailed discussion of this issue can be found in [9]). The important objective is to correctly identify the coupling between variables that belong to two (or more) sites. These correspond to the edges in the graph that connect variables between two sites and the conditional probability(ies) at the associated node(s). In the following, we describe our approach to selecting observations at the local sites that are most likely to be evidence of strong coupling between variables at two different sites. The key idea of our approach is that the samples that do not fit well with the local models are likely to be evidence of coupling between local and non-local variables. We transmit these samples to a central site and use them to learn a collective Bayesian network.

5.2 Selection of samples for transmission to global site

For simplicity, we will assume that the data is distributed between two sites and will illustrate the approach using the BN in Figure 14.4. The extension of this approach to more than two sites is straightforward. Let us denote by A and B the variables in the left and right groups, respectively, in Figure 14.4. We assume that the observations for A are available at site A, whereas the observations for B are available at a different site B. Furthermore, we assume that there is a common feature ("key" or index) that can be used to associate a given observation at site A with a corresponding observation at site B. Naturally, V = A ∪ B.


At each local site, a local Bayesian network can be learned using only the samples at that site. This gives a BN structure involving only the local variables at each site and the associated conditional probabilities. Let p_A(.) and p_B(.) denote the estimated probability functions involving the local variables. Each is the product of the conditional probabilities as indicated by Equation (14.1). Since p_A(x) and p_B(x) denote the probability or likelihood of obtaining observation x at sites A and B, we call these probability functions the likelihood functions l_A(.) and l_B(.) for the local models obtained at sites A and B, respectively. The observations at each site are ranked based on how well they fit the local model, using the local likelihood functions. The observations at site A with large likelihood under l_A(.) are evidence of "local relationships" between site A variables, whereas those with low likelihood under l_A(.) are possible evidence of "cross relationships" between variables across sites. Let S_A denote the set of keys associated with the latter observations (those with low likelihood under l_A(.)). In practice, this step can be implemented in different ways. For example, we can set a threshold ρ_A and, if l_A(x) ≤ ρ_A, then x ∈ S_A. Sites A and B transmit the sets of keys S_A and S_B, respectively, to a central site, where the intersection S = S_A ∩ S_B is computed. The observations corresponding to the set of keys in S are then obtained from each of the local sites by the central site.
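A minimal sketch of this selection step, under the assumptions of this section: the likelihood arguments stand in for the local likelihood functions l_A(.) and l_B(.) (products of local conditional probabilities), observations are keyed by the common "key" feature, and the key sets are intersected at the central site before the full rows are requested. The function names are hypothetical.

def low_likelihood_keys(observations, likelihood, rho):
    """Return the keys of observations that fit the local model poorly.

    observations: {key: local feature values}; likelihood: the local likelihood
    function; rho: the site's threshold (e.g. rho_A)."""
    return {key for key, x in observations.items() if likelihood(x) <= rho}

def select_for_transmission(obs_a, obs_b, likelihood_a, likelihood_b, rho_a, rho_b):
    s_a = low_likelihood_keys(obs_a, likelihood_a, rho_a)   # computed at site A
    s_b = low_likelihood_keys(obs_b, likelihood_b, rho_b)   # computed at site B
    s = s_a & s_b                                           # intersection S at the central site
    # The central site then pulls the full rows for the selected keys from both sites.
    return [{**obs_a[k], **obs_b[k]} for k in s]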

In a sense, our approach to learning the cross terms in the BN involves a selective sampling of the given dataset that is most relevant to the identification of coupling between the sites. This is a type of importance sampling, where we select the observations that have high conditional probabilities corresponding to the terms involving variables from both sites. Naturally, when the values of the different variables (features) from the different sites, corresponding to these selected observations, are pooled together at the central site, we can learn the coupling links as well as estimate the associated conditional distributions. These selected observations will, by design, not be useful to identify the links in the BN that are local to the individual sites.

Having discussed in detail the distributed Bayesian learning algorithm (assuming static data), we can now proceed with our discussion of how this algorithm can be modified to work with evolving data.

5.3 Online Distributed Bayesian Network Learning

The proposed collective approach to learning a BN is well suited for a scenario with multiple data streams. Suppose we have an existing BN model, which has to be constantly updated based on new data from multiple streams. For simplicity, we will consider only the problem of updating the BN parameters, assuming that the network structure is known. As in the case of batch mode learning, we shall use techniques for online updating of BN parameters


for centralized data. In the centralized case, there exist simple techniques for parameter updating for commonly used models like the unrestricted multinomial model. For example, let us denote by p_ijl = Pr(x_i = l | pa_{x_i} = j) the conditional probability at node i, given the parents of node i. We can then obtain the estimate p_ijl(k+1) of p_ijl at step k+1 as follows (see [12, Section 5]):

p_{ijl}(k+1) = \frac{\alpha_{ijl}(k) + N_{ijl}(k+1)}{\alpha_{ij}(k) + N_{ij}(k+1)}    (14.2)

where \alpha_{ij}(k) = \sum_l \alpha_{ijl}(k) and N_{ij}(k+1) = \sum_l N_{ijl}(k+1). In Equation (14.2), N_ijl(k+1) denotes the number of observations in the dataset obtained at time k+1 for which x_i = l and pa_{x_i} = j, and we can set α_ijl(k+1) = α_ijl(k) + N_ijl(k+1). Note that the N_ijl(k) are a set of sufficient statistics for the data observed at time k.
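The update in Equation (14.2) amounts to simple bookkeeping over counts. The sketch below assumes the usual Dirichlet/multinomial setup, with alpha[i][j][l] holding the current pseudo-counts α_ijl(k) and counts[i][j][l] holding N_ijl(k+1) for the batch received at step k+1; the nested-dictionary data structures are an assumption chosen for illustration.

def update_parameters(alpha, counts):
    """Return updated p_ijl and alpha_ijl after absorbing the new counts."""
    p, new_alpha = {}, {}
    for i in alpha:
        p[i], new_alpha[i] = {}, {}
        for j in alpha[i]:
            # Denominator of Equation (14.2): alpha_ij(k) + N_ij(k+1).
            denom = sum(alpha[i][j].values()) + sum(counts[i][j].values())
            p[i][j], new_alpha[i][j] = {}, {}
            for l in alpha[i][j]:
                num = alpha[i][j][l] + counts[i][j][l]
                p[i][j][l] = num / denom       # Equation (14.2)
                new_alpha[i][j][l] = num       # alpha_ijl(k+1) = alpha_ijl(k) + N_ijl(k+1)
    return p, new_alpha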

For the online distributed case, parameters for local terms can be updated using the same technique as in the centralized case. Next, we need to update the parameters for the cross-links, without transmitting all the data to a central site. Again, we choose the samples with low likelihood at the local sites and transmit them to a central site. These are then used to update the cross-terms at the central site. We can summarize our approach by the following steps:

1 Learn an initial collective Bayesian network from the first dataset observed (unless a prior model is already given). Thus we have a local BN at each site and a set of cross-terms at the central site.

2 At each step k:

Update the local BN parameters at each site using Equation (14.2).

Update the likelihood threshold at each local site, based on the sample mean value of the observed likelihoods. This is the threshold used to determine whether a sample is to be transmitted to a central site (see Section 5.2).

Transmit the low-likelihood samples to a central site.

Update the parameters of the cross-terms at the central site.

Combine the updated local terms and cross-terms to get an updated collective Bayesian network.

3 Increment k and repeat step (2) for the next set of data.

This concludes our discussion of the distributed streaming Bayesian learning algorithm. In the following, we point out some experimental verification of the proposed algorithm.


5.4 Experimental Results

We tested our approach on two different datasets. A small real web log dataset was used for batch mode distributed Bayesian learning; this was used to test both structure and parameter learning. We also tested our online distributed learning approach on a simulated web log dataset. Extensive examples of batch mode learning (using both real and simulated web log data), demonstrating scalability with respect to the number of distributed sites, have been presented elsewhere [8, 9]. In the following, we present our results for BN parameter learning using online data streams.

We illustrate the results of online BN parameter learning assuming the network structure is known. We use the model shown in Figure 14.5. The 32 nodes in the network are distributed among four different sites. Nodes 1, 5, 10, 15, 16, 22, 23, 24, 30, and 31 are in site A. Nodes 2, 6, 7, 11, 17, 18, 25, 26, and 32 are in site B. Nodes 3, 8, 12, 19, 20, and 27 are in site C. Nodes 4, 9, 13, 14, 21, 28, and 29 are in site D. A dataset with 80,000 observations was generated. We assumed that at each step k, 5,000 observations of the data are available (for a total of 16 steps).

We denote by B_be the Bayesian network obtained by using all 80,000 samples in batch mode (the data is still distributed across the four sites). We denote by B_ol(k) the Bayesian network obtained at step k using our online learning approach, and by B_ba(k) the Bayesian network obtained using regular batch mode learning, but using only the data observed up to time k. We choose three typical cross terms (nodes 12, 27, and 28) and compute the KL distance between the conditional probabilities to evaluate the performance of the online distributed method. The results are depicted in Figure 14.6.

Figure 14.6 (left) shows the KL distance between the conditional probabilities of the networks B_ol(k) and B_be for the three nodes. We can see that the performance of the online distributed method is good, with the error (in terms of KL distance) dropping rapidly. Figure 14.6 (right) shows the KL distance between the conditional probabilities of the networks B_ba(k) and B_ol(k) for the three nodes. We can see that the performance of a network learned using our online distributed method is comparable to that learned using a batch mode method with the same data.

6. Conclusion

In this chapter we have surveyed the field of distributed data stream mining. We have presented a brief survey of the field and discussed some of the distributed data stream algorithms, along with their strengths and weaknesses. Naturally, we have elucidated only one slice through this field - the main topic of our discussion in this


Figure 14.5. Bayesian network for online distributed parameter learning

chapter was algorithms for distributed data stream mining. Many important areas, such as system development, human-computer interaction, visualization techniques and the like in the distributed and streaming environment, were left untouched due to lack of space and limited literature in those areas.

We have also discussed in greater detail two specific distributed data stream mining algorithms. In the process, we wanted to draw the attention of the readers to an emerging area of distributed data stream mining, namely data stream mining in large-scale peer-to-peer networks. We encourage the reader to explore distributed data stream mining in general. All the fields - algorithm development, systems development, and techniques for human-computer interaction - are still at a very early stage of development. On an ending note, the area of distributed data stream mining offers plenty of room for development, both for the pragmatically and the theoretically inclined.

Acknowledgments

The authors thank the U.S. National Science Foundation for support through grants IIS-0329143, IIS-0350533, and CAREER award IIS-0093353, and NASA for support under Cooperative Agreement NCC 2-1252. The authors would also like to thank Dr. Chris Gianella for his valuable input.


Figure 14.6. Simulation results for online Bayesian learning (x-axis: step k; y-axis: KL distance): (left) KL distance between the conditional probabilities of the networks B_ol(k) and B_be for three nodes (12, 27, 28); (right) KL distance between the conditional probabilities of the networks B_ol(k) and B_ba(k) for the same three nodes


References

[1] C. Aggarwal. A framework for diagnosing changes in evolving data streams. In ACM SIGMOD'03 International Conference on Management of Data, 2003.

[2] C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for clustering evolving data streams. In VLDB Conference, 2003.

[3] C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On demand classification of data streams. In KDD, 2004.

[4] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Principles of Database Systems (PODS'02), 2002.

[5] B. Babcock and C. Olston. Distributed top-k monitoring. In ACM SIGMOD'03 International Conference on Management of Data, 2003.

[6] S. Ben-David, J. Gehrke, and D. Kifer. Detecting change in data streams. In VLDB Conference, 2004.

[7] J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for Internet databases. In ACM SIGMOD'00 International Conference on Management of Data, 2000.

[8] R. Chen, K. Sivakumar, and H. Kargupta. An approach to online Bayesian learning from multiple data streams. In Proceedings of the Workshop on Ubiquitous Data Mining (5th European Conference on Principles and Practice of Knowledge Discovery in Databases), Freiburg, Germany, September 2001.

[9] R. Chen, K. Sivakumar, and H. Kargupta. Collective mining of Bayesian networks from distributed heterogeneous data. Knowledge and Information Systems, 6:164–187, 2004.


[10] P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In ACM Symposium on Parallel Algorithms and Architectures, 2001.

[11] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In IEEE Symposium on FOCS, 2000.

[12] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, 1995.

[13] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report TR-1998-011, Compaq Systems Research Center, 1998.

[14] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In SIGKDD, 2001.

[15] R. Jin and G. Agrawal. Efficient decision tree construction on streaming data. In SIGKDD, 2003.

[16] H. Kargupta and K. Sivakumar. Existential Pleasures of Distributed Data Mining. In Data Mining: Next Generation Challenges and Future Directions. AAAI/MIT Press, 2004.

[17] J. Kotecha, V. Ramachandran, and A. Sayeed. Distributed multi-target classification in wireless sensor networks. IEEE Journal of Selected Areas in Communications (Special Issue on Self-Organizing Distributed Collaborative Sensor Networks), 2003.

[18] D. Krivitski, A. Schuster, and R. Wolff. A local facility location algorithm for sensor networks. In Proc. of DCOSS'05, 2005.

[19] S. Kutten and D. Peleg. Fault-local distributed mending. In Proc. of the ACM Symposium on Principles of Distributed Computing (PODC), pages 20–27, Ottawa, Canada, August 1995.

[20] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50:157–224, 1988.

[21] N. Linial. Locality in distributed graph algorithms. SIAM Journal of Computing, 21:193–201, 1992.

[22] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (recently) frequent items in distributed data streams. In International Conference on Data Engineering (ICDE'05), 2005.


[23] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In ACM SIGMOD'03 International Conference on Management of Data, 2003.

[24] J. Widom and R. Motwani. Query processing, resource management, and approximation in a data stream management system. In CIDR, 2003.

[25] R. Wolff, K. Bhaduri, and H. Kargupta. Local L2 thresholding based data mining in peer-to-peer systems. In Proceedings of the SIAM International Conference on Data Mining (SDM), Bethesda, Maryland, 2006.

[26] R. Wolff and A. Schuster. Association rule mining in peer-to-peer systems. In Proceedings of ICDM'03, Melbourne, Florida, 2003.

[27] J. Zhao, R. Govindan, and D. Estrin. Computing aggregates for monitoring wireless sensor networks. In Proceedings of the First IEEE International Workshop on Sensor Network Protocols and Applications, 2003.


Chapter 15

A SURVEY OF STREAM PROCESSING PROBLEMS AND TECHNIQUES IN SENSOR NETWORKS

Sharmila Subramaniam, Dimitrios Gunopulos
Computer Science and Engineering Dept.
University of California at Riverside
Riverside, CA 92521


Abstract Sensor networks comprise small, low-powered and low-cost sensing devices that are distributed over a field to monitor a phenomenon of interest. The sensor nodes are capable of communicating their readings, typically through wireless radio. Sensor nodes produce streams of data that have to be processed in situ, by the node itself, or transmitted through the network and analyzed offline. In this chapter we describe recently proposed, efficient distributed techniques for processing streams of data collected with a network of sensors.

Keywords: Sensor Systems, Stream Processing, Query Processing, Compression, Tracking

Introduction

Sensor networks are systems of tiny, low-powered and low-cost devices distributed over a field to sense, process and communicate information about their environment. The sensor nodes in these systems are capable of sensing a phenomenon and communicating the readings through wireless radio. The memory and computational capabilities of the nodes enable in situ processing of the observations. Since the nodes can be deployed at random, and can be used to collect information about inaccessible remote domains, they are considered very valuable and attractive tools for many research and industrial applications. Motes are one example of sensor devices, developed by UC Berkeley and manufactured by Crossbow Technology Inc. [13].


Sensor observations form streams of data that are either processed in situ or communicated across the network and analyzed offline. Examples of systems producing streams of data include environmental and climatological monitoring ([39]), where phenomena such as temperature, pressure and humidity are measured periodically (at various granularities). Sensors deployed in buildings and bridges relay measurements of vibrations, linear deviation, etc., for monitoring structural integrity ([51]). Seismic measurements, habitat monitoring ([7, 38]), data from GPS-enabled devices such as cars and phones, and surveillance data are further examples. Surveillance systems may include sophisticated sensors equipped with cameras and UAVs, but nevertheless they produce streams of video or streams of events. In some applications, raw data is processed in the nodes to detect events, defined as some suitable function on the data, and only the streams of events are communicated across the network.

The focus of this chapter is to describe recently proposed, efficient distributed techniques for processing streams of data that are collected from a sensor network.

1. Challenges

Typically, a large number of sensor nodes are distributed over wide areas, and each sensor continuously produces a large amount of data as observations. For example, about 10,000 traffic sensors are deployed on California highways to report traffic status continuously. The energy source for the nodes is either AA batteries or solar panels, which are typically characterized by a limited supply of power. In most applications, communication is considered the factor requiring the largest amount of energy, compared to sensing ([41]). The longevity of the sensor nodes is therefore drastically reduced when they communicate raw measurements to a centralized server for analysis. Consequently, data aggregation, data compression, modeling and online querying techniques need to be applied in situ or in-network to reduce communication across the network. Furthermore, limitations of computational power and inaccuracy and bias in the sensor readings necessitate efficient data processing algorithms for sensor systems.

In addition, sensor nodes are prone to failures and aberrant behaviors, which could severely affect network connectivity and data accuracy. Algorithms proposed for data collection, processing and querying in sensor systems are required to be robust and fault-tolerant. Network delay in sensor systems is yet another problem to cope with in real-time applications.

The last decade has seen significant advancement in the development of algorithms and systems that are energy aware and scalable with respect to networking, sensing, communication and processing. In the following, we


describe some of the interesting problems in data processing in sensor networks and give a brief overview of the techniques proposed.

2. The Data Collection Model

We assume a data collection model where the set of sensors deployed over a field communicate over a wireless ad-hoc network. This scenario is typically applicable when the sensors are small and many. The sensors are deployed quickly, leaving no or little time for a wired installation. Nevertheless, there are also many important applications where expensive sensors are manually installed. A typical example is a camera-based surveillance system, where wired networks can also be used for data collection.

3. Data Communication

The basic issue in handling streams from sensors is to transmit them, either as raw measurements or in a compressed form. For example, the following query necessitates transmission of temperature measurements from a set of sensors to the user, over the wireless network:

"Return the temperature measurements of all the sensors in the subregion R every 10s, for the next 60 minutes."

Typically, the data communication direction is from multiple sensor nodes to a single sink node. Moreover, since the streams of measurements observed by the sensors are of a common phenomenon, we observe redundancy in the data communicated. For example, consider the temperature-collection task above, posed from the sink node to the system.

Due to the above characteristics, along with the limited availability of power, the end-to-end communication protocols available for mobile ad-hoc networks are not applicable to sensor systems. The research community has therefore proposed data aggregation as the solution, wherein data from multiple sources are combined and processed within the network to eliminate redundancy and routed through the path that reduces the number of transmissions.

Energy-aware sensing and routing has been a topic of interest over recent years to extend the lifetime of the nodes in the network. Most of the approaches create a hierarchical network organization, which is then used for routing of queries and for communication between the sensors. [27] proposed a cluster-based approach known as LEACH for energy-efficient data transmission. Cluster-head nodes collect the streaming data from the other sensors in the cluster and apply signal processing functions to compress the data into a single signal. As illustrated in Figure 15.1, cluster heads are chosen at random and the sensors join the nearest cluster head. Now, a sensor communicates its stream data to the corresponding cluster head, which in turn takes the responsibility of communicating it to the sink (possibly after compressing).
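A minimal sketch of the cluster-assignment step just described (not the full LEACH protocol): cluster heads are picked at random and every remaining sensor joins the nearest head. The coordinate representation and the number of heads are assumptions made for illustration.

import math, random

def assign_clusters(sensors, num_heads, seed=0):
    """sensors: {node_id: (x, y)}.  Returns {head_id: [member node ids]}."""
    rng = random.Random(seed)
    heads = rng.sample(list(sensors), num_heads)     # heads chosen at random
    clusters = {h: [] for h in heads}
    for node, pos in sensors.items():
        if node in clusters:
            continue                                  # heads do not join a cluster
        nearest = min(heads, key=lambda h: math.dist(pos, sensors[h]))
        clusters[nearest].append(node)                # join the nearest cluster head
    return clusters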


Figure 15.1. An instance of dynamic cluster assignment in a sensor system according to the LEACH protocol. Sensor nodes of the same cluster are shown with the same symbol and the cluster heads are marked with highlighted symbols.

Figure 15.2. Interest propagation, gradient setup and path reinforcement for data propagation in the directed-diffusion paradigm. An event is described in terms of attribute-value pairs. The figure illustrates an event detected based on the location of the node and target detection.

A different approach is the Directed Diffusion paradigm proposed by [28], which follows a data-centric approach for routing data from sources to the sink sensor. Directed diffusion uses a publish-subscribe approach where the inquirer (say, the sink sensor) expresses an interest using attribute values, and the sources that can serve the interest reply with data (Figure 15.2). As the data is propagated toward the sink, the intermediate sensors cache the data to prevent loops and eliminate duplicate messages.

Among the many research works with the goal of energy-aware routing, the Geographic Adaptive Fidelity (GAF) approach proposed by [49] conserves energy from the point of view of communication by turning off the radios of some of the sensor nodes when they are redundant. The system is divided into


Figure 15.3. Sensors aggregating the result for a MAX query in-network

a virtual grid using geographic location information, and only one sensor per cell in the grid is activated to route packets. ASCENT by [8] and STEM by [44] are examples of other available sensor topology control schemes.

Data aggregation is also an integral part of query processing. The results of the posed queries are communicated across the network with in-network aggregation. We discuss this further in Section 4.

4. Query Processing

In the common query frameworks followed in sensor systems, data collection

is driven by declarative queries (examples of such systems include COUGAR ([54]) and TAG ([35])). Users pose declarative queries over the data generated by the sensors. For example, the SQL-like query corresponding to the example task discussed in Section 3 is as follows.

SELECT S.temperature
FROM Sensor S
WHERE S.loc IN R
DURATION 3600s
EVERY 10s

The simplest scheme for evaluating such queries on sensor data is to transmit all the data to a centralized database, which can then be used for answering user queries. To improve on this scheme, [6], [34] and [53] suggested the concept of viewing a sensor system as a distributed database and proposed incorporating a query layer in the system. The sensors in the system were now query-aware, which paved the way for the following. Firstly, the sensors communicated the measurements on demand, i.e., if they satisfied the predicate of the query. Secondly, in-network processing of query results was now possible in query-aware sensors.


A direct implementation of distributed database techniques is not viable in sensor systems due to their communication and computation constraints. Therefore, [6] and [34] studied the characteristics and challenges of sensor databases and the requirements of a good query plan for such systems. The COUGAR framework proposed by [53] aimed at reducing energy usage by generating efficient query plans for in-network query processing. The authors proposed a general framework for in-network query processing (in [54]) where the query is decomposed into flow blocks and a leader node from a set of coordinated nodes collects the query results of a block.

4.1 Aggregate Queries

One class of queries that has received interest in sensor systems is aggregate queries. Recently, various techniques have been proposed to efficiently process aggregate queries such as MIN, COUNT and AVG in sensor systems while reducing power consumption. An example of a simple query would be:

"Return the maximum of the temperature measurements obtained from all sensors located within coordinates [0, 100, 100, 0]."

The properties of the aggregate functions enable distributed processing of partial data in-network, which can be combined to produce results for the posed queries. Such optimizations reduce the energy consumption of query processing by orders of magnitude. For example, Figure 15.3 shows a routing paradigm where the MAX of the observations at the sensors is evaluated efficiently by computing the MAX of different groups of sensors and communicating only the results to the sink sensor.

One of the first tools for processing aggregate queries in sensor systems, in a distributed and efficient manner, is TAG, presented by [35]. In TAG, queries posed by users are propagated from a base station into the network, piggybacking on the existing network protocol. Aggregate results are communicated back to the base station up a spanning tree, with each sensor combining its result with the results obtained from its children. Later, [25] studied the implementation of the TAG framework for various sensing applications such as sensor data summarization, vehicle tracking and topographic mapping.
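A minimal sketch of this in-network aggregation pattern for a MAX query: each node combines its own reading with the partial results of its children and forwards a single value to its parent, so a node transmits one value rather than every raw reading below it. The tree structure and readings below are made up for illustration.

def aggregate_max(node, children, readings):
    """children: {node: [child nodes]}; readings: {node: sensed value}."""
    partial = readings[node]
    for child in children.get(node, []):
        partial = max(partial, aggregate_max(child, children, readings))
    return partial  # only this single value is transmitted to the parent

# Example: the sink learns the network-wide maximum from its direct children only.
children = {"sink": ["a", "b"], "a": ["c", "d"], "b": []}
readings = {"sink": 10, "a": 8, "b": 13, "c": 15, "d": 9}
print(aggregate_max("sink", children, readings))   # prints 15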

Various improvements and application-specific modifications have been proposed recently based on the above query-tree framework suggested for data aggregation and query processing. We give a brief overview of some of these techniques in the following.

Extending the query-tree framework and the work by [40], [14] presented a framework for in-network data aggregation to evaluate aggregate queries in error-tolerant applications. According to this framework, the nodes of the query tree apply their error filters to the partial aggregates of their subtrees and suppress messages from being communicated to the sink (see Figure 15.4).


Figure 15.4. Error filter assignments in a tree topology. The shaded nodes are the passive nodes that take part only in routing the measurements. A sensor communicates a measurement only if it lies outside the interval of values specified by E_i, i.e., the maximum permitted error at the node. A sensor that receives partial results from its children aggregates the results and communicates them to its parent after checking against the error interval.

Figure 15.5. Usage of duplicate-insensitive sketches to allow result propagation to multiple parents, providing fault tolerance. The system is divided into levels during the query propagation phase. Partial results from a higher level (level 2 in the figure) are received at more than one node in the lower level (level 1 in the figure).

In addition, the potential gain of increasing the error threshold of a node is estimated statistically, to guide the allocation of error filters. In another extension, [45] illustrate that the energy consumption of the TAG and COUGAR frameworks can be reduced further through group-aware network configuration, where sensors belonging to the same group are clustered along the same path, and by suppressing transmissions of measurements with temporal coherency.
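A minimal sketch of the error-filter idea: a node forwards a new partial aggregate only if it differs from the last value it reported by more than its assigned threshold E_i; otherwise the message is suppressed. The class and method names are hypothetical, not the paper's interface.

class ErrorFilter:
    def __init__(self, threshold):
        self.threshold = threshold      # maximum permitted error E_i at this node
        self.last_reported = None

    def maybe_report(self, partial_aggregate):
        """Return the value to transmit upstream, or None if the update is suppressed."""
        if (self.last_reported is None
                or abs(partial_aggregate - self.last_reported) > self.threshold):
            self.last_reported = partial_aggregate
            return partial_aggregate
        return None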


Focusing on duplicate-sensitive queries such as SUM and COUNT, [12] proposed a scalable algorithm that is fault-tolerant to sensor failure and computes approximate answers for the aggregate queries. The duplicate-insensitive sketches used in this work allow communication of results to multiple parents, as illustrated in Figure 15.5, to provide fault tolerance. Recently, [47] extended the class of queries supported to include quantiles, the distribution of data in the form of histograms, and most frequent items.

4.2 Join Queries

Certain applications, such as tracking and monitoring a moving object, require the execution of join queries over data streams produced at different sensors. For example, consider a query of the form

"Return the objects that were detected in both regions R1 and R2."

To evaluate the query, streams of observations from the sensors in regions

R1 and R2 are joined to determine if an object was spotted in both regions. The applications of join queries in sensor networks were first discussed in the

COUGAR framework ([54]), which suggested making an informed decision, based on the selectivity of the join operator, about whether to compute the join results in-network. Studying join queries in detail, [5] proposed a method for effective join operator placement. In this, the authors assume long-running queries and propose a technique where the sensors continuously refine the placement of the join operator so as to minimize data transmissions over the network. However, the method is restricted to processing queries over pairs of sensors. Recently, [37] proposed REED, an extension of TinyDB for multi-predicate join queries, which can efficiently handle join queries over multiple sensors and joins of sensor data with external tables.

A non-blocking form of join processing is the sliding time window join, where a time window over the timestamps of the tuples is given as a constraint, in addition to the join conditions. This is studied as a natural way to handle joins on infinite streams such as those from sensors. For example,

"Return the objects that were detected by both sensors S1 and S2, where window(S1, S2) = w"

poses the additional constraint that the timestamps of the values from S1 and S2 should be within a window w of time from each other to satisfy the query predicate. [23] studied various forms of such window join queries and proposed backward and forward evaluation algorithms, BEW-join and FEW-join, for executing them. However, the algorithms proposed can potentially produce an unordered stream of tuples as a result. The authors address this problem (in [24]) and propose and analyze several methods to provide in-order execution of join queries over sensor streams.
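A minimal sketch of a sliding time-window join over two sensor streams: a pair of tuples joins if the object identifiers match and their timestamps are within w of each other. This is a generic symmetric window join, not the BEW-join/FEW-join algorithms of [23]; tuples are assumed to be (timestamp, object id) pairs arriving in timestamp order within each stream.

from collections import deque

def window_join(stream1, stream2, w):
    """Return (object, t1, t2) triples with matching object ids and |t1 - t2| <= w."""
    buf1, buf2, results = deque(), deque(), []
    events = sorted([(t, 1, o) for t, o in stream1] + [(t, 2, o) for t, o in stream2])
    for t, source, obj in events:
        # Evict buffered tuples that can no longer satisfy the window constraint.
        while buf1 and t - buf1[0][0] > w:
            buf1.popleft()
        while buf2 and t - buf2[0][0] > w:
            buf2.popleft()
        other = buf2 if source == 1 else buf1
        results.extend((obj, t, t2) for t2, o2 in other if o2 == obj)
        (buf1 if source == 1 else buf2).append((t, obj))
    return results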


4.3 Top-k Monitoring

Another interesting class of problems is to report the k highest ranked answers to a given query. An example of a top-k query in a sensor system would be:

"Which sensors have reported the highest average temperature readings over the past month?"

The general problem of monitoring top-k values from streams that are produced at distributed locations is discussed by [3]. The authors propose a technique in which arithmetic constraints are maintained at the stream sources to ensure the validity of the most recently communicated top-k answers. The approach provides answers within a user-specified error tolerance and reduces overall communication between the sources. A different approach to top-k queries in sensor systems following a hierarchical topology is TJA (Threshold Join Algorithm), proposed by [57]. The algorithm consists of an initial phase that sets a lower bound for the top-k results in the hierarchies, followed by a join phase that collects the candidate sets in a bottom-up manner. With a fixed number of round trips, in-network processing and fewer readings communicated to the sink, the method conserves energy and reduces the delay in answering the query.

Recently, [48] proposed a technique to answer top-k queries approximately by keeping samples of past sensor readings. When querying over a large sample set, the nodes that appear frequently in the answers form a pattern that can assist in the estimation of an optimum query plan. Based on this observation, the authors propose a general framework for devising query plans with a user-defined energy budget, and apply it to answer top-k queries approximately.

4.4 Continuous Queries

Sensors deployed for monitoring interesting changes in the environment are often required to answer queries continuously. For instance, motion or sound sensors might be used to automatically turn lights on by evaluating continuous queries.

When more than one continuous query is evaluated over the readings, we can optimize storage and computation by taking advantage of the fact that the sources of the queries and their partial results could overlap. Continuously Adaptive Continuous Query (CACQ), implemented over the Telegraph query processing engine, is an adaptive eddy-based design proposed by [33] which amortizes query processing cost by sharing the execution of multiple long-running queries. As related work, we find that the approach proposed by [40] for providing approximate answers for continuous queries is applicable in certain sensor-based applications.

In long-running queries, streams from different sensors are continuously transmitted to other sensors, where the query operator is applied to the data.


Since the rate at which data is produced by an operator varies over time, a dynamic assignment of operators to nodes reduces communication costs. To achieve this, [5] have worked on optimizing the in-network placement of query operators such as aggregation, filtering, duplicate elimination and correlation, where the nodes continuously refine the operator placement.

In many applications in sensor systems, the user is more interested in a macroscopic description of the phenomenon being observed rather than the individual observations. For example, when sensors are deployed to detect fire hazards, the state of the system is either 'Safe' or 'Fire Alarm'. Specific queries regarding the state can be posed to the individual sensors once the state is detected. Recently, [22] have proposed an approach for state monitoring that comprises two processes. The first process is the learning process, where sensor readings are clustered with user constraints and the clusters are used to define rules describing the state of the system. In the state monitoring process, the nodes collaborate to update the state of the network by applying the rules to the sensor observations.

5. Compression and Modeling

In certain applications in sensor systems, the type of query or the characteristics of interesting events are not known a priori. In such scenarios, summaries of the sensor data are stored either in situ, in-network, or at the base station, and are used for answering the queries. For example, [18, 19] proposed storage of wavelet-based summaries of sensor data, in-network, at various (spatial) resolutions of the system. Progressive aging of summaries and load sharing techniques are used to ensure long-term storage and query processing.

A relevant problem is to compress the historical data from multiple streams in order to transmit them to the base station. Recently, [15] proposed the Self-Based Regression (SBR) algorithm, which provides an efficient base-signal based technique to compress historical data in sensors. The base signals that capture the prominent features of the stream are extracted from the data and are transmitted to the base station, to aid in future reconstructions of the stream. The ALVQ (Adaptive Linear Vector Quantization) algorithm proposed by [31] improves on the SBR algorithm by increasing the precision of compression and reducing the bandwidth consumption by compressing the update of the codebook. In a different approach to compressing sensor streams, [43] assume linearity of the data over small windows and evaluate a temporal compression scheme for summarizing micro-climatic data streams.

It is clear that all the research contributions discussed here have a common goal: to reduce the power consumption of the sensors. Modeling the distribution of the data streams comes in handy when there is a requirement to reduce the power consumption even further.


Figure 15.6. (a) Two-dimensional Gaussian model of the measurements from sensors S1 and S2. (b) The marginal distribution of the values of sensor S1, given S2: new observations from one sensor are used to estimate the posterior density of the other sensors.

This modeling approach is highly recommended in acquisitional systems, where considerable energy is consumed even for sensing the phenomenon, apart from the energy consumed in transmitting the values. User queries are answered based on the models, by prediction, and more data is acquired from the system if the prediction is not accurate. The accuracy of the predictions thus serves as guidance to determine which sensors should be queried to update and refine the models, so that future queries can be answered more accurately.

5.1 Data Distribution Modeling

Over recent years, there have been many research undertakings in modeling sensor data. [20] proposed an interesting framework for in-network modeling of sensor data using distributed regression. The authors use linear regression to model the data, and the coefficients of kernel-based regression models are computed in-network. This technique exploits the temporal redundancy (the redundancy in readings from a sensor over time) and spatial redundancy (sensors that are close to each other measure similar values) that are common in sensor streams. In [16], a multivariate Gaussian model over the sensors is used for answering queries pertaining to one or more of the sensors. For example, consider a range query that asks:

"Is the value of a sensor S1 within the range [a, b]?"

Instead of querying the sensor to obtain its reading for answering the query,

it is now possible to compute the probability P(S1 ∈ [a, b]) by marginalizing the multivariate distribution to obtain the density over only S1. If this probability is very high, the predicate is true, and it is false if the probability is very low. Otherwise, the sensor is required to transmit more data and the model is updated. In addition to updating the model with the new observations transmitted, the model is also updated over time with one or more transition models.
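A minimal sketch of answering such a range query from the model rather than the sensor: under a multivariate Gaussian, the marginal of a single sensor is a one-dimensional Gaussian, so P(S1 ∈ [a, b]) is a difference of two normal CDF values. The mean vector, covariance matrix and the decision thresholds below are assumptions chosen for illustration.

import math

def normal_cdf(x, mean, std):
    """Standard closed form of the Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def range_probability(mu, sigma, index, a, b):
    """P(S_index in [a, b]) under a multivariate Gaussian with mean mu and covariance sigma."""
    m = mu[index]
    s = math.sqrt(sigma[index][index])    # marginal standard deviation of S_index
    return normal_cdf(b, m, s) - normal_cdf(a, m, s)

p = range_probability(mu=[45.0, 50.0], sigma=[[9.0, 4.0], [4.0, 16.0]],
                      index=0, a=40.0, b=50.0)
if p > 0.95:
    answer = True      # predicate almost surely holds; no transmission needed
elif p < 0.05:
    answer = False     # predicate almost surely fails; no transmission needed
else:
    answer = None      # ask the sensor for a fresh reading and update the model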


Figure 15.7. Estimation of the probability distribution of the measurements over a sliding window

Measurements of low-cost attributes that are correlated with an expensive predicate can be used to predict the selectivity (i.e., the set of sensors to query) of the expensive predicate. This observation is utilized in [17] to optimize query plans for expensive predicates.

5.2 Outlier Detection

Sensors might record measurements that appear to deviate significantly from the other measurements observed. When a sensor reports abnormal observations, it might be due to an inherent variation in the phenomenon being observed or due to an erroneous data measurement procedure. In either case, such outliers are interesting and have to be communicated. [42] proposed an approach for detecting outliers in a distributed manner, through non-parametric modeling of sensor data. Probability distribution models of the data seen over a recent window are computed based on kernel density estimators, as illustrated in Figure 15.7. Since such models obtained at various sensors can be combined efficiently, this approach makes it possible to have models at different hierarchical levels of communication. The models are then used to detect outliers at various levels.
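A minimal sketch of the kernel-density idea behind this approach: each sensor fits a Gaussian kernel density estimate to the readings in its recent window and flags readings whose estimated density falls below a threshold. The bandwidth and threshold values are illustrative assumptions, not the parameters used in [42].

import math

def kde_density(x, window, bandwidth):
    """Gaussian kernel density estimate at x from the readings in the recent window."""
    n = len(window)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
               for xi in window) / (n * bandwidth * math.sqrt(2 * math.pi))

def is_outlier(x, window, bandwidth=1.0, density_threshold=0.01):
    """Flag a reading whose estimated density under the window model is too low."""
    return kde_density(x, window, bandwidth) < density_threshold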

Figure 15.8 graphically depicts the trade-offs between the model size, the desired accuracy of results and the resource consumption common in sensor systems. As seen in the figure, a sensor reporting measurements from a dynamic environment, such as sounds from outdoors, requires a large model and many message updates, compared to a sensor reporting indoor temperatures.


Figure 15.8. Trade-offs in modeling sensor data: model size and number of update messages versus accuracy (e.g., small model and few messages for indoor temperature; large model and many messages for outdoor sounds; large model and few messages for outdoor temperature)

6. Application: Tracking of Objects Using Sensor Networks

As seen in the above sections, sensor systems are potentially useful in various applications ranging from environmental data collection to defense-related monitoring. In this section, we briefly describe some recent research works that study surveillance and security management in sensor systems. In particular, we look at tracking techniques, where data from sensors are processed online, in a real-time fashion, to locate and track moving objects. Tracking of vehicles on a battlefield and tracking the spread of wildfire in forests are some examples. Typically, sensor nodes that are deployed in a field are equipped with the technology to detect the interesting objects (or, in general, events). The sensors that detect an event collaborate with each other to determine the event's location and predict its trajectory. Power savings and resilience to failures are important factors to consider while devising an efficient strategy for tracking events.

One of the first works on tracking in sensor systems is by [60], who studied the problem of tracking a mobile target using an information-theoretic approach. According to this method, the sensor that detects the target estimates the target state, determines the next best sensor and hands off the state information to it. Thus, only a single node is used to track the target at any time, and the routing decision is made based on information gain and resource cost.

Considering the problem in a different setting, [2] proposed a model for tracking a moving object with binary sensors. According to this, each sensor node communicates one bit of information to a base station. The bit denotes whether an object is approaching it or moving away from it. The authors propose a filtering-style approach for tracking the object. The method involves a centralized computational structure, which is expensive in terms of energy


consumption. In the method proposed by [10], the network is divided into clusters and the cluster heads calculate the target location based on the signal readings from the other nodes in the cluster.

The problem of tracking multiple objects has been studied by [26], where the authors propose a method based on stochastic approaches for simultaneously tracking and maintaining the identities of multiple targets. Addressing the issue of multiple-target identity management, [46] introduced the identity belief matrix, which is a doubly stochastic matrix forming a description of the identity information of each target. The matrix is computed and updated in a distributed fashion.

Apart from the above works, which present tracking techniques per se, we also see a few methods that employ a communication framework in order to track targets. The Dynamic Convoy Tree-based Collaboration (DCTC) framework proposed by [59] relies on a convoy tree, which includes the sensors around the moving target. As the target moves, the tree dynamically evolves by adding and pruning nodes. The node close to the target is the root of the tree, where all the sensing data is aggregated.

[32] discuss a group management method for track initiation and management in a target tracking application. On detecting the target, sensors send messages to each other and a leader is selected among them based on the timestamps of the messages. All sensors that detect the target abandon detection and join the group of the selected leader, and the leader takes the responsibility of maintaining the collaborative group.

Predictive target tracking based on a cluster-based approach is presented by [52], where the target's future location is predicted based on its current location. In order to determine the current location, the cluster head aggregates the information from three sensors in its cluster. Then the next location is predicted based on the assumption that it obeys a two-dimensional Gaussian distribution. [50] proposed a prediction-based energy-saving scheme for reducing the energy consumption of object tracking under acceptable conditions. The prediction model is built on the assumption that the movement of the object usually remains constant for a certain period of time. The heuristics for the wake-up mechanism consider only the predicted destination node, or all the nodes on the route from the current node to the destination node, or all the neighbors of all the nodes along the predicted route. The errors in the estimate of the target's movement are corrected by filtering and probabilistic methods, thus accurately defining the sensors to be notified.

Recently, [21] proposed a two-level approach for tracking a target by predicting its trajectory. In this scheme, a low-level loop is executed at the sensors to detect the presence of the target and estimate its trajectory using local information, whereas a global high-level loop is used to combine the local information and predict the trajectory across the system. The system is divided into cells, as


Figure 15.9. Tracking a target. The leader nodes estimate the probability of the target's direction and determine the next monitoring region that the target is going to traverse. The leaders of the cells within the next monitoring region are alerted.

shown in Figure 15.9. Kalman filters are used for predicting the target location locally, and the estimates are combined by the leaders of the cells. The probability distribution function of the target's direction and location is determined using kernel functions, and the neighboring cell leaders are alerted based on the probability estimation. See also [1, 56].
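A minimal sketch of the kind of local prediction step used in such schemes: a constant-velocity Kalman filter predicts the target's next position, which is the quantity a leader node needs in order to decide which neighboring cells to alert. The state layout, noise parameters and time step are assumptions, not the specific filter of [21].

import numpy as np

def kalman_predict(x, P, dt=1.0, q=0.1):
    """x = [px, py, vx, vy]; P is its covariance.  Returns the predicted (x, P)."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    Q = q * np.eye(4)                      # simplistic process-noise model (an assumption)
    return F @ x, F @ P @ F.T + Q

def kalman_update(x, P, z, r=0.5):
    """Fuse a position measurement z = [px, py] observed by the local sensors."""
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    R = r * np.eye(2)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P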

7. Summary

In this chapter we reviewed recent work on distributed stream processing techniques for data collected by sensor networks. We have focused on the sensor monitoring paradigm, where a large set of inexpensive sensors is deployed for surveillance or monitoring of events of interest.

The large size of the data and the distributed nature of the system necessitate the development and use of in-network storage and analysis techniques; here we have focused on the analysis part. However, future systems will operate with larger, more expensive and more capable sensors (for example, video cameras). Consequently, future research work will have to address important and fundamental issues concerning how to efficiently store, index, and analyze large datasets in sensor networks.

The development of efficient techniques for local (in the sensor) storage of the data (perhaps using inexpensive and widely available flash memory), as well as for distributed data storage, and the development and deployment of resource management techniques to manage the resources of the sensor network, will be very important in addressing these issues.


References

[1] Ali, M. H., Mokbel M. F., Aref W. G. and Kamel I. (2005) Detection and Tracking of Discrete Phenomena in Sensor-Network Databases. In Proceedings of the 17th International Conference on Scientific and Statistical Database Management.

[2] Aslam J., Butler Z., Constantin F., Crespi V., Cybenko G. and Rus D. (2003) Tracking a Moving Object with a Binary Sensor Network. In Proceedings of ACM SenSys.

[3] Babcock B. and Olston C. (2003) Distributed top-k monitoring. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

[4] Biswas R., Thrun S. and Guibas L. J. (2004) A probabilistic approach to inference with limited information in sensor networks. In Proceedings of the 3rd International Conference on Information Processing in Sensor Networks.

[5] Bonfils, B. J. and Bonnet P. (2003) Adaptive and Decentralized Operator Placement for In-Network Query Processing. In Proceedings of the 2nd International Conference on Information Processing in Sensor Networks.

[6] Bonnet P., Gehrke J. and Seshadri P. (2001) Towards Sensor Database Systems. In Proceedings of the 2nd International Conference on Mobile Data Management, London, UK.

[7] Cerpa A., Elson J., Estrin D., Girod L., Hamilton M. and Zhao J. (2001) Habitat monitoring: Application driver for wireless communications technology. In Proceedings of ACM SIGCOMM Workshop on Data Communications in Latin America and the Caribbean.

[8] Cerpa A. and Estrin D. (2002) ASCENT: Adaptive Self-Configuring Sensor Networks Topologies. In Proceedings of IEEE INFOCOM.

[9] Benjie C., Jamieson K., Balakrishnan H. and Morris R. (2001) Span: An energy-efficient coordination algorithm for topology maintenance in Ad Hoc wireless networks. In Proceedings of the 7th ACM International Conference on Mobile Computing and Networking.

[10] Chen W., Hou J. C. and Sha L. (2003) Dynamic Clustering for Acoustic Target Tracking in Wireless Sensor Networks. In Proceedings of the 11th IEEE International Conference on Network Protocols.

[11] Chu M., Haussecker H. and Zhao F. (2002) Scalable information-driven sensor querying and routing for ad hoc heterogeneous sensor networks. International Journal of High Performance Computing Applications.


[12] Considine J., Li F., Kollios G. and Byers J. (2004) Approximate aggregation techniques for sensor databases. In Proceedings of the 20th International Conference on Data Engineering.

[13] Crossbow Technology Inc. http://www.xbow.com/

[14] Deligiannakis A., Kotidis Y. and Roussopoulos N. (2004) Hierarchical in-Network Data Aggregation with Quality Guarantees. In Proceedings of the 9th International Conference on Extending Database Technology.

[15] Deligiannakis A., Kotidis Y. and Roussopoulos N. (2004) Compressing Historical Information in Sensor Networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

[16] Deshpande A., Guestrin C., Madden S. R., Hellerstein J. M. and Hong W. (2004) Model-Driven Data Acquisition in Sensor Networks. In Proceedings of the 30th International Conference on Very Large Data Bases.

[17] Deshpande A., Guestrin C., Hong W. and Madden S. (2005) Exploiting Correlated Attributes in Acquisitional Query Processing. In Proceedings of the 21st International Conference on Data Engineering.

[18] Ganesan D., Greenstein B., Perelyubskiy D., Estrin D. and Heidemann J. (2003) An Evaluation of Multi-Resolution Storage for Sensor Networks. In Proceedings of ACM SenSys.

[19] Ganesan D., Estrin D. and Heidemann J. (2003) Dimensions: Why do we need a new data handling architecture for sensor networks? ACM SIGCOMM Computer Communication Review.

[20] Guestrin C., Bodik P., Thibaux R., Paskin M. and Madden S. (2004) Distributed Regression: an Efficient Framework for Modeling Sensor Network Data. In Proceedings of the 3rd International Conference on Information Processing in Sensor Networks.

[21] Halkidi M., Papadopoulos D., Kalogeraki V. and Gunopulos D. (2005) Resilient and Energy Efficient Tracking in Sensor Networks. International Journal of Wireless and Mobile Computing.

[22] Halkidi M., Kalogeraki V., Gunopulos D., Papadopoulos D., Zeinalipour-Yazti D. and Vlachos M. (2006) Efficient Online State Tracking Using Sensor Networks. In Proceedings of the 7th International Conference on Mobile Data Management.

[23] Hammad M. A., Aref W. G. and Elmagarmid A. K. (2003) Stream Window Join: Tracking Moving Objects in Sensor-Network Databases. In Proceedings of the 15th International Conference on Scientific and Statistical Database Management.

[24] Hammad M. A., Aref W. G. and Elmagarmid A. K. (2005) Optimizing In-Order Execution of Continuous Queries over Streamed Sensor Data. In


Proceedings of the 17th International Conference on Scientific and Statis-tical Database Management.

[25] Hellerstein J. M., Hong W., Madden S. and Stanek K. (2003) Beyond Average: Toward Sophisticated Sensing with Queries. International Workshop on Information Processing in Sensor Networks.

[26] Hwang I., Balakrishnan H., Roy K., Shin J., Guibas L. and Tomlin C. (2003) Multiple Target Tracking and Identity Management. In Proceedings of the 2nd IEEE Conference on Sensors.

[27] Heinzelman W. R., Chandrakasan A. and Balakrishnan H. (2000) Energy-Efficient Communication Protocol for Wireless Microsensor Networks. In Proceedings of the 33rd Hawaii International Conference on System Sciences, Volume 8.

[28] Intanagonwiwat C., Govindan R. and Estrin D. (2000) Directed diffusion: a scalable and robust communication paradigm for sensor networks. In Proceedings of the 6th ACM International Conference on Mobile Computing and Networking.

[29] Krishnamachari B., Estrin D. and Wicker S. (2002) Modelling Data-Centric Routing in Wireless Sensor Networks. In Proceedings of the 21st Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM).

[30] Lazaridis I. and Mehrotra S. (2003) Capturing sensor-generated time series with quality guarantees. In Proceedings of the 19th International Conference on Data Engineering.

[31] Lin S., Kalogeraki V., Gunopulos D. and Lonardi S. (2006) Online Information Compression in Sensor Networks. In Proceedings of the IEEE International Conference on Communications.

[32] Liu J., Reich J., Cheung P. and Zhao F. (2003) Distributed Group Management for Track Initiation and Maintenance in Target Localization Applications. IPSN Workshop.

[33] Madden S., Shah M. A., Hellerstein J. M. and Raman V. (2002) Continuously Adaptive Continuous Queries over Streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

[34] Madden S. R. and Franklin M. J. (2002) Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. In Proceedings of the 18th International Conference on Data Engineering.

[35] Madden S. R., Franklin M. J. and Hellerstein J. M. (2002) TAG: A Tiny Aggregation Service for Ad-Hoc Sensor Networks. In Proceedings of the 5th Symposium on Operating System Design and Implementation.

[36] Madden S. R., Franklin M. J., Hellerstein J. M. and Hong W. (2003) The Design of an Acquisitional Query Processor for Sensor Networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

[37] Madden S. R., Lindner W. and Abadi D. (2005) REED: Robust, Efficient Filtering and Event Detection in Sensor Networks. In Proceedings of the 31st International Conference on Very Large Data Bases.

[38] Mainwaring A., Polastre J., Szewczyk R., Culler D. and Anderson J. (2002) Wireless Sensor Networks for Habitat Monitoring. In Proceedings of the ACM International Workshop on Wireless Sensor Networks and Applications.

[39] Omar S. A. (2001) Assessment of Oil Contamination in Oil Trenches Located in Two Contrasting Soil Types. Conference on Soil, Sediments and Water, Amherst, MA, USA.

[40] Olston C., Jiang J. and Widom J. (2003) Adaptive filters for continuous queries over distributed data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

[41] Pottie G. J. and Kaiser W. J. (2000) Wireless integrated network sensors. Communications of the ACM, Volume 43, Issue 5.

[42] Palpanas T., Papadopoulos D., Kalogeraki V. and Gunopulos D. (2003) Distributed deviation detection in sensor networks. ACM SIGMOD Record, Volume 32, Issue 4.

[43] Schoellhammer T., Osterweil E., Greenstein B., Wimbrow M. and Estrin D. (2004) Lightweight Temporal Compression of Microclimate Datasets. In Proceedings of the 29th Annual IEEE Conference on Local Computer Networks.

[44] Schurgers C., Tsiatsis V., Ganeriwal S. and Srivastava M. (2002) Topology management for sensor networks: exploiting latency and density. In Proceedings of the 3rd ACM International Symposium on Mobile Ad Hoc Networking & Computing.

[45] Sharaf M. A., Beaver J., Labrinidis A. and Chrysanthis P. K. (2004) Balancing energy efficiency and quality of aggregate data in sensor networks. The VLDB Journal.

[46] Shin J., Guibas L. and Zhao F. (2003) Distributed Identity Management Algorithm in Wireless Ad-hoc Sensor Network. 2nd International Workshop on Information Processing in Sensor Networks.

[47] Shrivastava N., Buragohain C., Agrawal D. and Suri S. (2004) Medians and Beyond: New Aggregation Techniques for Sensor Networks. In Proceedings of ACM SenSys.

[48] Silberstein A., Braynard R., Ellis C., Munagala K. and Yang J. (2006) A Sampling-Based Approach to Optimizing Top-k Queries in Sensor Networks. In Proceedings of the 22nd International Conference on Data Engineering.

[49] Xu Y., Heidemann J. and Estrin D. (2001) Geography-informed energy conservation for Ad Hoc routing. In Proceedings of the 7th ACM International Conference on Mobile Computing and Networking.

[50] Xu Y., Winter J. and Lee W. (2004) Prediction-Based Strategies for Energy Saving in Object Tracking Sensor Networks. IEEE International Conference on Mobile Data Management.

[51] Xu N., Rangwala S., Chintalapudi K. K., Ganesan D., Broad A., Govindan R. and Estrin D. (2004) A wireless sensor network for structural monitoring. In Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems.

[52] Yang H. and Sikdar B. (2003) A Protocol for Tracking Mobile Targets using Sensor Networks. IEEE International Workshop on Sensor Network Protocols and Applications.

[53] Yao Y. and Gehrke J. (2002) The cougar approach to in-network query processing in sensor networks. ACM SIGMOD Record.

[54] Yao Y. and Gehrke J. (2003) Query Processing for Sensor Networks. Conference on Innovative Data Systems Research.

[55] Ye F., Luo H., Cheng J., Lu S. and Zhang L. (2002) A Two-Tier Data Dissemination Model for Large-Scale Wireless Sensor Networks. In Proceedings of the 8th ACM International Conference on Mobile Computing and Networking.

[56] Yu X., Niyogi K., Mehrotra S. and Venkatasubramanian N. (2004) Adaptive Target Tracking in Sensor Networks. The Communication Networks and Distributed Systems Modeling and Simulation Conference.

[57] Zeinalipour-Yazti D., Vagena Z., Gunopulos D., Kalogeraki V., Tsotras V., Vlachos M., Koudas N. and Srivastava D. (2005) The threshold join algorithm for top-k queries in distributed sensor networks. In Proceedings of the 2nd International Workshop on Data Management for Sensor Networks.

[58] Zeinalipour-Yazti D., Kalogeraki V., Gunopulos D., Mitra A., Banerjee A. and Najjar W. A. (2005) Towards In-Situ Data Storage in Sensor Databases. Panhellenic Conference on Informatics.

[59] Zhang W. and Cao G. (2004) Optimizing Tree Reconfiguration for Mobile Target Tracking in Sensor Networks. In Proceedings of IEEE INFOCOM.

[60] Zhao F., Shin J. and Reich J. (2002) Information-Driven Dynamic Sensor Collaboration for Tracking Applications. IEEE Signal Processing Magazine, Volume 19.

[61] Zhao J., Govindan R. and Estrin D. (2003) Computing Aggregates for Monitoring Wireless Sensor Networks. IEEE International Workshop on Sensor Network Protocols and Applications.

Index

Age Based Model for Stream Joins, 218
ALVQ (Adaptive Linear Vector Quantization), 342
ANNCAD Algorithm, 51
approximate query processing, 170
ASCENT, 336
Aurora, 142

BasicCounting in Sliding Window Model, 150
Bayesian Network Learning from Distributed Streams, 321
Biased Reservoir Sampling, 99, 175

Change Detection, 86
Change Detection by Distribution, 86
Chebychev Inequality, 173
Chernoff Bound, 173
Classic Caching, 223
Classification in Distributed Data Streams, 297
Classification in Sensor Networks, 314
Classification of Streams, 39
Clustering in Distributed Streams, 295
Clustering streams, 10
CluStream, 10
Community Stream Evolution, 96
Compression and Modeling of Sensor Data, 342
Concept Drift, 45
Concise Sampling, 176
Content Distribution Networks, 256
Continuous Queries in Sensor Networks, 341
Continuous Query Language, 211
Correlation Query, 252
Correlation Query Monitoring, 252
COUGAR, 339
Count-Min Sketch, 191
CQL Semantics, 211
Critical Layers, 110
CVFDT, 47

Damped Window Model for Frequent Pattern Mining, 63
Data Communication in Sensor Networks, 335
Data Distribution Modeling in Sensor Networks, 343
Data Flow Diagram, 136
Decay based Algorithms in Evolving Streams, 99
Density Estimation of Streams, 88
Dimensionality Reduction of Data Streams, 261
Directed Diffusion in Sensor Networks, 336
Distributed Mining in Sensor Networks, 311
Distributed Monitoring Systems, 255
Distributed Stream Mining, 310

Ensemble based Stream Classification, 45
Equi-depth Histograms, 196
Equi-width Histograms, 196
Error Tree of Wavelet Representation, 179
Evolution, 86
Evolution Coefficient, 96
Exploiting Constraints, 214
Exponential Histogram (EH), 149
Extended Wavelets for Multiple Measures, 182

Feature Extraction for Indexing, 243
Fixed Window Sketches for Time Series, 185
Forecasting Data Streams, 261
Forward Density Profile, 91
Frequency Based Model for Stream Joins, 218
Frequent Itemset Mining in Distributed Streams, 296
Frequent Pattern Mining in Distributed Streams, 314
Frequent Pattern Mining in streams, 61
Frequent Temporal Patterns, 79
Futuristic Query Processing, 27

General Stochastic Models for Stream Joins, 219
Geographic Adaptive Fidelity, 336
Graph Stream Evolution, 96

H-Tree, 106
Haar Wavelets, 177
Hash Functions for Distinct Elements, 193
Heavy Hitters in Data Streams, 79
High Dimensional Evolution Analysis, 96
High Dimensional Projected Clustering, 22
Histograms, 196
Hoeffding Inequality, 46, 174
Hoeffding Trees, 46
HPSTREAM, 22

Iceberg Cubing for Stream Data, 112
Index Maintenance in Data Streams, 244
Indexing Data Streams, 238

Join Estimation with Sketches, 187
Join Queries in Sensor Networks, 340
Joins in Data Streams, 209

Las Vegas Algorithm, 157
LEACH, 335
LEACH Protocol, 335
Load Shedding, 127
Load Shedding for Aggregation Queries, 128
Loadshedding for classification queries, 145
Loadstar, 145
LWClass Algorithm, 49

MAIDS, 117
Markov Inequality, 173
Maximum Error Metric for Wavelets, 181, 182
Micro-clustering, 10
Micro-clusters for Change Detection, 96
micro-clusters for classification, 48
Minimal Interesting Layer, 104
Mobile Object Tracking in Sensor Networks, 345
Mobimine, 300
Monitoring an Aggregate Query, 248
Monte Carlo Algorithm, 157
Motes, 333
MR-Index, 254
Multi-resolution Indexing Architecture, 239
Multiple Measures for Wavelet Decomposition, 182, 183
MUSCLES, 265

Network Intrusion Detection, 28
Network Intrusion Detection in Distributed Streams, 293
NiagaraCQ, 313
Normalized Wavelet Basis, 179
Numerical Interval Pruning, 47

Observation Layer, 104
OLAP, 103
On Demand Classification, 24, 48
Online Information Network (OLIN), 48
Outlier Detection in Distributed Streams, 291
Outlier Detection in Sensor Networks, 344

Partial Materialization of Stream Cube, 111
Placement of Load Shedders, 136
Popular Path, 103
Prefix Path Probability for Load Shedding, 138
Privacy Preserving Stream Mining, 26
Probabilistic Modeling of Sensor Networks, 256
Projected Stream Clustering, 22
Pseudo-random number generation for sketches, 186
Punctuations for Stream Joins, 214
Pyramidal Time Frame, 12

Quantile Estimation, 198
Quantiles and Equi-depth Histograms, 198
Query Estimation, 26
Query Processing in Sensor Networks, 337
Query Processing with Wavelets, 181
Querying Data Streams, 238

Random Projection and Sketches, 184
Relational Join View, 211
Relative Error Histogram Construction, 198
Reservoir Sampling, 174
Reservoir Sampling with sliding window, 175
Resource Constrained Distributed Mining, 299
Reverse Density Profile, 91

Sampling for Histogram Construction, 198
SCALLOP Algorithm, 51
Second-Moment Estimation with Sketches, 187
Selective MUSCLES, 269
Semantics of Joins, 212
Sensor Networks, 333
Shifted Wavelet Tree, 254
Sketch based Histograms, 200
Sketch Partitioning, 189
Sketch Skimming, 190
Sketches, 184
Sketches and Sensor Networks, 191
Sketches with p-stable distributions, 190
Sliding Window Joins, 211
Sliding Window Joins and Loadshedding, 144
Sliding Window Model, 149
Sliding Window Model for Frequent Pattern Mining, 63
Spatial Velocity Profiles, 95
SPIRIT, 262
State Management for Stream Joins, 213
Statistical Forecasting of Streams, 27
STREAM, 18, 131
Stream Classification, 23
Stream Cube, 103
Stream Processing in Sensor Networks, 333
Sum Problem in Sliding-window Model, 151
Supervised Micro-clusters, 23
synopsis construction in streams, 169

Temporal Locality, 100
Threshold Join Algorithm, 341
Tilted Time Frame, 105, 108
Top-k Items in Distributed Streams, 79
Top-k Monitoring in Distributed Streams, 299
Top-k Monitoring in Sensor Networks, 341

V-Optimal Histograms, 197, 198
Variable Window Sketches for Time Series, 185
Vehicle Data Stream Mining System (VEDAS), 300
Velocity Density, 87
VFDT, 46

Wavelet based Histograms, 199
Wavelets, 177
Windowed Stream Join, 210
Wireless Sensor Networks, 333
Workload Aware Wavelet Decomposition, 184