
Encyclopedia of Data Warehousing and Mining

Second Edition

John Wang, Montclair State University, USA

Information Science Reference, Hershey • New York

Volume III (K-Pri)

Director of Editorial Content: Kristin Klinger
Director of Production: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Typesetter: Amanda Appicello, Jeff Ash, Mike Brehem, Carole Coulson, Elizabeth Duke, Jen Henderson, Chris Hrobak, Jennifer Neidig, Jamie Snavely, Sean Woznicki
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com/reference

and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanbookstore.com

Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Encyclopedia of data warehousing and mining / John Wang, editor. -- 2nd ed.
p. cm.
Includes bibliographical references and index.
Summary: "This set offers thorough examination of the issues of importance in the rapidly changing field of data warehousing and mining"--Provided by publisher.
ISBN 978-1-60566-010-3 (hardcover) -- ISBN 978-1-60566-011-0 (ebook)
1. Data mining. 2. Data warehousing. I. Wang, John.
QA76.9.D37E52 2008
005.74--dc22
2008030801

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this encyclopedia set is new, previously unpublished material. The views expressed in this encyclopedia set are those of the authors, but not necessarily those of the publisher.

If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating the library's complimentary electronic access to this publication.


Section: Data Streams

Learning from Data Streams

João Gama, University of Porto, Portugal

Pedro Pereira Rodrigues, University of Porto, Portugal

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

In the last two decades, machine learning research and practice have focused on batch learning, usually with small datasets. In batch learning, the whole training set is available to the algorithm, which outputs a decision model after processing the data, possibly multiple times. The rationale behind this practice is that examples are generated at random according to some stationary probability distribution. Also, most learners use a greedy, hill-climbing search in the space of models.

What distinguishes current datasets from earlier ones is the continuous flow of data and the automatic data feeds. We no longer just have people entering information into a computer; we have computers entering data into each other. Nowadays there are applications in which the data is best modelled not as persistent tables but rather as transient data streams. In some applications it is not feasible to load the arriving data into a traditional Database Management System (DBMS), and traditional DBMSs are not designed to directly support the continuous queries required by these applications (Babcock et al., 2002). These sources of data are called data streams. There is a fundamental difference between learning from small datasets and from large ones. As pointed out by some researchers (Brain & Webb, 2002), current learning algorithms emphasize variance reduction. However, learning from large datasets may be more effective with algorithms that place greater emphasis on bias management.

Algorithms that process data streams deliver approximate solutions, providing a fast answer using few memory resources. They relax the requirement of an exact answer to an approximate answer within a small error range with high probability. In general, as the admissible error range decreases, the computational resources required go up. In some applications, mostly database oriented, an approximate answer within an admissible error margin is sufficient. Some results on tail inequalities provided by statistics are useful to accomplish this goal.
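The Hoeffding bound is a typical example of such a tail inequality. A minimal sketch (the function name is ours) showing how the admissible error shrinks as more examples are observed, assuming independent samples from a bounded range:

```python
import math

def hoeffding_bound(value_range: float, n: int, delta: float) -> float:
    """With probability at least 1 - delta, the observed mean of n
    independent samples with range value_range deviates from the true
    mean by at most the returned epsilon."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# The admissible error decreases as the square root of the number of examples:
eps_small = hoeffding_bound(1.0, 1_000, 0.05)    # ~0.0387
eps_large = hoeffding_bound(1.0, 100_000, 0.05)  # ~0.0039
```

Note the trade-off stated above: halving the admissible error requires roughly four times as many examples.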

LEARNING ISSUES: ONLINE, ANYTIME AND REAL-TIME LEARNING

The challenge for data mining is the ability to permanently maintain an accurate decision model. This requires learning algorithms that can modify the current model whenever new data is available, at the rate of data arrival, and that forget older information when the data is outdated. In this context, the assumption that examples are generated at random according to a stationary probability distribution does not hold, at least in complex systems and over long time periods. In the presence of a non-stationary distribution, the learning system must incorporate some form of forgetting of past and outdated information. Learning from data streams requires incremental learning algorithms that take concept drift into account. Solutions to these problems require new sampling and randomization techniques, and new approximate, incremental and decremental algorithms. In (Hulten & Domingos, 2001), the authors identify desirable properties of learning systems that are able to mine continuous, high-volume, open-ended data streams as they arrive: i) incrementality, ii) online learning, iii) constant time to process each example using fixed memory, iv) a single scan over the training data, and v) taking drift into account.
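These properties can be illustrated with a deliberately simple learner. The class below is a hypothetical sketch, not one of the systems cited in this article: it is incremental, learns online, spends constant time and memory per example, scans the stream once, and offers a usable model at any time.

```python
class OnlinePerceptron:
    """A minimal online learner: one pass over the stream,
    O(n_features) time and memory per example, model usable anytime."""

    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        s = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if s >= 0 else 0

    def learn_one(self, x, y):
        # Incremental update on mistakes only; the example is then discarded.
        if self.predict(x) != y:
            delta = self.lr * (1 if y == 1 else -1)
            self.w = [wi + delta * xi for wi, xi in zip(self.w, x)]
            self.b += delta
```

Processing a stream is then just `for x, y in stream: model.learn_one(x, y)`; the model never stores past examples, which is what keeps memory fixed.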

Examples of learning algorithms designed to process open-ended streams include predictive learning: Decision Trees (Domingos & Hulten, 2000; Hulten et al., 2001; Gama et al., 2005, 2006), Decision Rules (Ferrer et al., 2005); descriptive learning: variants of k-Means Clustering (Zhang et al., 1996; Sheikholeslami et al., 1998), Clustering (Guha et al., 1998; Aggarwal et al., 2003), Hierarchical Time-Series Clustering (Rodrigues et al., 2006); association learning: Frequent Itemset Mining (Jiang & Gruenwald, 2006), Frequent Pattern Mining (Jin & Agrawal, 2007); Novelty Detection (Markou & Singh, 2003; Spinosa et al., 2007); and Feature Selection (Sousa et al., 2006).

All these algorithms share some common properties. They process examples at the rate they arrive, using a single scan of the data and fixed memory. They maintain a decision model at any time and are able to adapt the model to the most recent data.

Incremental and Decremental Issues

The ability to update the decision model whenever new information is available is an important property, but it is not enough. Another required operator is the ability to forget past information (Kifer et al., 2004). Some data stream models allow delete and update operators. Sliding-window models, for example, require forgetting old information. In these situations the incremental property is not enough. Learning algorithms need forgetting operators that reverse learning: decremental unlearning (Cauwenberghs & Poggio, 2000).
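A minimal illustration of the two operators (the class is our own sketch): a sliding-window mean in which every arriving value triggers an incremental update and every expired value a decremental one, so the statistic always reflects only the window.

```python
from collections import deque

class SlidingWindowMean:
    """Maintains the mean of the last window_size values with an
    incremental step (learn the new value) and a decremental step
    (unlearn the value that leaves the window)."""

    def __init__(self, window_size: int):
        self.window = deque()
        self.window_size = window_size
        self.total = 0.0

    def update(self, x: float) -> float:
        self.window.append(x)
        self.total += x                     # incremental: learn x
        if len(self.window) > self.window_size:
            oldest = self.window.popleft()
            self.total -= oldest            # decremental: unlearn oldest
        return self.total / len(self.window)
```

The same pattern generalizes to any model whose sufficient statistics admit both an add and a subtract operation.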

Cost-Performance Management

Learning from data streams requires updating the decision model whenever new information is available. This ability can improve the flexibility and plasticity of the algorithm in fitting the data, but at some cost, usually measured in terms of the resources (time and memory) needed to update the model. It is not easy to define where the trade-off lies between the benefits in flexibility and the cost of model adaptation. Learning algorithms exhibit different profiles. Algorithms with strong variance management are quite efficient for small training sets. Very simple models, using few free parameters, can be quite efficient in variance management and effective in incremental and decremental operations (for example, naive Bayes (Domingos & Pazzani, 1997)), being a natural choice in the sliding-windows framework. The main problem with simple approaches is the bound on the generalization performance they can achieve; they are limited by high bias. Complex tasks requiring more complex models increase the search space and the cost of structural updating. These models require efficient control strategies for the trade-off between the gain in performance and the cost of updating.
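The reason naive Bayes fits the sliding-window framework so naturally is that its sufficient statistics are counts, so both learning and unlearning an example cost O(n_features). The sketch below is a hypothetical illustration for categorical features, not the exact formulation used in the cited work:

```python
import math
from collections import defaultdict, deque

class WindowedNaiveBayes:
    """Naive Bayes over a sliding window: incrementing and decrementing
    counts makes both learning and forgetting constant-time per feature."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.window = deque()
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(int)   # key: (class, index, value)

    def _adjust(self, x, y, delta):
        self.class_counts[y] += delta
        for i, v in enumerate(x):
            self.feat_counts[(y, i, v)] += delta

    def learn_one(self, x, y):
        self.window.append((x, y))
        self._adjust(x, y, +1)                # incremental step
        if len(self.window) > self.window_size:
            old_x, old_y = self.window.popleft()
            self._adjust(old_x, old_y, -1)    # decremental step

    def predict(self, x):
        best, best_score = None, float("-inf")
        n = len(self.window)
        for y, cy in self.class_counts.items():
            if cy == 0:
                continue
            score = math.log(cy / n)
            for i, v in enumerate(x):
                # Laplace smoothing avoids zero probabilities
                score += math.log((self.feat_counts[(y, i, v)] + 1) / (cy + 2))
            if score > best_score:
                best, best_score = y, score
        return best
```

Because only counts are stored, the window of raw examples is needed solely to know what to decrement when data expires.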

Monitoring Learning

Whenever data flows over time, the assumption that the examples are generated at random according to a stationary probability distribution is highly improbable (Basseville & Nikiforov, 1993). At least in complex systems and over long time periods, we should expect changes (smooth or abrupt) in the distribution of the examples. A natural approach for these incremental tasks is adaptive learning algorithms: incremental learning algorithms that take concept drift into account.

Change Detection

Concept drift (Klinkenberg, 2004; Aggarwal, 2007) means that the concept about which data is being collected may shift from time to time, each time after some minimum period of stability. Changes occur over time, and the evidence for changes in a concept is reflected in some way in the training examples. Old observations, which reflect the past behaviour of nature, become irrelevant to the current state of the phenomenon under observation, and the learning agent must forget that information. The nature of change is diverse: changes may occur in the context of learning, due to changes in hidden variables, or in the characteristic properties of the observed variables.

Most learning algorithms use blind methods that adapt the decision model at regular intervals without considering whether changes have really occurred. Much more interesting are explicit change detection mechanisms. Their advantage is that they can provide a meaningful description of the change (indicating change points or small time windows where it occurs) and a quantification of it. They may follow two different approaches:

• Monitoring the evolution of performance indicators, adapting techniques used in Statistical Process Control (Gama et al., 2004).

• Monitoring distributions on two different time windows (Kifer et al., 2004). The method monitors the evolution of a distance function between two distributions: data in a reference window and in a current window of the most recent data points.

The main research issue is to develop methods with fast and accurate detection rates and few false alarms. A related problem is how to incorporate change detection mechanisms inside learning algorithms. The level of granularity of the decision model is also a relevant property (Gaber et al., 2004), because it can allow a partial, fast and efficient update of the decision model instead of rebuilding a complete new model whenever a change is detected. Finally, the ability to recognize seasonal and re-occurring patterns is an open issue.
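The second approach above, comparing two time windows, can be sketched very compactly. The function below is a simplified stand-in for the distance functions of Kifer et al. (2004): it flags a change when the window means differ by more than a Hoeffding-style threshold, assuming values in [0, 1].

```python
import math

def detect_change(reference, current, delta=0.01):
    """Compare a reference window against the current window of recent
    data; return True when their means differ by more than a bound that
    holds with probability 1 - delta under a stable distribution."""
    n_ref, n_cur = len(reference), len(current)
    mean_ref = sum(reference) / n_ref
    mean_cur = sum(current) / n_cur
    m = 1.0 / (1.0 / n_ref + 1.0 / n_cur)    # harmonic sample size
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * m))
    return abs(mean_ref - mean_cur) > eps
```

Tightening delta lowers the false alarm rate at the price of slower detection, which is exactly the trade-off discussed above.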

FUTURE TRENDS IN LEARNING FROM DATA STREAMS

Streaming data offers a symbiosis between Streaming Data Management Systems and Machine Learning. The techniques developed to estimate synopses and sketches require counts over very high dimensions, both in the number of examples and in the domain of the variables. On the one hand, the techniques developed in data stream management systems can provide tools for designing machine learning algorithms in these domains. On the other hand, machine learning provides compact descriptions of the data that can be useful for answering queries in a DSMS. What are the current trends and directions for future research in learning from data streams?

Incremental learning and forgetting are basic issues in stream mining. In most applications, we are interested in maintaining a decision model consistent with the current state of nature. This has led to sliding-window models, where data is continuously inserted into and deleted from a window, so learning algorithms must have operators for incremental learning and forgetting. Closely related are change detection issues. Concept drift in the predictive classification setting is a well-studied topic (Klinkenberg, 2004); in other scenarios, like clustering, very few works address the problem. The main research issue is how to incorporate change detection mechanisms in the learning algorithm for different paradigms. Another relevant aspect of any learning algorithm is the hypothesis evaluation criteria and metrics. Most evaluation methods and metrics were designed for the static case and provide a single measurement of the quality of the hypothesis. In the streaming context, we are much more interested in how the evaluation metric evolves over time. Results from sequential statistics (Wald, 1947) may be much more appropriate.
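One common sequential scheme is prequential (test-then-train) evaluation: each example first tests the current model and only then updates it, yielding an error estimate that evolves with the stream rather than a single static measurement. A sketch, assuming a model with `predict(x)` and `learn_one(x, y)` methods (our own naming):

```python
def prequential_error(model, stream):
    """Return the trajectory of the prequential error rate: for each
    example, test the current model first, then train on the example."""
    errors, seen, trajectory = 0, 0, []
    for x, y in stream:
        if model.predict(x) != y:   # test first ...
            errors += 1
        model.learn_one(x, y)       # ... then train
        seen += 1
        trajectory.append(errors / seen)
    return trajectory
```

Plotting the trajectory over time, rather than reporting its final value, makes drift and recovery visible in the evaluation itself.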

CONCLUSION

Learning from data streams is a growing research area with challenging applications and contributions from fields like databases, learning theory, machine learning, and data mining. Sensor networks, scientific data, monitoring processes, web analysis, and traffic logs are examples of real-world applications where stream algorithms have been successfully applied. Continuous learning, forgetting, self-adaptation, and self-reaction are main characteristics of any intelligent system, and they are characteristic properties of stream learning algorithms.

REFERENCES

Aggarwal, C., Han, J., Wang, J., & Yu, P. (2003). A framework for clustering evolving data streams. In VLDB 2003, Proceedings of Twenty-Ninth International Conference on Very Large Data Bases (pp. 81-92). Morgan Kaufmann.

Aggarwal, C. C. (2007). A Survey of Change Diagno-sis, In C. Aggarwal (Ed.), Data Streams: Models and Algorithms (pp. 85-102). Springer.

Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. In Lucian Popa (Ed.), Proceedings of the 21st Symposium on Principles of Database Systems (pp. 1-16). ACM Press.

Basseville, M., & Nikiforov, I. (1993). Detection of Abrupt Changes - Theory and Application. Prentice-Hall.

Brain, D., & Webb, G. (2002). The need for low bias algorithms in classification learning from large data sets. In T. Elomaa, H. Mannila, and H. Toivonen (Eds.), Principles of Data Mining and Knowledge Discovery PKDD-02 (pp. 62-73). LNAI 2431, Springer Verlag.


Cauwenberghs, G., & Poggio, T. (2000). Incremental and decremental support vector machine learning. In T. K. Leen, T. G. Dietterich and V. Tresp (Eds.), Proceedings of the 13th Neural Information Processing Systems (pp. 409-415). MIT Press.

Domingos, P., & Hulten, G. (2000). Mining High-Speed Data Streams. In Proceedings of the ACM Sixth International Conference on Knowledge Discovery and Data Mining (pp. 71-80). ACM Press.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

Ferrer, F., Aguilar, J., & Riquelme, J. (2005). Incremental rule learning and border examples selection from numerical data streams. Journal of Universal Computer Science, 11 (8), 1426-1439.

Gaber, M., Zaslavsky, A., & Krishnaswamy, S. (2004). Resource-Aware Knowledge Discovery in Data Streams. In International Workshop on Knowledge Discovery in Data Streams, ECML-PKDD04 (pp. 32-44). Tech. Report, University of Pisa.

Gama, J., Fernandes, R., & Rocha, R. (2006). Decision trees for mining data streams. Intelligent Data Analysis, 10 (1), 23-45.

Gama, J., Medas, P., Castillo, G., & Rodrigues, P. (2004). Learning with drift detection. In A. L. C. Bazzan and S. Labidi (Eds.), Proceedings of the 17th Brazilian Symposium on Artificial Intelligence (pp. 286-295). LNAI 3171. Springer.

Gama, J., Medas, P., & Rodrigues, P. (2005). Learning decision trees from dynamic data streams. In H. Haddad, L. Liebrock, A. Omicini, and R. Wainwright (Eds.), Proceedings of the 2005 ACM Symposium on Applied Computing (pp. 573-577). ACM Press.

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In L. Haas and A. Tiwary (Eds.), Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (pp. 73-84). ACM Press.

Hulten, G., & Domingos, P. (2001). Catching up with the data: research issues in mining data streams. In Proceedings of Workshop on Research issues in Data Mining and Knowledge Discovery.

Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the 7th ACM SIGKDD International conference on Knowledge discovery and data mining (pp. 97-106). ACM Press.

Jiang, N., & Gruenwald, L. (2006). Research Issues in Data Stream Association Rule Mining. SIGMOD Record, 35, 14-19.

Jin, R., & Agrawal, G. (2007). Frequent Pattern Mining in Data Streams. In C. Aggarwal (Ed.), Data Streams: Models and Algorithms (pp. 61-84). Springer.

Kargupta, H., Joshi, A., Sivakumar, K., & Yesha, Y. (2004). Data Mining: Next Generation Challenges and Future Directions. AAAI Press and MIT Press.

Kifer, D., Ben-David, S., & Gehrke, J. (2004). Detecting change in data streams. In VLDB 04: Proceedings of the 30th International Conference on Very Large Data Bases (pp. 180-191). Morgan Kaufmann Publishers Inc.

Klinkenberg, R. (2004). Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 8 (3), 281–300.

Markou, M., & Singh, S. (2003). Novelty Detection: a Review. Signal Processing, 83 (12), 2499–2521.

Motwani, R., & Raghavan, P. (1997). Randomized Algorithms. Cambridge University Press.

Muthukrishnan, S. (2005). Data streams: algorithms and applications. Now Publishers.

Rodrigues, P. P., Gama, J., & Pedroso, J. P. (2006). ODAC: Hierarchical Clustering of Time Series Data Streams. In J. Ghosh, D. Lambert, D. Skillicorn, and J. Srivastava (Eds.), Proceedings of the Sixth SIAM International Conference on Data Mining (pp. 499-503). SIAM.

Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the Twenty-Fourth International Conference on Very Large Data Bases (pp. 428-439). ACM Press.

Sousa, E., Traina, A., Traina Jr., C., & Faloutsos, C. (2006). Evaluating the Intrinsic Dimension of Evolving Data Streams. In Proceedings of the 2006 ACM Symposium on Applied Computing (pp. 643-648). ACM Press.


Spinosa, E., Carvalho, A., & Gama, J. (2007). OLINDDA: A cluster-based approach for detecting novelty and concept drift in data streams. In Proceedings of the 2007 ACM Symposium on Applied Computing (pp. 448-452). ACM Press.

Wald, A. (1947). Sequential Analysis. John Wiley and Sons, Inc.

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (pp. 103-114). ACM Press.

KEY TERMS

Adaptive Learning: Learning algorithms that self-modify their model by incorporating new information and/or forgetting outdated information.

Association Rules: Rules that describe events that are frequently observed together.

Concept Drift: Any change in the distribution of the examples that describe a concept.

Data Mining: The process of extracting useful information from large databases.

Data Stream: A continuous flow of data, possibly at high speed.

Decremental Learning: Learning that modifies the current decision model by forgetting the oldest examples.

Feature Selection: The process of selecting the attributes of a dataset that are relevant for the learning task.

Incremental Learning: Learning that modifies the current decision model using only the most recent examples.

Machine Learning: Programming computers to optimize a performance criterion using example data or past experience.

Novelty Detection: The process of identifying new and unknown patterns that a machine learning system was not aware of during training.

Online Learning: Learning by a sequence of predictions followed by rewards from the environment.