Future Generation Computer Systems 23 (2007) 31–33
www.elsevier.com/locate/fgcs

Editorial

Special section: Data mining in grid computing environments

Vlado Stankovski a,*, Werner Dubitzky b,1

a University of Ljubljana, Faculty of Civil and Geodetic Engineering, Jamova 2, SI-1000 Ljubljana, Slovenia
b University of Ulster, School of Biomedical Sciences, Cromore Road, Coleraine BT52 1SA, United Kingdom

Available online 27 June 2006

* Corresponding author. Tel.: +386 (0) 1 4768511; mobile: +386 (0) 41 200565; fax: +386 (0) 1 4250681. E-mail addresses: [email protected] (V. Stankovski), w.dubitzky@ulster.ac.uk (W. Dubitzky). URLs: http://www.stankovski.net (V. Stankovski), http://research.bioinformatics.ulster.ac.uk/ (W. Dubitzky).
1 Tel.: +44 (0) 2870 324478; fax: +44 (0) 2870 324375.

0167-739X/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2006.05.001



Data mining can be viewed as the formulation, analysis, and implementation of an induction process proceeding from specific data to general patterns that facilitates the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Data mining ranges from highly theoretical mathematical work in areas like statistics, machine learning, knowledge representation, and algorithms to systems solutions for problems like fraud detection, modeling of cancer and other complex diseases, network intrusion, information retrieval on the Web, and monitoring of grid systems. Data mining techniques are increasingly employed in traditional scientific discovery disciplines, such as biological, medical, biomedical, chemical, physical, and social research, and a variety of other knowledge industries, such as government, education, high-tech engineering, and process automation. Thus, data mining is playing an increasingly important role in structuring and shaping future knowledge-based industries and businesses. The effective and efficient management and use of stored data, and in particular the transformation of these data into information and knowledge, is considered a key requirement for success in such domains.

In the past, research in data mining has mainly been concerned with small to moderately sized data sets and knowledge-weak domains (e.g., market and retail applications) with focus on largely homogeneous and localized computing environments. These assumptions are no longer met in modern scientific and industrial complex problem-solving environments, which are increasingly relying on the sharing of geographically dispersed computing resources. Such knowledge-based sectors are characterized by an ever-increasing amount of digital data, information, and knowledge generated by the underlying activities and processes. Therefore, future knowledge discovery applications will need to operate on large (gigabytes) and very large (terabytes) data sets and against highly structured and complex domain knowledge available in digital form. The data sets, the domain knowledge, and the programs for processing, analyzing, evaluating, and visualizing these data, and other relevant resources will increasingly reside at geographically distributed sites on heterogeneous infrastructures and platforms (e.g., operating systems, hardware and software architectures and systems). Secure, effective, efficient, and user-friendly data mining tools, systems, and infrastructures are required to facilitate data mining of such distributed and heterogeneous data and information resources.

The requirements arising from such large-scale, distributed data mining scenarios are extremely challenging and it is unlikely that a single solution will emerge that meets them all. However, currently two new network-based computer technologies are emerging which hold the promise to be part of future complex data mining solutions: Web services and grid computing. Grid refers to persistent computing environments that enable software applications to integrate instruments, displays, computational, and information resources that are managed by diverse organizations in widespread locations. Web services are broadly regarded as self-contained, self-describing, modular applications that can be published, located, and invoked across the Internet. Current developments are designed to bring about a convergence of grid and Web services technology (e.g., service-oriented architectures and the Open Grid Services Architecture). The aim of this Special Section is to explore the challenges involved, applications developed, and lessons learned from efforts in bringing data mining to modern grid computing and Web services environments. Topics of interest include:

– Architectures for data mining in grid computing environments;

– Semantics in the data mining process, identification of resources for data mining, such as data sources, data mining programs, storage and computing capacity to run large-scale mining jobs, provenance tracking mechanisms;

– Data privacy and security issues;

– Data types, formats, and standards for mining data, which become more important in distributed computing environments;

– Approaches to mining inherently distributed data, i.e. data that for one reason or another cannot be physically integrated on a single node or computer;

– Data mining of truly large and high-dimensional data sets, e.g. data sets that do not fully fit into local memory;

– Tools and languages facilitating data mining in distributed computing environments, e.g. workflow concepts, visualization tools, user interfaces, programming models and languages;

– Adaptation of existing and development of new data mining algorithms that can exploit parallel computing architectures;

– Applications of data mining algorithms, tools and systems in grid, Web service and peer-to-peer systems;

– Theoretical foundations of data mining approaches in distributed computing environments.

The Call for Papers for this Special Section was published in the Spring of 2005 with a final deadline for paper submission by the end of October 2005. In response, we received a total of seventeen manuscripts. Following the reviewing process, ten papers were accepted for publication in the Special Section. These articles focus on architectural issues, applications as well as meta-scheduling. The following is a brief overview.

Congiusta, Talia, and Trunfio deal with the problems of knowledge discovery in large data repositories. They present a high-level framework, called the Knowledge Grid, that provides grid-based knowledge discovery tools and services. Knowledge Grid’s WSRF-compliant services allow users to create and manage complex knowledge discovery applications that integrate data sources and data mining tools. The paper highlights some design aspects and implementation choices involved in the WSRF-enabling process.

Perez, Sanchez, Robles, Herrero, and Pena present a novel Data Mining Grid Architecture (DMGA) that addresses the problem of executing complex data mining processes in a grid environment. The DMGA is designed to facilitate the deployment of several data mining libraries by means of grid services. In the provided example, which follows the DMGA, a data mining process is composed from two services, AprioriG and GridFTP.

Natarajan, Sion, Apte, and Narang describe a grid-based approach to enterprise-scale data mining. In their distributed environment, there are high-performance computational servers and the data are stored on several (distributed) relational database systems. Their approach relies on a simple algorithmic decomposition of the data mining kernel on the data and computational grids, while minimizing the data transfer between them.
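The general idea of decomposing a computation so that little data crosses the network can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names, the two in-memory "sites", and the choice of mean and variance as the statistic are all assumptions made for illustration. Each data site reduces its rows to three numbers, and only those numbers are sent to the computational side for merging.

```python
def local_summary(values):
    """Runs where the data lives: reduce the rows to count, sum, and sum of squares."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss


def merge_summaries(summaries):
    """Runs on the computational side: combine per-site summaries
    into the global mean and population variance."""
    n = sum(part[0] for part in summaries)
    s = sum(part[1] for part in summaries)
    ss = sum(part[2] for part in summaries)
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance


# Hypothetical data held at two separate sites (stand-ins for database tables).
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]

# Only three numbers per site are transferred, never the rows themselves.
mean, variance = merge_summaries([local_summary(site_a), local_summary(site_b)])
```

The same pattern applies to any statistic that can be assembled from per-site partial results; statistics that need all rows in one place do not decompose this simply.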

Cannataro, Guzzi, Mazza, and Veltri propose the use of ontologies to describe the semantics of basic pre-processing services as well as other data mining applications (e.g., goals, results, needed tools, data transformation activities) and to guide biologists in formulating their problems and choosing possible solutions and related services. The proposed approach is implemented in a system called MS-Analyser, a software tool for integrated management, pre-processing and mining of mass spectrometry proteomics data in grid environments.

Luo and Shi deal with the problem of using agent technology to develop data mining applications that will be executed in grid environments. They present the AGrIP architecture and its implementation. A user-friendly and extensible development toolkit, VAStudio, is also presented to aid users in developing data mining applications.

Wurst and Morik focus on scenarios in which a large number of loosely coupled nodes apply data mining to small, overlapping subsets of the entire data space. The aim is not to find a global concept that covers all data, but to learn a set of local concepts. Their prototypical application is a distributed media organization platform, called Nemoz, that assists users in maintaining their media collections. They propose a centralized and a fully decentralized peer-to-peer model for distributed feature extraction in Nemoz and evaluate both on a real-world data set.

Maran, Sild, Kahn, and Takkis provide an application of the OpenMolGRID system to molecular design and engineering. The application uses QSAR/QSPR as the core application including tools for building automated scientific workflows on top of the Unicore grid middleware. A study modeling the inhibition of aspartyl protease enzyme is presented.

The research of Luo, Lu, Shi, and He is motivated by the need for practical solutions to the meta-scheduling problem when executing distributed data mining (DDM) workflows. They propose a novel two-phase scheduling framework that includes external and internal scheduling for global (InterGrid) and local (IntraGrid) grids, respectively. Their system is implemented in an established Multi-Agent System (MAS) environment, in which the reuse of existing data mining algorithms is achieved by encapsulating them into agents.

Li, Groep, and Wolters also deal with the meta-scheduling problem. They propose an instance-based learning technique to predict job response times by mining historical performance data. For this purpose they introduce policy attributes in representing and comparing resource states, which are defined as the pools of running and queued jobs on the computational resources at the time of making predictions. The policy attributes reflect the local scheduling policies and are automatically discovered using genetic search.
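Instance-based prediction in general can be sketched as a plain k-nearest-neighbour lookup over past observations. The sketch below is only an illustration of that general idea, not the technique of Li, Groep, and Wolters: the two-number resource state (running jobs, queued jobs), the Euclidean distance, and the sample history are all assumptions, and the policy attributes and genetic search described above are not modelled.

```python
import math


def distance(a, b):
    """Euclidean distance between two resource-state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_response_time(history, state, k=3):
    """Average the response times of the k past states most similar to `state`."""
    if not history:
        raise ValueError("history must not be empty")
    nearest = sorted(history, key=lambda record: distance(record[0], state))[:k]
    return sum(response for _, response in nearest) / len(nearest)


# Hypothetical history: ((running_jobs, queued_jobs), response_time_in_seconds)
history = [
    ((2, 0), 30.0),
    ((4, 1), 60.0),
    ((8, 5), 300.0),
    ((9, 6), 360.0),
]

# Predict for a busy resource: 8 running and 6 queued jobs.
estimate = predict_response_time(history, (8, 6), k=2)
```

In this toy history the two closest past states are the two busiest ones, so the estimate is the average of their response times.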

Furtado, Flavio de Souza, Ayres, and Cirne propose and evaluate a fully-functional knowledge-based meta-scheduler, which provides abstractions to the data mining developer and optimizes the data mining application at runtime.

We feel that the contributions to this Special Section provide highly interesting and representative insights into current issues and developments of data mining in grid, peer-to-peer, and Web services environments. We hope that this Special Section will instigate new ideas for those working in this exciting field.


Finally, we wish to thank the authors for their contributions and express our gratitude to Xiaoping Sun, Andrea Pugliese, Uros Lotric, Claudi Paniagua Macia, Matevz Dolenc, Victor Robles, Ali Shaikh Ali, Jernej Trnkoczy, Hashim Mohamed, Martin Thomas Swain, Assaf Schuster, Maria S. Perez, Paolo Trunfio, Antun Balaz, Ljupo Todorovski, Mathilde Romberg, Felix Heine, Ralph Muller-Pfefferkorn, and Craig Lee for their help in reviewing the manuscripts. We would also like to thank Associate Editor Marian Bubak for the opportunity to develop this Special Section. Our work in guest-editing this Special Section was partly facilitated by the EC FP6 grant IST-2002-004475 DataMiningGrid, http://www.DataMiningGrid.org.

Vlado Stankovski was awarded his B.Sc. and M.Sc. degrees in computer science from the University of Ljubljana in 1995 and 2000, respectively. He began his career in 1995 as a consultant and later as project manager with the Fujitsu-ICL Corporation in Prague. From 1998 to 2002 he worked as a researcher at the University Medical Centre in Ljubljana. Since 2003 he has been with the Department of Civil Informatics at the Faculty of Civil and Geodetic Engineering. Currently, he is the technical manager of the EU IST DataMiningGrid project and financial manager of the EU IST InteliGrid project. He participates in the Slovene national grid-related projects GridForum.si, AgentGrid and SiGNet. He specializes in grid and semantic technologies as well as applications of machine learning techniques to engineering and medical problems.

Professor Werner Dubitzky has held a Chair of Bioinformatics at the School of Biomedical Sciences, Faculty of Life and Health Science, at the University of Ulster in Coleraine, Northern Ireland, since January 2002. He received his B.Sc. in Electrical Engineering from the Augsburg University of Applied Sciences (Germany) in 1991, and his Ph.D. in artificial intelligence, knowledge-based systems and machine learning from the University of Ulster, Northern Ireland, in 1997. In July 1992 he joined the School of Information and Software Engineering at the University of Ulster as a Research Officer. At the same institute he took up a position as Research Fellow in 1997, and became a Lecturer in 1999. In January 2000 he joined the Intelligent Bioinformatics Systems Group at the German Cancer Research Center in Heidelberg. Currently, his main areas of interest include bioinformatics, computational systems biology, data mining and data management, text mining, and grid technology. He has published over 80 papers in these areas.