

Hindawi Publishing Corporation
International Journal of Distributed Sensor Networks
Volume 2013, Article ID 763027, 20 pages
http://dx.doi.org/10.1155/2013/763027

Research Article
Identifying Optimal Spatial Groups for Maximum Coverage in Ubiquitous Sensor Network by Using Clustering Algorithms

Simon Fong,1 Weng Fai Ip,1 Elaine Liu,1 and Kyungeun Cho2

1 Department of Computer and Information Science, University of Macau, Macau
2 Department of Multimedia Engineering, Dongguk University-Seoul, Seoul 100-715, Republic of Korea

Correspondence should be addressed to Simon Fong; ccfong@umac.mo

Received 23 March 2013; Accepted 2 June 2013

Academic Editor: Sabah Mohammed

Copyright © 2013 Simon Fong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Ubiquitous sensor network has a history of applications varying from monitoring troop movement during battles in WWII to measuring traffic flows on modern highways. In particular, there lies a computational challenge in how these data can be efficiently processed for real-time intelligence. Given the data collected from ubiquitous sensor networks that have different densities distributed over a large geographical area, one can see how separate groups could be formed over them in order to maximize the total coverage by these groups. The applications could be either destructive or constructive in nature; for example, a jet fighter pilot needs to make a real-time critical decision in a split second to locate several separate targets to hit (assuming limited weapon payloads) in order to cause maximum damage when flying over an enemy terrain; a town planner is considering where to station certain resources (sites for schools, hospitals, security patrol route planning, airborne food ration drops for humanitarian aid, etc.) for maximum effect, given a vast area of different densities, for benevolent purposes. This paper explores this problem via optimal "spatial groups" clustering. Simulation experiments using clustering algorithms and linear programming are conducted to evaluate their effectiveness comparatively.

1. Introduction

Ubiquitous sensor network is a kind of wireless sensor technology [1] that has sensors distributed far and wide, usually covering a large geographical area like a forest, a battlefield, or the road networks of an urban city. A few successful case scenarios have been reported in the literature, such as monitoring vegetable freshness by using oxygen and carbon dioxide sensors in farms [2], chemical leak detection in hazardous sites [3], general-purpose sensor networks that monitor fire [4], and operation underwater [5]. What these applications have in common is the need for a postprocessing step that crunches the data, possibly in real time, to make a quick and accurate prediction out of the analysis.

In this paper, we consider a special case of postprocessing of such ubiquitous sensor network data. Given a vast distribution of sensors, each of which collects some information about its local proximity, some groups or clusters are to be formed over them. The groups should be formed in such a way that the total overall "value" of all the values from all the groups is maximized. The value(s), which should be part of the attribute information being collected by the sensors, may be something that is of the user's concern. The values usually represent the density of the proximity where a sensor stands, for example, the concentration of some chemical gas, traffic volume, importance of a military target, or even head counts of cattle or humans.

Intuitively, one would prefer the groups to be centered on the most valuable values over the area; the groups should not overlap much with each other, lest the overlapped effect gets cancelled out or wasted in vain. Here some reasonable assumptions would have to hold valid: each group would have a limited diameter of effect; each group is in the shape of a concentric circle; the areas which the circles (groups) cover sum up to a total coverage, a.k.a. the maximum net effect; and we can form only a limited number of such circles. This would be an interesting mathematical problem, and it has a significant impact on ubiquitous sensor network applications. It not only


determines how we should distribute the sensors but also, after the deployment, how these logical groups are formed, possibly for further applications.

For experiments, we attempted to apply several clustering algorithms; the choices of these algorithms are classical and popular in the data mining research community. The effectiveness of the different clustering algorithms is measured for comparison. However, none of the clustering algorithms can achieve the best results. In the end, we develop a simple and novel method based on linear programming for optimization, which we call LP. LP is shown to be able to achieve optimal grouping over different configurations and cases of experiments.

The contribution of the paper is an in-depth investigation into the grouping problem that arises right after the deployment of a ubiquitous sensor network. We propose a novel solution to achieve optimal groups by using linear programming, though several clustering algorithms have also been put to the test.

The remainder of the paper is structured as follows. Section 2 introduces the background techniques of spatial clustering. Section 3 surveys spatial data representation, that is, how spatial data are encoded for postprocessing. Section 4 describes our methodology for obtaining optimal groups over spatial data. Section 5 reports on the experiments. Section 6 analyses and compares the experimental results. Section 7 concludes the paper.

2. Overview of Spatial Data Clustering Techniques

Clustering is the organization of a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure [6]. Spatial data clustering has numerous applications in pattern recognition [7, 8], spatial data analysis [9-11], market research, and so forth [12, 13], which gather data to find all nonconcentrated models and special things among geographical datasets. Spatial data clustering is an important instrument of spatial data mining, which has become a powerful tool for efficient and complex analysis of huge spatial databases [12] with geometric features. Conventional clustering algorithms are classified into four types: partition methods, hierarchical methods, density-based methods, and grid-based methods.

However, with the extension of research objects and scope, shortcomings have been discovered: many existing spatial clustering algorithms cannot cluster reliably in the presence of irregular obstacles. A grid-density-based hierarchical clustering (HC) algorithm has been proposed to tackle this problem; the advantage of grid-based clustering algorithms is their reduced amount of calculation. An alternative approach [14] was proposed that can effectively form clusters in the presence of obstacles, where the shapes of the clusters can be arbitrarily defined. Moreover, the hierarchical strategy is used to reduce the complexity in the presence of obstacles and constraints and to improve the operation efficiency [13]; the result is that it can deal with spatial clustering in the face of obstacles and constraints and get better performance. When some data points do not fit in any cluster under density clustering, this situation is managed by using grid-based HC instead in this study. Each clustering algorithm has its individual advantages and disadvantages.

The partition approach separates N objects into m groups, where m satisfies the following constraints: firstly, each group contains at least one object; secondly, each object must belong to exactly one group. In order to achieve a global optimum in grouping, it would be necessary to enumerate all possible partitions; instead, most applications adopt KM, K-medoid, or fuzzy analysis. However, this partitioning method runs into problems when applied to spatial mining of clustered objects, especially objects that are obstructed by some environmental condition, such as a river, where it is hard to recognize the comparability.

This method can be carried out by clustering regardless of how large the number of objects is; however, cluster analysis algorithms in general cannot deal with large datasets, and it is recommended that the maximum number of objects to deal with in this method be no more than 1000. It is a stochastic hunting way based on partitioning in the clustering method; due to its low efficiency, the capability of this method is much affected by the random selection of the stochastic initial value [14].

HC is another popular clustering method, which is more flexible than partitioning-based clustering but has a higher time complexity. HC algorithms create a hierarchical decomposition (a.k.a. dendrogram) of the dataset based on some criterion [15, 16]. According to the rule of generation of the hierarchical decomposition, there are two different types of HC methods: agglomerative and divisive. An agglomerative algorithm starts with producing leaves and combines clusters in a bottom-up way. A divisive algorithm starts the clustering at the root and recursively separates the clusters in a top-down way. The process continues until a stopping criterion is met; usually it stops when the required k clusters are obtained. However, this hierarchical method suffers from some problems: vagueness of the termination criteria, and the fact that once a step is complete, it cannot be revoked.

In density-based methods, clustering continues as long as the neighborhood density (the number of objects or data points) does not grow over a certain threshold [17]. In other words, for a given point in a cluster, its neighborhood of a given radius must contain at least a minimum number of points. As a result, noisy data can be filtered, and better clusters with arbitrary shapes can be found. DBScan and its extension, called OPTICS, are two of these classical density-based methods; they perform clustering based on the notion of density-based connectivity.

Grid-based methods quantize the object space into a restricted number of cells, forming a grid structure. All clustering operations are performed on the grid structure (i.e., the quantized space). The main benefit of this method is its high speed; the run time is usually not restricted by the data size but only depends on the number of cells in each dimension. The algorithm STING (statistical information grid-based method) [18] works with numerical attributes (spatial data) and is designed to facilitate "region-oriented" queries.


Nevertheless, the spatial groups obtained by classic algorithms have certain limitations; that is, overlaps cannot be controlled, and the maximum coverage by the resultant groups is not guaranteed. Overlaps lead to resource waste and potentially resource mismatch. Besides spatial clustering, this situation occurs in other fields of application, such as information retrieval (several themes for a single document), biological data (several metabolic functions for one gene), and martial purposes (discovering object-dense regions independently). However, there has been no study reported in the literature, that the authors are aware of, that applies the LP method to discover spatial groups free of the limitations inherited from clustering algorithms. Thus this research provides an alternative method to achieve spatial groups for maximum coverage in a real environment. Maximum coverage in this context is defined as the greatest possible area of effect covered by the spatial groups, with no or minimum overlaps among the groups.

3. Spatial Data Representation

Two main categories of spatial data representation exist: spatial data and attribute data. Spatial data means georeferenced data, such as maps, photographs, and satellite imagery. Though these representation techniques originated from GIS, the underlying coding formats are common to those for wireless sensor networks, as long as they are distributed over a wide spatial area in nature. Generally, spatial data represents geographic features in complete and relative locations. Attribute data represents the characteristics of the spatial features, which can be quantitative and/or qualitative in the real world. Attribute data is often referred to as tabular data. In our experiments, we test both types of data models with different clustering algorithms for a thorough investigation.

3.1. Spatial Data Model. In the early days, spatial data was stored and represented in a map format. There are three fundamental types of spatial data models for recording geographic data digitally: vector, raster, and image.

Figure 1 illustrates the encoding techniques of the two important spatial data models [19], raster and vector, over a sample aerial image of the Adriatic Sea and coast in Italy. The image type of encoding is very similar to raster data in terms of usability of techniques, but it is limited in internal formats when it comes to modeling and analysis of the data. Images represent photographs or pictures of the landscape in a coarse matrix of pixel values.

3.2. Vector Data Model. The three kinds of aforementioned spatial data models are used for storing the geographic location of spatial features in a dataset. The vector data model uses x, y coordinates to define the locations of features; thereafter, they mark points, lines, areas, or polygons. Therefore, vector data tend to define centers, edges, and outlines of features. It characterizes the features by linear segments using sequential points or vertices. A vertex consists of a pair of x and y coordinates. The beginning or ending of a node is defined in each vertex with an arc segment. A single coordinate pair of vertices defines a feature point, and a group of coordinate pairs define polygonal features. In vector representation, as well as the connectivity between features, the storage of the vertices for each feature is important, as is the sharing of common vertices where features connect.

By using same-size polygons, we divide a complete map into small units based on the character of our database, which is represented as (x, y, v), where x and y form a coordinate pair that represents the referenced spatial position, and v represents something of interest, simply called a "feature", which could be a military target, a critical resource, or just an inhabitant clan, for example. The greater the v, the more valuable the feature is. In spatial grouping for maximum coverage, we opt to include those features that amount to the highest total value. A sample of vector format that represents a spatial location in reference to 2D is shown in Figure 2 [19].

3.3. Raster Data Model. Raster data models make use of a grid of squares to define where features are located. These squares, which are also called pixels or cells, typically are of uniform size.

From our dataset, we separate the whole image by imposing a grid on it, hence producing many individual features, with one feature corresponding to each cell. We consider using the raster data model to represent the dataset, and we store the features by the following two different encoding formats:

(1) Raster data are stored as an ordered list of cell values in pairs of (i, v), where i is a sequential number of the cell indices and v is the value of the ith feature, for example, (1, 80), (2, 80), (3, 74), (4, 62), (5, 45), and so on, as shown in Figure 3.

(2) Raster data are stored as points (x, y, v), with x and y as position coordinates locating the corresponding spatial feature with value v, for example, (1, 1, 513), (1, 2, 514), (1, 3, 517), (2, 1, 512), (2, 2, 515), and so on, as shown in Figure 4. In this case the value v refers to the center point of the grid cell. This encoding is useful for representing measured values at the center point of the cell, for example, a raster of elevation (a short code sketch of these two encodings follows this list).

(3) During the experiments, the grid size is transformed for efficient operation, so we put i² cells together as one unit representing one new grid cell, as shown in Figure 5.
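As a minimal sketch of the two raster encodings above (our own illustration in Python; the names grid, ordered_list, and points are ours, not from the paper), a small value grid can be converted into both forms as follows:

import numpy as np

# A small raster of feature values (e.g., traffic volumes), as in Figure 3.
grid = np.array([
    [80, 74, 62, 45],
    [80, 74, 74, 62],
    [74, 74, 62, 62],
])

# Encoding 1: ordered list of (i, v) pairs, with i a 1-based cell index.
ordered_list = [(i + 1, int(v)) for i, v in enumerate(grid.flatten())]

# Encoding 2: points (x, y, v), where (x, y) locates the cell.
points = [(x + 1, y + 1, int(grid[x, y]))
          for x in range(grid.shape[0])
          for y in range(grid.shape[1])]

print(ordered_list[:5])  # [(1, 80), (2, 74), (3, 62), (4, 45), (5, 80)]
print(points[:5])        # [(1, 1, 80), (1, 2, 74), (1, 3, 62), ...]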

In particular, the quadtree data structure for storing the data is found to be useful as an alternative encoding method to the raster data model. Raster embraces digital aerial photographs, imagery from satellites, digital pictures, or even scanned maps. Details on how different sorts of objects like points, lines, polygons, and terrains are represented by the data models can be found in [19-21].

4. Proposed Methodology

The aim of the methodology is to determine a certain number of clusters and their corresponding locations from some collected spatial data. In this process, different methods are tested for choosing the one that covers the most area as well as the highest feature values from the suggested clusters. The flow of this process, including preprocessing of sensor data, data transformation, clustering, and finding cluster center points, is shown in Figure 6.

[Figure 1: Representation of how a real-world spatial area is represented by vector and raster encoding formats.]

[Figure 2: Vector format; the sample point's x, y coordinates are (9, 3).]

[Figure 3: Raster format in ordered list.]

[Figure 4: Raster data with center point.]

[Figure 5: Raster format with 2² and 3² grids.]

[Figure 6: Workflow of the proposed methodology: preprocessing of the spatial image (RGB image → gray image → skeleton extraction, where a morphological operation in MATLAB (Bwmorph) and Zhang's algorithm are used for comparison → two-tone image), data transformation (gridding/indexing the image into a numerical dataset with normalization), spatial grouping (hierarchical, K-means, DBScan, ..., LP), and display of the output as a color map.]

In the case of a satellite image, or an image captured by a fighter jet or another surveillance camera, image processing is needed to extract the density information from the pictures. But in our case of a sensor network, we can safely assume that the data fed from a net of sensors would have the sensor IDs attached. The sensor IDs are known, and so are their positions. From the locations of the sensors and their sensor IDs, we can relate the collected data to their corresponding locations in the x-y format of coordinates (assuming the terrain is 2D). In order to reduce the huge amount of calculation and storage space, a grid is used to divide the whole map into smaller pieces. The grid indexing operation is repeated for a range of different coarse layers, thereby providing different resolutions of data partitions. A similar technique, computed by Euclidean distance, is reported in [22]. Obviously, the method of grid indexing helps separate data into cells based on their geographic locations.
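As a minimal sketch of this grid indexing step (assuming the sensor readings arrive as (x, y, volume) tuples; the function and parameter names are our own, not from the paper):

from collections import defaultdict

def grid_index(readings, cell_size):
    # Aggregate (x, y, volume) sensor readings into grid cells:
    # each reading goes to the cell at its position divided by the
    # cell size, and volumes falling into the same cell are summed.
    cells = defaultdict(float)
    for x, y, volume in readings:
        cells[(int(x // cell_size), int(y // cell_size))] += volume
    return dict(cells)

# Repeating the operation with larger cell sizes yields the coarser layers.
readings = [(1.0, 2.5, 80.0), (1.2, 2.7, 74.0), (9.8, 3.1, 45.0)]
print(grid_index(readings, cell_size=2))  # fine resolution
print(grid_index(readings, cell_size=5))  # coarse resolution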

To obtain a better result of spatial groups for maximum coverage, and their corresponding cluster center points under certain constraints, this research adopts several popular clustering methods and a linear programming method, using software programs such as XLMiner (http://www.solver.com/xlminer-data-mining), MATLAB (http://www.mathworks.com/products/matlab), and Weka (http://www.cs.waikato.ac.nz/ml/weka).

The core purpose of cluster analysis is to comprehend and to distinguish the extent of similarity or dissimilarity among the independently clustered objects. There are five major methods of clustering: KM, EM, XM, HC, and DBScan.

K-means (KM), by MacQueen (1967), is one of the simplest algorithms that solve the well-known clustering problem [23]. It is an easy and simple method to divide a dataset into a certain number of clusters; the main idea is to assume initially that the number of clusters, k, is fixed a priori. The random choice of the initial locations of the centroids leads to various results; a better choice is to place them as far away from each other as possible.

The KM algorithm aims at minimizing an objective function; in this case, a squared error function is as follows:

\[ J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \tag{1} \]

where j ranges from 1 to k, i ranges from 1 to n, and \(\| x_i^{(j)} - c_j \|^2\) is a chosen distance measure between a data point \(x_i^{(j)}\) and the cluster center \(c_j\); it is an indicator of the distance of the n data points from their respective cluster centers. The sum of distances, or sum of squared Euclidean distances from the mean of each cluster, is quite a normal or usual measure of scattering in all directions in the cluster, used to test the suitability of the KM algorithm. Clusters are often computed using a fast heuristic method, which generally produces good (but not necessarily optimal) solutions.

X-means (XM) [24] is an optimized variant of KM that improves the structure part of the algorithm: division of the centers is attempted in their regions, and a decision is made between the root and the children of each center by comparing the two structures. Another improved variant of KM, called EM (expectation maximization), assigns a probability distribution to each point, which represents its membership probability. The number of clusters to be set up is decided by EM using cross-validation.

Density-based algorithms regard clusters as dense areas of objects that are separated by less dense areas [25]. Because they are not limited to looking for clusters with spherical shape, they can produce clusters with arbitrary shapes. DBScan is a typical implementation of density-based algorithms, called density-based spatial clustering of applications with noise [25].


[Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.]

The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density-connected to a particular center. Any object that falls beyond a cluster is considered noise.
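For illustration, an equivalent density-based clustering run can be sketched with scikit-learn (a stand-in for the Weka implementation used in our experiments; the eps and min_samples values are assumptions):

import numpy as np
from sklearn.cluster import DBSCAN

# Grid cells as (x, y) points.
points = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [25, 25]])

# eps: neighborhood radius; min_samples: density threshold.
db = DBSCAN(eps=2.0, min_samples=2).fit(points)
print(db.labels_)  # label -1 marks noise points beyond any cluster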

Ward proposed a clustering method called hierarchical clustering (HC) in 1963 [27]. It seeks to form the partitions P_n, P_{n-1}, ..., P_1 in a way that minimizes the information loss associated with each grouping. In each analysis step, it considers every possible pair of clusters in the group and combines the two clusters whose joining results in the least "information loss", which is defined by Ward in terms of ESS (an error sum-of-squares criterion). The idea that supports Ward's proposal can be described most simply by thinking of a little single-variable data. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0) as an example. The information loss incurred by treating the ten scores as one unit, with a mean of 3.4, is obtained by calculating the ESS as follows: ESS_{one group} = (2 - 3.4)² + (7 - 3.4)² + ... + (0 - 3.4)² = 70.4. However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. For the evaluation of the ESS as a sum of squares, we can obtain four independent error sums of squares. Overall, the result that divides the 10 objects into 4 clusters has no loss of information, as follows:

\[ \mathrm{ESS}_{\text{four groups}} = \mathrm{ESS}_{\text{group 1}} + \mathrm{ESS}_{\text{group 2}} + \mathrm{ESS}_{\text{group 3}} + \mathrm{ESS}_{\text{group 4}} = 0 \tag{2} \]
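The ESS arithmetic above can be verified with a few lines of Python (our own check):

def ess(scores):
    # Error sum of squares: squared deviations from the group mean.
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores)

scores = [2, 7, 6, 6, 7, 2, 2, 0, 2, 0]
print(ess(scores))  # 70.4 when the ten scores form one group

groups = [[0, 0], [2, 2, 2, 2], [6, 6], [7, 7]]
print(sum(ess(g) for g in groups))  # 0.0: no loss of information

The same minimum-variance merging criterion is available off the shelf; a sketch with SciPy (a stand-in for the XLMiner/Weka HC implementations used later; the sample points and k are assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 0.5]])
Z = linkage(pts, method="ward")            # bottom-up (agglomerative) merges
print(fcluster(Z, t=2, criterion="maxclust"))  # stop at the required k = 2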

The last method we adopt here is linear programming (LP), which consists of formulating and producing an answer to optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many options are possible in the answers. In spatial grouping over a large grid, many possible combinations of positioning the clusters exist. The problem here is to find a certain number of clusters of equal size over the area, where the chosen centers of the clusters must be a sufficient distance apart from each other so as to avoid overlapping. As an example, shown in Figure 7, three clusters would have to be assigned over a spatial area in a way that they cover certain resources. The assignment of the clusters, however, would have to yield a maximum total value summed from the covered resources. In the example, the left diagram shows allocating three clusters over the deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, is clustering the high altitudes over the area. The last example is trying to cover the maximum of human inhabitants, who are concentrated at the coves. Given many possible ways of setting up these clusters, LP is used to formulate this allocation problem with an objective of maximizing the values of the covered resources.

Assuming that the resources could be dynamic, for example, animal herds or moving targets whose positions may swarm and change over time, the optimization is a typical maximal flow problem (or max-flow problem). The optimization is a type of network flow problem in which the goal is to determine the maximum amount of flow that can occur over an arc, which is limited by some capacity restriction. This type of network might be used to model the flow of oil in a pipeline (in which the amount of oil that can flow through a pipe in a unit of time is limited by the diameter of the pipe). Traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by the number of lanes in the streets and speed limits [28].

For our spatial clustering, we consider each cell of the grid as a node; each node is defined as a tuple m that contains the coordinates and the value of the resource held in the node, such that m(x_i, y_i, z_i) represents the ith node, in which x_i, y_i represent the position and z_i represents the value of the resource in the node, respectively.


(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)   Repeat (through all coordinates of y)
(4)     If (boundary constraints and overlapping constraints are satisfied) Then
(5)       S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)     End-if
(7)   End-loop
(8) End-loop
(9) If size-of(S) >= K
(10)   Find top K clusters where max sum(z_i) ⊕ C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i), for all i ∈ C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i), for all i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.

For the clusters, each node can potentially be the center of a cluster, and each cluster has a fixed radius of length r. The LP model for our problem is mathematically shown as follows:

\[ \text{Total value} = \bigcup_{\text{selected clusters } \langle C_k \mid k = 1 \cdots K \rangle} \ \sum_{m_i \in C_k} m_i(\ast, \ast, z_i) = \operatorname*{arg\,max}_{X, Y} \sum_{\substack{0 \le x_i \le X \\ 0 \le y_i \le Y}} \ \sum_{k=1}^{K} z_l \ni m_l(x_i, y_j, z) \oplus c_k \tag{3} \]

subject to the boundary constraints 2r ≤ |x_i - x_j| and 2r ≤ |y_i - y_j|, for all i and j with i ≠ j, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively, K is the maximum number of clusters, and c_k is the kth cluster under consideration in the optimization.

In order to implement the computation as depicted in (3), for each node we sum each group's resources in the shape of a diamond (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each current node in the combination is tested by considering it as the center of a cluster that has a radius of r, storing the resource values of the nodes of the potential clusters into a temporary array buffer A(∗, ∗, z_i). The results from those potential clusters which do satisfy the boundary and nonoverlapping constraints are then copied to a candidate buffer S. Out of the clusters whose resource values are stored in the candidate buffer S, the combination of K clusters that has the greatest total resource value is selected, and their values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
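A small brute-force Python sketch of this selection procedure is given below. It is our own illustration of the idea in Pseudocode 1, not the authors' implementation; the grid, radius r, and K are assumed, and the non-overlap test uses the diamond geometry (Manhattan distance) rather than the per-axis constraints stated above:

import numpy as np
from itertools import combinations

def cluster_value(grid, cx, cy, r):
    # Sum the resource values in a diamond (Manhattan ball) of radius r.
    total = 0.0
    for x in range(max(0, cx - r), min(grid.shape[0], cx + r + 1)):
        for y in range(max(0, cy - r), min(grid.shape[1], cy + r + 1)):
            if abs(x - cx) + abs(y - cy) <= r:
                total += grid[x, y]
    return total

def best_k_clusters(grid, k, r):
    # Exhaustively pick k non-overlapping diamond clusters of maximum value.
    cells = [(x, y) for x in range(grid.shape[0]) for y in range(grid.shape[1])]
    best, best_value = None, -1.0
    for centers in combinations(cells, k):
        # Two radius-r diamonds intersect iff their centers are within
        # Manhattan distance 2r, so such candidate pairs are rejected.
        if any(abs(a[0] - b[0]) + abs(a[1] - b[1]) <= 2 * r
               for a, b in combinations(centers, 2)):
            continue
        value = sum(cluster_value(grid, cx, cy, r) for cx, cy in centers)
        if value > best_value:
            best, best_value = centers, value
    return best, best_value

rng = np.random.default_rng(1)
grid = rng.integers(0, 100, size=(8, 8)).astype(float)
print(best_k_clusters(grid, k=3, r=1))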

5. Experimental Results and Analysis

In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place; the resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point of the roads, whereby a typical traffic volume at each of these points is known.

Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                       Bwmorph function         Thinning algorithm
                       Dataset 1   Dataset 2    Dataset 1   Dataset 2
Degree of thinning     Incomplete               Complete
Elapsed time (secs)    20          38           100         198
Complexity             O(n)                     O(n²)

The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.

5.1. Data Preprocessing. Two factual datasets are used for the experiments. The first dataset is published by the Maricopa Association of Governments in 2008, which is a traffic volume map. Traffic volumes were derived from the national traffic recording devices, with seasonal variation factored into the volumes. The second dataset is the annual average daily traffic of the Baltimore County Traffic Volume Map in 2011 in the USA, prepared by the Maryland Department of Transportation and published on March 19, 2012. The traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.

After using skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites where they can see the traffic volume data that are associated with our two datasets: (a) representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (http://phoenix.gov/streets/trafficvolume-map); (b) representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (http://www.marylandroads.com/Traffic_Volume_Maps/Traffic_Volume_Maps.pdf).


Table 2: Important statistics from the clustering and LP experiments.

Method   Cluster     Number of cells covered   Minimum    Maximum     Overlap
KM       Cluster 1   428                       0          3499327     0
         Cluster 2   468                       0          546896      0
         Cluster 3   448                       0          20503007    0
         Cluster 4   614                       0          6894667     0
         Cluster 5   618                       0          900908      0
XM       Cluster 1   615                       0          591265      0
         Cluster 2   457                       0          546896      0
         Cluster 3   609                       0          900908      0
         Cluster 4   465                       0          3499327     0
         Cluster 5   430                       0          20503007    0
EM       Cluster 1   1223                      0          2292        61817229
         Cluster 2   7                         141048     243705      313018
         Cluster 3   81                        0          3033733     131146577
         Cluster 4   64                        26752      546896      330881249
         Cluster 5   1201                      0          1300026     217950471
DB       Cluster 1   13                        23614      33146       327222911
         Cluster 2   11                        1686825    21001       363965818
         Cluster 3   13                        178888     2945283     196118393
         Cluster 4   11                        847733     211008      58940877
         Cluster 5   2528                      0          546896      20554176
HC       Cluster 1   291                       0          3499327     0
         Cluster 2   191                       0          20503007    96762283
         Cluster 3   294                       0          1590971     0
         Cluster 4   224                       0          189812      12673555
         Cluster 5   243                       0          546896      0
LP       Cluster 1   221                       0          3499327     0
         Cluster 2   221                       0          20503007    0
         Cluster 3   221                       0          1590971     0
         Cluster 4   221                       0          189812      0
         Cluster 5   221                       0          546896      0

Table 3: Comparison of running time (in seconds) for the first dataset.

Formats              KM     HC      DBScan   XM     EM     LP
Vector database      3.27   12.52   23.24    2.78   9.30   1.83
Raster database      3.42   15.36   28.20    2.84   9.84   2.01
RasterP (16 grids)   1.98   1.34    5.08     0.46   0.57   0.78
RasterP (25 grids)   0.09   0.14    1.15     0.21   0.12   0.53

The corresponding results of skeleton extraction for dataset 1 are shown in Figure 8, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm. Likewise, the corresponding results of skeleton extraction for the second dataset are shown in Figure 9, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm. The comparison result for the two datasets is shown in Table 1.

For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.

The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing. The clustering by grid can then be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.
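For readers who wish to reproduce this step outside MATLAB, a roughly equivalent sketch with scikit-image is shown below (an assumption on our part: skimage.morphology.skeletonize stands in for the Bwmorph function, and skimage.morphology.thin performs Zhang-Suen-style iterative thinning, standing in for Zhang's algorithm):

import numpy as np
from skimage.morphology import skeletonize, thin

# A toy binary road map: nonzero pixels mark roads.
image = np.zeros((9, 9), dtype=bool)
image[4, 1:8] = True    # a horizontal road
image[1:8, 4] = True    # a vertical road
image[3:6, 3:6] = True  # a thick junction

skeleton_a = skeletonize(image)  # analogous to bwmorph 'skel'
skeleton_b = thin(image)         # iterative (Zhang-Suen-style) thinning
print(skeleton_a.astype(int))
print(skeleton_b.astype(int))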

The choice of placing a grid on the image follows one principle: the mesh segmentation should not fall on a concentrated position of traffic flow. Since there is no endpoint, the midpoint between two adjacent values is considered a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. The digital data for the traffic map serves as the initial data for the subsequent clustering process.

5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum number of iterations at 100. The initial centroids are chosen randomly at the start.


Table 4: Comparison of log-likelihood for the first dataset.

Formats              KM           HC           DBScan       XM           EM
Vector database      −12.41868    −14.07265    −13.28599    −11.9533     −12.49562
Raster database      −13.42238    −15.02863    −13.78889    −12.9632     −13.39769
RasterP (16 grids)   −12.62264    −14.02266    −12.48583    −12.39419    −12.44993
RasterP (25 grids)   −12.41868    −13.19417    −11.22207    −12.48201    −11.62048

[Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.]

[Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.]

Table 5: Comparison of running time (in seconds) for the second dataset.

Formats              KM     HC      DBScan   XM     EM      LP
Vector database      1.39   1.34    15.53    1.53   10.05   3.37
Raster database      2.41   14.78   18.34    2.17   8.23    1.96
RasterP (16 grids)   0.47   8.01    12.74    0.45   3.77    1.44
RasterP (25 grids)   0.35   6.20    10.98    0.36   2.96    1.18

Furthermore, the weights for the corresponding three attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)) can be varied (fine-tuned), provided the weights sum to 1. We tested several variations, searching for the best clustering results: (1) the weight of v is 20%; (2) the weight of v is 40%; (3) the weight of v is 50%; (4) the weight of v is 60%; (5) the weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) the weight of v is 0; (8) the same weights except when g_i(v_i = 0); and (9) the weights of x and y are both 0 except when g_i(v_i = 0).
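A sketch of this attribute weighting (our own illustration, assuming min-max normalization before the weights are applied; case (2), where v carries 40% of the weight, is used as the example):

import numpy as np

def weighted_features(grid_points, w_v):
    # Normalize the (x, y, v) columns to [0, 1], then weight them.
    # The remaining weight (1 - w_v) is split equally between x and y,
    # so the three weights always sum to 1.
    g = np.asarray(grid_points, dtype=float)
    mins, maxs = g.min(axis=0), g.max(axis=0)
    normalized = (g - mins) / np.where(maxs > mins, maxs - mins, 1)
    w_xy = (1.0 - w_v) / 2.0
    return normalized * np.array([w_xy, w_xy, w_v])

grid_points = [(1, 1, 80), (1, 2, 74), (9, 3, 45), (12, 7, 0)]
print(weighted_features(grid_points, w_v=0.4))  # case (2): v weighted 40%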

In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted to measure raw numeric data, while the other two options, Jaccard's coefficient and the matching coefficient, are activated only when the data are binary.

For the above nine cases, the results of cases (1) to (6) are similar within their respective methods, and the result of (9) is the worst, as it does not accomplish any clustering. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.

For the distribution of clusters in the result of the KM clustering method, more than half of the data points are clamped into one oversized cluster; the result of this method is therefore not helpful for further operation. With the HC method, the data on average are allocated into separate clusters.


Table 6: Comparison of log-likelihood for the second dataset.

Formats              KM           HC           DBScan       XM           EM
Vector database      −17.35412    −19.62367    −17.53576    −17.21513    −16.57263
Raster database      −18.15926    −20.12568    −19.70756    −18.15791    −18.48209
RasterP (16 grids)   −15.51437    −17.24736    −16.37147    −17.01283    −15.66231
RasterP (25 grids)   −14.84761    −16.63789    −15.09146    −16.67312    −16.47823

[Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%; the top half uses the KM clustering method and the bottom half uses the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%; top half KM, bottom half HC. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0; top half KM, bottom half HC. (d) Clustering results for the first dataset with setting case (8), where all attributes share the same weight except g_i(v_i = 0); top half KM, bottom half HC.]

Table 7: Comparison of running time (in seconds) of four different sizes of dataset.

Dataset size       KM      HC       DBScan   XM     EM       LP
100 grid cells     0.06    0.07     1.05     2.19   3.21     0.18
4600 grid cells    0.42    2.95     39.89    2.73   19.05    9.37
10000 grid cells   2.62    46.67    97.55    2.97   37.85    24.21
80000 grid cells   19.75   189.61   684      6.47   198.31   90.83

The result in Figure 10(c) is the best, showing the only one with distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; therefore, allocation of critical resources in each cluster, for example, may result in a waste of resources. The degree of overlap is the least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.

The clustering results of the second dataset obtained by using the two methods, KM and HC, are shown in Figure 11.

From the results of the cluster distribution of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than that of the first dataset. And there is no overlap phenomenon in the KM results; this is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better of the two clustering methods for the sake of even cluster distribution and overlap avoidance.

5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets, using the five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen at five.


Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.029436   0.003786   0.017902   0.075178   0.013153   0.028985
Cluster 1        0.301538   0.269602   0.208078   0.049761   0.026016   0.377034
Cluster 2        0.215277   0.001627   0.158439   0.084049   0.124360   0.080099
Cluster 3        0.046788   0.096221   0.079177   0.209390   0.001172   0.217204
Cluster 4        0.002712   0.161799   0.044197   0.043152   0.304300   0.007704
Total coverage   0.595751   0.533036   0.507793   0.461531   0.469000   0.711025

[Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.]

The result for the first dataset is shown in Figure 12. In each panel, the first part (i) shows the spatial clustering result, and the second part (ii) visualizes the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This result reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups show the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters seem to be similar to each other. There is also no overlap in the clustering results, but in the group results, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application, such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the results of clustering and the corresponding groups, (d) XM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.

With the same experimental setup and operating environment, the spatial clustering experiments are performed over the second dataset. The results for the second dataset are shown in Figure 13, where (i) represents the spatial clustering result and (ii) represents the corresponding spatial group, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.

In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM, however, avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we removed the empty cells at the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap, and the clusters are balanced between each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).

By visually comparing the clustering results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in the spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those in the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information about dataset 2 is collected, and it is shown in Table 2.


[Figure 12: For each of (a) EM, (b) KM, (c) HC, (d) XM, and (e) DBScan: (i) spatial clustering on dataset 1 and (ii) spatial groups on dataset 1 from the clustering results.]

The numeric results in Table 2 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the cell numbers covered by the clusters; also, the amount of overlap in HC is the highest of all. By the LP method, the size of each cluster is exactly the same, and they are totally free from overlap.

6. Technical Analysis of Clustering Results

6.1. Experimental Evaluation Method. For the purpose of assessing the quality of the spatial groups from clustering, several evaluation factors are defined here: running time (short: time), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time each method takes, using the same software on the same computer, to run to completion. Balance measures the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests for goodness of fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of traffic volumes that are covered by the grid cells within the cluster over the whole dataset.


[Figure 13: For each of (a) EM, (b) KM, (c) HC, (d) XM, and (e) DBScan: (i) spatial clustering on dataset 2 and (ii) spatial groups on dataset 2 from the clustering results. (f) Spatial groups from the LP method on dataset 2.]


Meanwhile, total coverage is the sum of the traffic volumes that are covered by all the clusters, minus the overlaps, if any. The corresponding definitions are shown in the equations below:

\[ \text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)}, \]
\[ \text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Grid Cell Number}}, \]
\[ \text{Total Coverage} = \sum \text{Traffic Volumes} - \text{Overlaps}, \]
\[ \text{Proportion of Cluster } i \text{ Size (Balance)} = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}}. \tag{4} \]

6.2. Comparison Experimental Result. After conducting a number of experiment runs, we select four different formats of the dataset on which to perform the clustering algorithms for the first dataset: Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighborhood cells over a grid merged into a single unit; and RasterP (25 grids) means every five neighborhood cells over a grid merged into one. In the latter two formats, the data information is laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated from the grid. We selected grids of sizes 16 and 25 for the two formats. The original datasets are then encoded by the four different data formatting types. The four formatted datasets are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.

According to Table 3, we can see that KM spent the least running time for the four different kinds of data, and the runtime on the RasterP (25 grids) dataset is the fastest. Contrariwise, clustering of the vector dataset using the DBScan method spent the longest running time. Among the clustering methods, KM spent the least time for the different datasets, and DBScan took the longest.

In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best one, but clustering of RasterP (25 grids) using DBScan is the worst one.

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. And in order to stress-test the performance, we enlarge the dataset to larger sizes by expanding the data map via duplication. Running time trends are thereby produced; the result is shown in Table 7, and the corresponding trend lines are shown in Figure 14.

According to Table 5, we can see that KM spent the shortest running time for the four different formats of data, and the time on the RasterP (25 grids) dataset is the fastest, which is expected because it abstracts every 25 cells into one.

[Figure 14: Comparison of running time (in seconds) of different sizes of dataset; measured times and exponential trend lines for K-means, Hierarchical, DBScan, XMean, EM, and LP.]

On the other hand, clustering of the Raster dataset using the DBScan method spent the most running time. Among the six methods, KM generally spent the shortest time across the different datasets, and DBScan spent the longest.

In Table 6, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best one, but clustering of RasterP (25 grids) using KM is the worst one.

In Table 7, we can see that the slowest is DBScan and the quickest is the KM method. In terms of time trends, DBScan's time consumption increases in larger magnitude than that of the other methods, while the time trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.

The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.

From Figure 15, we can see that one cluster of DBScan dominates the biggest coverage among all clusters resulting from the six methods on the first dataset, but for the second dataset, the LP method yields the biggest-coverage cluster. Generally, the individual coverage of each cluster in the second dataset is apparently larger than those resulting from the first dataset (Tables 8 and 9). This means that the second dataset is suitable for achieving spatial groups with the six methods, due to its even data distribution. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method to determine spatial groups with the best coverage.


Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.042721   0.001777   0.450720   0.022150   0.013153   0.165305
Cluster 1        0.094175   0.086211   0.008018   0.010064   0.026016   0.127705
Cluster 2        0.328026   0.032893   0.010517   0.126953   0.124360   0.095597
Cluster 3        0.022797   0.351221   0.000501   0.311761   0.001172   0.089008
Cluster 4        0.062281   0.101199   0.000244   0.112973   0.304300   0.122085
Total coverage   0.550000   0.573301   0.470000   0.583900   0.469000   0.599700

Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.

Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.


Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a) we can see that one cluster of EM occupies the biggest density among all clusters of the six methods in the first dataset, but the LP method obtains the largest total density, drawn evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than that of the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.


Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.

6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in


Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density         KM        EM        DBScan    XM        HC        LP
Cluster 0       5258648   0080823   4426289   3431892   2713810   1677869
Cluster 1       1161390   2329182   0994949   1375497   3501739   1296230
Cluster 2       7186556   2545750   0807500   1218667   2728017   9703279
Cluster 3       2572683   1232386   1062069   5171040   4265905   9034426
Cluster 4       5969350   142054    0170455   1510576   4088438   1239180
Total density   1204343   1400359   4729787   1146972   1030703   6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density         KM        XM           EM           DBScan       HC           LP
Cluster 0       1925445   2476642081   396813638    1972394643   5323785326   331318
Cluster 1       1972395   1763496208   1502698729   1972394643   2140482869   166788
Cluster 2       1408149   106489095    1629795665   1437189548   1823821619   8097989
Cluster 3       3060449   6293956697   2015105986   1636350955   79912225     2474492
Cluster 4       1773937   1058346213   1275299493   1212317249   6856982634   156958
Total density   3896873   3486653421   6819713511   8230647036   5981503534   5440447

order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell numbers between the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if a very wide coverage is of priority and the others are of less concern, ω_c can take a relatively large value or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, which is the sum of all factors multiplied by the corresponding performance indicators, is a net indicator signifying how good a clustering process is, considering all the performance attributes:

\[ G_l = \left| \frac{\text{Likelihood}}{\text{Time}} \right|, \tag{5} \]

\[ G_b = \frac{\text{Difference of Balance}}{\text{Time}}, \tag{6} \]

\[ G_d = \frac{\text{Density}}{\text{Time}}, \tag{7} \]

\[ G_c = \frac{\text{Coverage}}{\text{Time}}, \tag{8} \]

\[ G_o = \frac{\text{Overlap}}{\text{Time}}, \tag{9} \]

\[ G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o, \tag{10} \]

\[ \text{subject to the constraint } \omega_l + \omega_b + \omega_d + \omega_c + \omega_o = 1. \tag{11} \]

From the results of the spatial grouping experiments in the previous sections, we obtain some statistical information on each group based on the second dataset, expressed as the range of indicators depicted in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.

In Table 12, the KM method has the best run time and no overlap; the XM method is close behind, while DBScan and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net of each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
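A minimal sketch of this scoring and normalization step is given below, assuming the weights are all equal as in the text. The factor values are placeholders in the shape of Table 12 (with Yes/No overlap mapped to a numeric overlap amount), not the measured numbers.

# Sketch of the G_net computation in (5)-(11), with equal weights.
def g_net(loglik, balance_diff, density, coverage, overlap, time,
          w=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Combine the five per-method indicators; weights must sum to 1 (11)."""
    wl, wb, wd, wc, wo = w
    assert abs(sum(w) - 1.0) < 1e-9
    g_l = abs(loglik / time)                 # (5)
    g_b = balance_diff / time                # (6)
    g_d = density / time                     # (7)
    g_c = coverage / time                    # (8)
    g_o = overlap / time                     # (9)
    return wl * g_l + wb * g_b + wd * g_d + wc * g_c + wo * g_o   # (10)

# Placeholder factor tuples: (loglik, balance_diff, density, coverage, overlap, time).
factors = {
    "KM": (-17.35, 1.90, 3.9, 0.60, 0.0, 0.41),
    "HC": (-20.13, 1.03, 6.0, 0.68, 1.2, 14.78),
}
scores = {m: g_net(*vals) for m, vals in factors.items()}
base = min(scores.values())                  # normalize: lowest G_net = 1
normalized = {m: s / base for m, s in scores.items()}
print(normalized)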

According to the experimental results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This was tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the result is not so consistent, as LP would be outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which imply priorities or preferences on the performance aspects, should be chosen at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently have spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for


Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density   Time    Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3896873   0.41    -17.35           No        190
XM       0.533037   3486653   0.67    -17.22           No        185
EM       0.507794   6819714   1.23    -16.57           Yes       1216
DBScan   0.461531   8230647   15.67   -17.54           Yes       2517
HC       0.677124   5981504   14.78   -20.13           Yes       103
LP       0.711025   5440447   7.76    NA               No        0

Table 13: Comparison of different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32

purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, which have certain sizes and positions, using clustering algorithms or their equivalent, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors; weights were also formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.

The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, there has been no study reported in the literature, that the authors are aware of, using a linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage and completely avoid overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.

For future extended study, we want to further enhance the algorithm, such as by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It would be good if the advantages of one algorithm could carry over to the others in the new fusion algorithms to be developed.

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.

[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.

[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.

[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.

[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.

[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.

[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.

[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.

[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.

[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.

[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.

[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.

[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.

[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.

[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.

[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.

[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.

[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.

[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.

[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.

[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale, event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.

[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.

[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.

[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.

[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.

[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.



determines not only how we should distribute the sensors but also, after the deployment, how these logical groups are formed, possibly for further applications.

For the experiments we applied several clustering algorithms; the chosen algorithms are classical and popular in the data mining research community. The effectiveness of the different clustering algorithms is measured for comparison. However, none of the clustering algorithms achieves the best results in all respects. In the end, we developed a simple and novel method based on linear programming for the optimization, which we call LP. LP is shown to be able to achieve optimal grouping over the different configurations and cases in our experiments.

The contribution of the paper is an in-depth investigation into the grouping problem that arises right after the deployment of a ubiquitous sensor network. We propose a novel solution for achieving optimal groups by using linear programming, after putting several clustering algorithms to the test.

The remainder of the paper is structured as follows. Section 2 introduces the background techniques of spatial clustering. Section 3 surveys spatial data representation, that is, how the spatial data are encoded for postprocessing. Section 4 describes our methodology for obtaining optimal groups over spatial data. Section 5 reports on the experiments. Section 6 analyses and compares the experimental results. Section 7 concludes the paper.

2. Overview of Spatial Data Clustering Techniques

Clustering is the organization of a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure [6]. Spatial data clustering has numerous applications in pattern recognition [7, 8], spatial data analysis [9–11], market research, and so forth [12, 13], which gather data to find nonconcentrated patterns and special objects among geographical datasets. Spatial data clustering is an important instrument of spatial data mining, which has become a popular and powerful tool for efficient and complex analysis of huge spatial databases with geometric features [12]. Conventional clustering algorithms are classified into four types: partitioning methods, hierarchical methods, density-based methods, and grid-based methods.

However, with the extension of research objects and scope, shortcomings have been discovered: many existing spatial clustering algorithms cannot cluster reliably in the presence of irregular obstacles. A grid-density-based hierarchical clustering (HC) algorithm has been proposed to tackle this problem; an advantage of grid-based clustering is a reduced amount of calculation. An alternative approach [14] can effectively form clusters in the presence of obstacles, and the shapes of the clusters can be arbitrary. Moreover, the hierarchical strategy is used to reduce the complexity in the presence of obstacles and constraints and to improve the operational efficiency [13]; the result is that it can handle spatial clustering in the face of obstacles and constraints with better performance. When some data points fit in no cluster under density clustering, this situation is managed by using grid-based HC instead in this study. Each clustering algorithm has its individual advantages and disadvantages.

The partitioning approach separates N objects into m groups, where the m groups satisfy the following constraints: firstly, each group contains at least one object; secondly, each object must belong to exactly one group. In order to achieve a global optimum in grouping, it would be necessary to enumerate all possible partitions; in practice, most applications adopt KM, K-medoids, or fuzzy analysis. However, this partitioning method runs into problems when applied to spatial mining, especially for objects that are obstructed by environmental conditions, such as a river, where it is hard to recognize comparability.

Partitioning can be carried out efficiently by clustering regardless of the number of objects, whereas cluster analysis algorithms in general cannot deal with large datasets; it is recommended that the maximum number of objects handled by such methods be no more than 1000. Partition-based clustering is a stochastic hunting method with low efficiency, and its capability is much affected by the random selection of the initial values [14].

HC is another popular clustering method, which is more flexible than partitioning-based clustering but has a higher time complexity. HC algorithms create a hierarchical decomposition (a.k.a. dendrogram) of the dataset based on some criterion [15, 16]. According to the way the hierarchical decomposition is generated, there are two different types of HC methods: agglomerative and divisive. An agglomerative algorithm starts by producing leaves and combines clusters in a bottom-up way; a divisive algorithm starts clustering at the root and recursively separates the clusters in a top-down way. The process continues until a stopping criterion is met, usually when the required k clusters are obtained. However, this hierarchical method has some problems: vagueness of the termination criteria, and once a step is complete it cannot be revoked.

In density-based methods, clustering continues as long as the neighborhood density (the number of objects or data points) does not fall below a certain threshold [17]. In other words, for each given point within a cluster, the neighborhood of a given radius must contain at least a minimum number of points. As a result, noisy data can be filtered out and better clusters with arbitrary shapes can be found. DBScan and its extension, called OPTICS, are two classical density-based methods; they perform clustering based on density-based connectivity.

Grid-based methods quantize the object space into a finite number of cells, forming a grid structure. All clustering operations are performed on the grid structure (i.e., in the quantized space). The main benefit of this method is its high speed; the run time is usually not restricted by the data size but depends only on the number of cells in each dimension. The algorithm STING (statistical information grid-based method) [18] works with numerical attributes (spatial data) and is designed to facilitate "region-oriented" queries.


Nevertheless, the spatial groups obtained by classic algorithms have certain limitations: overlaps cannot be controlled, and the maximum coverage by the resultant groups is not guaranteed. Overlaps lead to wasted resources and potentially resource mismatch. Besides spatial clustering, this situation occurs in other fields of application, such as information retrieval (several themes for a single document), biological data (several metabolic functions for one gene), and military purposes (discovering object-dense regions independently). However, there has been no study reported in the literature, that the authors are aware of, that applies an LP method to discover spatial groups free of the limitations inherited from clustering algorithms. Thus, this research provides an alternative method for achieving spatial groups with maximum coverage in a real environment. Maximum coverage in this context is defined as the greatest possible area of effect covered by the spatial groups, with no or minimal overlaps among the groups.

3. Spatial Data Representation

Two main categories of spatial data representation exist: spatial data and attribute data. Spatial data means georeferenced data, such as maps, photographs, and satellite imagery. Though these representation techniques originated from GIS, the underlying coding formats are common to those for wireless sensor networks, as long as the latter are distributed over a wide spatial area in nature. Generally, spatial data represents geographic features in terms of complete and relative locations. Attribute data represents the spatial features in terms of characteristics, which can be quantitative and/or qualitative, in the real world; attribute data is often referred to as tabular data. In our experiments we test both types of data models against different clustering algorithms for a thorough investigation.

3.1. Spatial Data Model. In the early days, spatial data was stored and represented in map format. There are three fundamental types of spatial data models for recording geographic data digitally: vector, raster, and image.

Figure 1 illustrates the encoding techniques of two important spatial data models [19], raster and vector, over a sample aerial image of the Adriatic Sea and coast in Italy. The image type of encoding is very similar to raster data in terms of usability of techniques, but it is limited in its internal formats when it comes to modeling and analysis of the data. Images represent photographs or pictures of the landscape as a coarse matrix of pixel values.

3.2. Vector Data Model. The three kinds of aforementioned spatial data models are used to store the geographic locations of spatial features in a dataset. The vector data model uses (x, y) coordinates to define the locations of features, thereby marking points, lines, areas, or polygons. Vector data thus tend to define centers, edges, and outlines of features, characterizing each feature by linear segments made of sequential points or vertices. A vertex consists of a pair of x and y coordinates. The beginning or ending of a node is defined in each vertex with an arc segment. A single coordinate pair of vertices defines a feature point; a group of coordinate pairs defines polygonal features. In vector representation, as well as the connectivity between features, the storage of the vertices for each feature is important, as is the sharing of common vertices where features connect.

By using same-size polygons, we divide a complete map into small units based on the character of our database, which is represented as (x, y, v), where x and y form a coordinate pair referencing the spatial position, and v represents something of interest, simply called a "feature", which could be, for example, a military target, a critical resource, or an inhabitant clan. The greater the v, the more valuable the feature. In spatial grouping for maximum coverage, we opt to include those features that amount to the highest total value. A sample of the vector format that represents a spatial location in 2D is shown in Figure 2 [19].

3.3. Raster Data Model. Raster data models make use of a grid of squares to define where features are located. These squares, also called pixels or cells, are typically of uniform size.

From our dataset, we separate the whole image by imposing a grid on it, hence producing many individual features, with one feature corresponding to each cell. We consider using the raster data model to represent the dataset, and we store the features in the following different encoding formats:

(1) Raster data are stored as an ordered list of cell values in pairs of (i, v), where i is a sequential number of the cell indices and v is the value of the ith feature, for example, (1, 80), (2, 80), (3, 74), (4, 62), (5, 45), and so on, as shown in Figure 3.

(2) Raster data are stored as points (x, y, v), with x and y as position coordinates locating the corresponding spatial feature with value v, for example, (1, 1, 513), (1, 2, 514), (1, 3, 517), (2, 1, 512), (2, 2, 515), and so on, as shown in Figure 4. In this case the value v refers to the center point of the grid cell. This encoding is useful for representing values measured at the center point of the cell, for example, a raster of elevations.

(3) During the experiment, the grid size is transformed for efficient operation, so we put i² cells together as one unit representing one new grid cell, as shown in Figure 5.

In particular, the quadtree data structure is found to be a useful alternative encoding method for the raster data model. Raster embraces digital aerial photographs, imagery from satellites, digital pictures, and even scanned maps. Details on how different sorts of objects, like points, lines, polygons, and terrain, are represented by the data models can be found in [19–21].
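To make the two raster encodings of Section 3.3 concrete, the following is a small sketch using a toy 2-by-3 grid of feature values; the grid contents are illustrative, not taken from the figures.

import numpy as np

# Toy grid of feature values (rows x columns).
grid = np.array([[80, 74, 62],
                 [45, 34, 39]])

# Encoding (1): ordered list of (i, v), with i a 1-based cell index.
ordered_list = [(i + 1, int(v)) for i, v in enumerate(grid.flatten())]

# Encoding (2): points (x, y, v) addressing each cell by its coordinates.
points = [(x + 1, y + 1, int(grid[x, y]))
          for x in range(grid.shape[0]) for y in range(grid.shape[1])]

print(ordered_list)  # [(1, 80), (2, 74), ..., (6, 39)]
print(points)        # [(1, 1, 80), (1, 2, 74), ..., (2, 3, 39)]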

4. Proposed Methodology

The aim of the methodology is to determine a certain number of clusters and their corresponding locations from some


Figure 1: Representation of how a real-world spatial area is represented by vector and raster encoding formats.

Figure 2: Vector format.

Figure 3: Raster format in ordered list.

collected spatial data. In this process, different methods are tested in order to choose the one that covers the most area as well as the highest feature values from the suggested clusters. The flow of this process, including preprocessing of the sensor data,

Figure 4: Raster data with center point.

Figure 5: Raster format with 2² and 3² grids.

data transformation, clustering, and finding cluster center-points, is shown in Figure 6.

In the case of a satellite image, or an image captured by a fighter jet or another surveillance camera, image processing is needed to


Figure 6: Workflow of proposed methodology (preprocessing of image: load spatial image, convert RGB image to gray image, skeleton extraction by the morphological operation in MATLAB (Bwmorph) and by Zhang's algorithm for comparison, yielding a two-tone image; data transformation: gridding/indexing the image into a numerical dataset with normalization; grouping: spatial grouping by Hierarchical, K-means, DBScan, and other algorithms, plus LP; display: color map output).

extract the density information from the pictures. But in our case of a sensor network, we can safely assume that the data fed from a net of sensors would have the sensor ID attached. The sensor IDs are known, and so are their positions. From the locations of the sensors and their sensor IDs, we can relate the collected data to the corresponding locations in the x-y format of coordinates (assuming the terrain is 2D). In order to reduce the huge amount of calculation and storage space, a grid is used to divide the whole map into smaller pieces. The grid indexing operation is repeated for a range of different coarseness layers, thereby providing different resolutions of data partitions. A similar technique is reported in [22], which is computed by Euclidean distance. Obviously, the method of grid indexing helps separate data into cells based on their geographic locations.
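A minimal sketch of such grid indexing is given below; the sensor readings and cell sizes are hypothetical, and repeating the binning with coarser cells yields the multiresolution layers mentioned above.

# Grid indexing sketch: readings keyed by known (x, y) sensor positions
# are binned into square cells of a chosen resolution (hypothetical data).
readings = [  # (x, y, sensed traffic volume)
    (1.2, 3.4, 120.0), (1.9, 3.1, 80.0), (7.5, 8.2, 45.0),
]

def grid_index(readings, cell_size):
    """Sum the sensed values per grid cell of edge length cell_size."""
    cells = {}
    for x, y, v in readings:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key] = cells.get(key, 0.0) + v
    return cells

for size in (1.0, 2.0, 4.0):   # coarser and coarser layers
    print(size, grid_index(readings, size))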

To obtain a better result of spatial groups for maximum coverage, and their corresponding cluster center points under certain constraints, the research adopts several popular clustering methods and the linear programming method, using software programs such as XLMiner (http://www.solver.com/xlminer-data-mining), MATLAB (http://www.mathworks.com/products/matlab), and Weka (http://www.cs.waikato.ac.nz/ml/weka).

The core purpose of cluster analysis is to comprehend and to distinguish the extent of similarity or dissimilarity among the independently clustered objects. There are five major methods of clustering: KM, EM, XM, HC, and DBScan.

K-means (KM), by MacQueen, 1967, is one of the simplest algorithms that solve the well-known clustering problem [23]. It is an easy and simple method to divide a dataset into a certain number of clusters, the main idea being that the number of clusters k is fixed a priori. The random choice of the initial locations of the centroids leads to varying results; a better choice is to place them as far away from each other as possible.

The KM algorithm aims at minimizing an objective function, in this case a squared error function:

\[ J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2, \tag{1} \]

where \( \| x_i^{(j)} - c_j \|^2 \) is a chosen distance measure between a data point \( x_i^{(j)} \) and the cluster center \( c_j \); it is an indicator of the distance of the n data points from their respective cluster centers. The sum of squared Euclidean distances from the mean of each cluster is the usual measure of scatter in all directions in the cluster, used to test the suitability of the KM algorithm. Clusters are often computed using a fast heuristic method, which generally produces good (but not necessarily optimal) solutions.

X-means (XM) [24] is an optimized variant of KM which improves the structure part of the algorithm. Division of the centers is attempted within their regions, and the decision between keeping the root or the children of each center is made by comparing the two structures. Another improved variant of KM, called EM (expectation maximization), assigns to each point a probability distribution representing its membership probability. The number of clusters to be set up is decided by EM using cross-validation.

Density-based algorithms regard clusters as dense areas of objects separated by less dense areas [25]. Because they are not limited to looking for clusters with spherical shapes, they can produce clusters with arbitrary shapes. DBScan is a typical implementation of density-based algorithms, called density-based spatial clustering of applications with noise


Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.

[25]. The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density-connected to a particular center. Any object that falls beyond a cluster is considered noise.

Ward proposed a clustering method called hierarchical clustering (HC) in 1963 [27]. It seeks to partition P_n, P_{n-1}, ..., P_1 in a way that reduces the relationship within each group. In each analysis step, it considers every possible pair of clusters in the group and combines the two clusters whose merger yields the smallest "information loss", which is defined by Ward in terms of ESS (an error sum-of-squares criterion). The idea that supports Ward's proposal can be described most simply by a small univariate example. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0). The loss of information incurred by treating the ten scores as one unit is obtained by calculating the ESS with a mean of 3.4, as follows: ESS_{One group} = (2 - 3.4)² + (7 - 3.4)² + ... + (0 - 3.4)² = 70.4. However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. Evaluating the ESS as a sum of squares gives four independent error sums, one per group. Overall, the partition of the 10 objects into these 4 clusters has no loss of information, as follows:

\[ \text{ESS}_{\text{One group}} = \text{ESS}_{\text{group 1}} + \text{ESS}_{\text{group 2}} + \text{ESS}_{\text{group 3}} + \text{ESS}_{\text{group 4}} = 0. \tag{2} \]

The last method we adopted here is linear programming (LP), which consists of formulating and producing an answer to optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many options are possible in the answers. In spatial grouping over a large grid, many possible combinations of positioning the clusters exist. The problem here is to find a certain number of clusters of equal size over the area, such that the chosen centers of the clusters are a sufficient distance apart so as to avoid overlapping. As an example shown in Figure 7, three clusters have to be assigned over a spatial area in such a way that they cover certain resources, and the assignment of the clusters has to yield a maximum total value summed from the covered resources. In the example, the left diagram shows allocating three clusters over deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, clusters the high altitudes over the area. The last example tries to cover the maximum number of human inhabitants, who are concentrated at the coves. Given the many possible ways of setting up these clusters, LP is used to formulate this allocation problem with the objective of maximizing the value of the covered resources.

Assuming that the resources could be dynamic, for example, animal herds or moving targets whose positions may swarm and change over time, the optimization is a typical maximal flow problem (or max-flow problem). The optimization is a type of network flow problem in which the goal is to determine the maximum amount of flow that can occur over an arc, limited by some capacity restriction. This type of network might be used to model the flow of oil in a pipeline (in which the amount of oil that can flow through a pipe in a unit of time is limited by the diameter of the pipe). Traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by the number of lanes in the streets and the speed limits [28].

For our spatial clustering, we consider each cell of the grid as a node. Each node is defined as a tuple m that contains the coordinates and the value of the resource held in the node, such that m(x_i, y_i, z_i) represents the ith node, in which x_i and y_i represent the position and z_i represents the value of the resource in the node, respectively. For the clusters, each node


(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)   Repeat (through all coordinates of y)
(4)     If (boundary constraints and overlapping constraints are satisfied) Then
(5)       S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)     End-if
(7)   End-loop
(8) End-loop
(9) If size-of(S) >= K
(10)   Find the top K clusters where max Σ z_i ⊕ C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i) for all i ∈ C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i) for all i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.

can potentially be a center of a cluster, and the cluster has a fixed radius of length r. The LP model for our problem is mathematically shown as follows:

\[ \text{Total value} = \bigcup_{\text{selected clusters } \langle C_k \mid k=1,\ldots,K \rangle} \; \sum_{m_i \in C_k} m_i(*, *, z_i) = \underset{X,Y}{\operatorname{argmax}} \sum_{\substack{0 \le x_i \le X \\ 0 \le y_j \le Y}} \sum_{k=1}^{K} z_l \ni m_l(x_i, y_j, z) \oplus c_k, \tag{3} \]

subject to the boundary constraints 2r ≤ |x_i − x_j| and 2r ≤ |y_i − y_j| for all i and j with i ≠ j, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively; K is the maximum number of clusters; and c_k is the kth cluster under consideration in the optimization.

In order to implement the computation depicted in (3), for each node we sum each group's resources in the shape of a diamond (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each current node in the combination is tested by considering it as the center of a cluster that has a radius of r, storing the resource values of the nodes from the potential clusters into a temporary array buffer A(*, *, z_i). The results from those potential clusters which satisfy the boundary and nonoverlapping constraints are then copied to a candidate buffer S. Out of the clusters whose resource values are stored in the candidate buffer S, the combination of K clusters that has the greatest total resource value is selected, and their values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
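For readers who prefer a concrete rendition, the following is a minimal brute-force sketch of Pseudocode 1 for small grids; the function names and the toy grid are our own, the center-separation test follows the constraint form in (3), and a real instance would call for an LP solver rather than exhaustive enumeration.

import numpy as np
from itertools import combinations

def diamond_sum(z, cx, cy, r):
    """Total resource inside the diamond |dx| + |dy| <= r around (cx, cy)."""
    X, Y = z.shape
    return sum(z[x, y]
               for x in range(max(0, cx - r), min(X, cx + r + 1))
               for y in range(max(0, cy - r), min(Y, cy + r + 1))
               if abs(x - cx) + abs(y - cy) <= r)

def best_clusters(z, K, r):
    X, Y = z.shape
    centers = [(x, y) for x in range(X) for y in range(Y)]
    value = {c: diamond_sum(z, c[0], c[1], r) for c in centers}  # precompute
    best, best_val = None, -1.0
    for combo in combinations(centers, K):
        # constraint from (3): any two centers must be at least 2r apart
        # along both axes, which rules out overlapping diamonds
        if any(abs(a[0] - b[0]) < 2 * r or abs(a[1] - b[1]) < 2 * r
               for a, b in combinations(combo, 2)):
            continue
        val = sum(value[c] for c in combo)
        if val > best_val:
            best, best_val = combo, val
    return best, best_val

z = np.random.default_rng(2).integers(0, 100, size=(10, 10)).astype(float)
print(best_clusters(z, K=3, r=2))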

5. Experimental Results and Analysis

In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place. The resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point

Table 1: Comparison between Bwmorph function and thinning algorithm.

                      Bwmorph function           Thinning algorithm
                      Dataset 1    Dataset 2     Dataset 1    Dataset 2
Degree of thinning    Incomplete                 Complete
Elapsed time (secs)   20           38            100          198
Complexity            O(n)                       O(n²)

of the roads; thereby a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.

5.1. Data Preprocessing. Two different factual datasets are used for the experiments. The first dataset, published by the Maricopa Association of Governments in 2008, is a traffic volume map. Traffic volumes were derived from the national traffic recording devices; seasonal variation is factored into the volumes. The second dataset is the annual average daily traffic of the Baltimore County Traffic Volume Map in 2011, USA, prepared by the Maryland Department of Transportation and published on March 19, 2012. The traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.

After using skeleton extraction, a two-tone image is obtained from the original map. Readers are referred to the respective websites, where the traffic volume data associated with our two datasets can be seen: (a) the representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (http://phoenix.gov/streets/trafficvolume-map); (b) the representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (http://www.marylandroads.com/Traffic_Volume_Maps/Traffic_Volume_Maps.pdf).
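A sketch of this preprocessing step is given below, assuming scikit-image is available; the MATLAB counterpart used in the paper is the Bwmorph morphological operation, and the file name here is a placeholder.

from skimage import io, color, morphology

img = io.imread("traffic_map.png")    # placeholder file name
gray = color.rgb2gray(img)            # RGB image -> gray image
binary = gray < 0.5                   # two-tone image, roads as foreground
skeleton = morphology.thin(binary)    # iterative thinning to 1-pixel lines
io.imsave("skeleton.png", (skeleton * 255).astype("uint8"))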


Table 2: Important statistics from the clustering and LP experiments.

Method   Cluster number   Number of cells covered   Minimum   Maximum    Overlap
KM       Cluster 1        428                       0         3499327    0
KM       Cluster 2        468                       0         546896     0
KM       Cluster 3        448                       0         20503007   0
KM       Cluster 4        614                       0         6894667    0
KM       Cluster 5        618                       0         900908     0
XM       Cluster 1        615                       0         591265     0
XM       Cluster 2        457                       0         546896     0
XM       Cluster 3        609                       0         900908     0
XM       Cluster 4        465                       0         3499327    0
XM       Cluster 5        430                       0         20503007   0
EM       Cluster 1        1223                      0         2292       61817229
EM       Cluster 2        7                         141048    243705     313018
EM       Cluster 3        81                        0         3033733    131146577
EM       Cluster 4        64                        26752     546896     330881249
EM       Cluster 5        1201                      0         1300026    217950471
DB       Cluster 1        13                        23614     33146      327222911
DB       Cluster 2        11                        1686825   21001      363965818
DB       Cluster 3        13                        178888    2945283    196118393
DB       Cluster 4        11                        847733    211008     58940877
DB       Cluster 5        2528                      0         546896     20554176
HC       Cluster 1        291                       0         3499327    0
HC       Cluster 2        191                       0         20503007   96762283
HC       Cluster 3        294                       0         1590971    0
HC       Cluster 4        224                       0         189812     12673555
HC       Cluster 5        243                       0         546896     0
LP       Cluster 1        221                       0         3499327    0
LP       Cluster 2        221                       0         20503007   0
LP       Cluster 3        221                       0         1590971    0
LP       Cluster 4        221                       0         189812     0
LP       Cluster 5        221                       0         546896     0

Table 3: Comparison of running time for the first dataset.

Formats              KM     HC      DBScan   XM     EM     LP
Vector database      3.27   12.52   23.24    2.78   9.30   1.83
Raster database      3.42   15.36   28.20    2.84   9.84   2.01
RasterP (16 grids)   1.98   1.34    5.08     0.46   0.57   0.78
RasterP (25 grids)   0.09   0.14    1.15     0.21   0.12   0.53

The corresponding results of skeleton extraction for dataset 1 are shown in Figure 8, where (a) uses a morphological operation method and (b) uses the thinning algorithm. Likewise, the corresponding results of skeleton extraction for the second dataset are shown in Figure 9, where (a) uses a morphological operation method and (b) uses the thinning algorithm. The comparison between the two approaches on the two datasets is shown in Table 1.

For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.

The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing, and the clustering by grid can be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.

The choice of placing a grid on the image follows one principle: mesh boundaries should not fall on a concentrated position of traffic flow. Since there is no endpoint, the midpoint between two adjacent values is taken as a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. This digital data for the traffic map serves as the initial data for the subsequent clustering process.
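The gridding step could be scripted roughly as follows, assuming pandas is available; the pixel list, cell size, and file names are illustrative, and writing the Excel file requires an Excel writer such as openpyxl (CSV would work equally well).

import pandas as pd

# Hypothetical skeleton pixels carrying traffic volumes.
pixels = pd.DataFrame({
    "x": [12, 13, 45, 46, 201], "y": [7, 8, 90, 91, 155],
    "volume": [513.0, 517.0, 120.0, 80.0, 45.0]})

cell = 32  # grid cell edge length in pixels (illustrative)
pixels["gx"] = pixels["x"] // cell
pixels["gy"] = pixels["y"] // cell
grid = pixels.groupby(["gx", "gy"], as_index=False)["volume"].sum()
grid.to_excel("traffic_grid.xlsx", index=False)  # initial data for clustering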

5.2. Comparison Results of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,


Table 4: Comparison of log-likelihood for the first dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -12.41868   -14.07265   -13.28599   -11.9533    -12.49562
Raster database      -13.42238   -15.02863   -13.78889   -12.9632    -13.39769
RasterP (16 grids)   -12.62264   -14.02266   -12.48583   -12.39419   -12.44993
RasterP (25 grids)   -12.41868   -13.19417   -11.22207   -12.48201   -11.62048

Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.

Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.

Table 5: Comparison of running time for the second dataset.

Formats              KM     HC      DBScan   XM     EM      LP
Vector database      1.39   1.34    15.53    1.53   10.05   3.37
Raster database      2.41   14.78   18.34    2.17   8.23    1.96
RasterP (16 grids)   0.47   8.01    12.74    0.45   3.77    1.44
RasterP (25 grids)   0.35   6.20    10.98    0.36   2.96    1.18

the weights for the three corresponding attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)) could be varied (fine-tuned), with the sum of weights always equal to 1. We tested several variations, searching for the best clustering results: (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes have the same weight at 33.3%; (7) weight of v is 0; (8) the same weight for all, except when g_i(v_i = 0); and (9) weights of x and y are both 0, except when g_i(v_i = 0).

In the HC method, normalization of the input data was chosen. Another option available is the similarity measure, which adopts Euclidean distance to measure raw numeric data; the other two options, Jaccard's coefficients and matching coefficient, are activated only when the data is binary.
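The weighting scheme could be realized as in the sketch below: min-max normalize each attribute, then scale by weights summing to 1 before feeding the rows to KM or HC with Euclidean distance. The function name and default weights (case (2), weight of v at 40%) are our own.

import numpy as np

def weighted_features(cells, w_x=0.3, w_y=0.3, w_v=0.4):
    """Min-max normalize (x, y, v) rows, then apply attribute weights."""
    assert abs(w_x + w_y + w_v - 1.0) < 1e-9   # weights must sum to 1
    cells = np.asarray(cells, dtype=float)
    mins, maxs = cells.min(axis=0), cells.max(axis=0)
    normalized = (cells - mins) / np.where(maxs > mins, maxs - mins, 1.0)
    return normalized * np.array([w_x, w_y, w_v])

cells = [(1, 1, 513), (1, 2, 514), (2, 1, 512), (2, 2, 515)]
features = weighted_features(cells)   # ready for KM/HC with L2 distance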

For the above nine cases, the results of cases (1) to (6) are similar within their separate methods, and the result of (9) is the worst, accomplishing no clustering at all. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.

For the distribution of clusters in the result of the KM clustering method, more than half of the data points are clamped into one oversized cluster; the result of this method is therefore not helpful for further operation. For the HC method, the data on average are allocated into separate clusters. The result


Table 6: Comparison of log-likelihood for the second dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -17.35412   -19.62367   -17.53576   -17.21513   -16.57263
Raster database      -18.15926   -20.12568   -19.70756   -18.15791   -18.48209
RasterP (16 grids)   -15.51437   -17.24736   -16.37147   -17.01283   -15.66231
RasterP (25 grids)   -14.84761   -16.63789   -15.09146   -16.67312   -16.47823

Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%: the top half uses the KM clustering method and the bottom half uses the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%: the top half uses KM and the bottom half uses HC. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0: the top half uses KM and the bottom half uses HC. (d) Clustering results for the first dataset with setting case (8), where all attributes share the same weight except when g_i(v_i = 0): the top half uses KM and the bottom half uses HC.

Table 7: Comparison of running time (in seconds) of four different sizes of dataset.

Dataset size       KM      HC       DBScan   XM     EM       LP
100 grid cells     0.06    0.07     1.05     2.19   3.21     0.18
4600 grid cells    0.42    2.95     39.89    2.73   19.05    9.37
10000 grid cells   2.62    46.67    97.55    2.97   37.85    24.21
80000 grid cells   19.75   189.61   684      6.47   198.31   90.83

in Figure 10(c) is the best, being the only one driven by distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; therefore, allocation of critical resources in each cluster, for example, may result in a waste of resources. The degree of overlap is the least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.

The clustering results of the second dataset, produced by the two methods KM and HC, are shown in Figure 11.

From the results of the cluster distribution of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than for the first dataset. There is no overlap phenomenon in the KM results; this is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two clustering methods, in consideration of even cluster distribution and overlap avoidance.

5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen as five. The


Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.029436   0.003786   0.017902   0.075178   0.013153   0.028985
Cluster 1        0.301538   0.269602   0.208078   0.049761   0.026016   0.377034
Cluster 2        0.215277   0.001627   0.158439   0.084049   0.124360   0.080099
Cluster 3        0.046788   0.096221   0.079177   0.209390   0.001172   0.217204
Cluster 4        0.002712   0.161799   0.044197   0.043152   0.304300   0.007704
Total coverage   0.595751   0.533036   0.507793   0.461531   0.469000   0.711025

Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.

result of the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result; the second part (ii) visualizes the corresponding spatial groups by using (a) EM method, (b) KM method, (c) HC method, (d) XM method, and (e) DBScan method. The centers of the clusters are computed after clustering is done, and then the groups are visualized over the clustering results according to the computed centers.

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven; more than half of the dataset falls into one cluster. This result reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure, and the corresponding groups exhibit the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters are similar to each other and there is no overlap in the clustering result, but for the group result, the groups in (d) have far more overlaps than those in (b). Overlap means some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application, such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and the corresponding groups, (d) XM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.

With the same experimental setup and operating environment, the spatial clustering experiments are performed on the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) the corresponding spatial groups, by using (a) EM method, (b) KM method, (c) HC method, (d) XM method, and (e) DBScan method.

In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups produced by (a) EM and (e) DBScan. The results of (b) KM and (d) XM avoid these shortcomings, though they still have slight overlaps. For (c) HC, we remove the empty cells at the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap and the clusters are balanced. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the groups at the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).

Visually comparing the two datasets, the clustering results appear similar, but the spatial groups are somewhat different. Overlaps in the spatial groups are more severe in the first dataset than in the second. The overlaps are likely due to the data distribution and the balance in sizes between clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the computed cluster locations tend to cram very near one crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area; as seen from its visualized spatial groups, the cluster positions are a little farther apart compared to those of the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected and shown in Table 2. The numeric results in



Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering by using KM; (ii) spatial groups from KM. (c) (i) Spatial clustering by using HC; (ii) spatial groups from HC. (d) (i) Spatial clustering by using XM; (ii) spatial groups from XM. (e) (i) Spatial clustering by using DBScan; (ii) spatial groups from DBScan.

Table 2 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the number of cells covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.

6. Technical Analysis of Clustering Results

6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion with the same software on the same computer. Balance measures the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster is the proportion of traffic volumes covered by the grid cells within the cluster relative to the whole dataset, and total coverage is the sum of



Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering by using KM; (ii) spatial groups from KM. (c) (i) Spatial clustering by using HC; (ii) spatial groups from HC. (d) (i) Spatial clustering by using XM; (ii) spatial groups from XM. (e) (i) Spatial clustering by using DBScan; (ii) spatial groups from DBScan. (f) Spatial groups from the LP method on dataset 2.


traffic volumes that are covered by all the clusters, minus the overlaps, if any. The corresponding definitions are given in the equations below:

Density(cluster i) = Σ Traffic Volumes(cluster i) / Grid Cell Number(cluster i),

Coverage(cluster i) = Σ Traffic Volumes(cluster i) / Σ Grid Cell Number,

Total Coverage = Σ Traffic Volumes − Overlaps,

Proportion of Cluster(i) Size (Balance) = Grid Cell Number(cluster i) / Σ Grid Cell Number.    (4)
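To make the definitions in (4) concrete, the following minimal Python sketch computes them for clusters given as lists of per-cell traffic volumes. All names and numbers are illustrative, not taken from the original experiments:

    # Sketch of the metrics in (4); each cluster is a list of per-cell traffic volumes.
    def density(cluster):
        # average traffic volume per grid cell in one cluster
        return sum(cluster) / len(cluster)

    def coverage(cluster, total_cell_count):
        # traffic volume of one cluster relative to the total number of grid cells
        return sum(cluster) / total_cell_count

    def balance(cluster, total_cell_count):
        # proportion of all grid cells that fall into this cluster
        return len(cluster) / total_cell_count

    def total_coverage(clusters, overlap_volume=0.0):
        # sum of covered traffic volumes minus any volume counted twice
        return sum(sum(c) for c in clusters) - overlap_volume

    clusters = [[120.0, 80.0, 40.0], [10.0, 30.0]]      # toy example
    print(density(clusters[0]), total_coverage(clusters))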

6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of the first dataset on which to perform the clustering algorithms: Vector (n, v) represents sequence number n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighboring cells of the grid are merged into a single unit; and RasterP (25 grids) means every five neighboring cells are merged into one. In the latter two formats, the data are laid directly on a grid and some noise, such as outlier values, is eliminated; we selected grid sizes of 16 and 25 for these two formats. The original datasets are encoded in the four data formats, and the formatted data are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.

According to Table 3, KM spent the least running time on all four kinds of data, and the runtime on the RasterP (25 grids) dataset is the fastest. Conversely, clustering the vector dataset with the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the different datasets and DBScan the longest.

In Table 4, we evaluate the log-likelihood of the clusters found by each method, a main metric for quantitatively assessing the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset with the HC method is the best, while clustering RasterP (25 grids) with DBScan is the worst.

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset by expanding the data map via duplication. Running time trends are thereby produced; the results are shown in Table 7 and the corresponding trend lines in Figure 14.
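One way such a duplication-based stress test could be scripted is sketched below in Python, timing scikit-learn's KMeans on progressively tiled copies of the data map. The library, the random placeholder data, and the duplication factors are our assumptions for illustration; they are not the paper's actual setup:

    import time
    import numpy as np
    from sklearn.cluster import KMeans

    grid = np.random.rand(100, 3)            # placeholder (x, y, v) rows
    for factor in (1, 2, 4, 8):              # expand the data map via duplication
        data = np.tile(grid, (factor, 1))
        start = time.perf_counter()
        KMeans(n_clusters=5, n_init=10).fit(data)
        print(len(data), "cells:", time.perf_counter() - start, "s")

Plotting the printed sizes against the timings yields trend lines analogous to Figure 14.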

According to Table 5, KM spent the shortest running time on all four formats of data, and the time on the RasterP (25 grids) dataset is the fastest, which is expected because it abstracts every 25 cells into one. On


Figure 14: Comparison of running time (in seconds) for different sizes of dataset.

the other hand, clustering the Raster dataset with the DBScan method took the most running time. Across the six methods, KM spent the shortest time on the different datasets and DBScan generally the longest.

In Table 6, we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering the Raster dataset with the HC method is the best, while clustering RasterP (25 grids) with KM is the worst.

In Table 7, we can see that the slowest method is DBScan and the quickest is KM. In terms of time trend, DBScan's time consumption grows in larger magnitude than the other methods, whereas the trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: once the dataset size exceeds the amount at the intersection, the EM method becomes a better choice than HC.

The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.

From Figure 15, we can see that one cluster of DBScan dominates with the biggest coverage among all clusters produced by the six methods on the first dataset, whereas for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is noticeably larger than in the first dataset (Tables 8 and 9). This means that the second dataset, thanks to its even data distribution, is well suited for forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.


Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2        | KM       | EM       | DBScan   | XM       | HC       | LP
Cluster 0      | 0.042721 | 0.001777 | 0.450720 | 0.022150 | 0.013153 | 0.165305
Cluster 1      | 0.094175 | 0.086211 | 0.008018 | 0.010064 | 0.026016 | 0.127705
Cluster 2      | 0.328026 | 0.032893 | 0.010517 | 0.126953 | 0.124360 | 0.095597
Cluster 3      | 0.022797 | 0.351221 | 0.000501 | 0.311761 | 0.001172 | 0.089008
Cluster 4      | 0.062281 | 0.101199 | 0.000244 | 0.112973 | 0.304300 | 0.122085
Total coverage | 0.550000 | 0.573301 | 0.470000 | 0.583900 | 0.469000 | 0.599700


Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.


Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.



Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a), we can see that one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, but the LP method obtains the largest total density, drawn evenly from all its clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for forming spatial groups of high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second; DBScan has an advantage in merging scattered data into dense groups as long as the data are well scattered.



Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.

6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in


Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density       | KM      | EM       | DBScan   | XM      | HC      | LP
Cluster 0     | 5258648 | 0.080823 | 4426289  | 3431892 | 2713810 | 1677869
Cluster 1     | 1161390 | 2329182  | 0.994949 | 1375497 | 3501739 | 1296230
Cluster 2     | 7186556 | 2545750  | 0.807500 | 1218667 | 2728017 | 9703279
Cluster 3     | 2572683 | 1232386  | 1062069  | 5171040 | 4265905 | 9034426
Cluster 4     | 5969350 | 142054   | 0.170455 | 1510576 | 4088438 | 1239180
Total density | 1204343 | 1400359  | 4729787  | 1146972 | 1030703 | 6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density       | KM      | XM         | EM         | DBScan     | HC         | LP
Cluster 0     | 1925445 | 2476642081 | 396813638  | 1972394643 | 5323785326 | 331318
Cluster 1     | 1972395 | 1763496208 | 1502698729 | 1972394643 | 2140482869 | 166788
Cluster 2     | 1408149 | 106489095  | 1629795665 | 1437189548 | 1823821619 | 8097989
Cluster 3     | 3060449 | 6293956697 | 2015105986 | 1636350955 | 79912225   | 2474492
Cluster 4     | 1773937 | 1058346213 | 1275299493 | 1212317249 | 6856982634 | 156958
Total density | 3896873 | 3486653421 | 6819713511 | 8230647036 | 5981503534 | 5440447

order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell numbers across clusters. Each factor is assigned a proportional weight ω to adjust the overall evaluation result G_net; the ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the other factors are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω of those factors can be larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered:

G_l = |Likelihood / Time|,    (5)

G_b = Difference of Balance / Time,    (6)

G_d = Density / Time,    (7)

G_c = Coverage / Time,    (8)

G_o = Overlap / Time,    (9)

G_net = ω_l·G_l + ω_b·G_b + ω_d·G_d + ω_c·G_c + ω_o·G_o,    (10)

subject to the constraint ω_l + ω_b + ω_d + ω_c + ω_o = 1.    (11)

From the results of the spatial grouping experiments in the previous sections, we obtain statistics on each group of the second dataset, expressed as the indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods across performance aspects.

In Table 12, the KM method has the best running time and no overlap. XM, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. In order to further verify the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For ease of comparison, G_net is normalized by setting the lowest G_net among the six methods as base value 1 and scaling the G_net of the other methods up accordingly. The comparison result is shown in Table 13.
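The weighted combination in (10) and the base-1 normalization used for Table 13 amount to a few lines of code. The sketch below, in Python, uses hypothetical indicator values purely as placeholders; only the formula, not the numbers, comes from the paper:

    # Sketch of G_net (10) with equal weights and base-1 normalization.
    def g_net(indicators, weights):
        # indicators/weights are dicts keyed by factor name, as in (5)-(9)
        return sum(weights[k] * indicators[k] for k in weights)

    w = {k: 0.2 for k in ('G_l', 'G_b', 'G_d', 'G_c', 'G_o')}   # equal weights, sum to 1
    methods = {                                                  # placeholder values only
        'KM': {'G_l': 1.0, 'G_b': 0.5, 'G_d': 0.8, 'G_c': 0.9, 'G_o': 1.0},
        'LP': {'G_l': 0.9, 'G_b': 1.0, 'G_d': 0.7, 'G_c': 1.0, 'G_o': 1.0},
    }
    scores = {m: g_net(ind, w) for m, ind in methods.items()}
    base = min(scores.values())                                  # lowest method becomes 1
    normalized = {m: s / base for m, s in scores.items()}
    print(normalized)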

According to the experimental results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance; this was tested across different datasets, different formats, and different dataset sizes. However, for density and log-likelihood the result is less consistent, as LP can be outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which express priorities or preferences over the performance aspects, should be chosen at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the collected values indicate how important the data are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for


Table 12: Performance indicators of the six methods based on dataset 2.

Method | Coverage | Density | Time (s) | Log-likelihood | Overlap | Diff. of balance
KM     | 0.595751 | 3896873 | 0.41     | −17.35         | No      | 190
XM     | 0.533037 | 3486653 | 0.67     | −17.22         | No      | 185
EM     | 0.507794 | 6819714 | 1.23     | −16.57         | Yes     | 1216
DBScan | 0.461531 | 8230647 | 15.67    | −17.54         | Yes     | 2517
HC     | 0.677124 | 5981504 | 14.78    | −20.13         | Yes     | 103
LP     | 0.711025 | 5440447 | 7.76     | N/A            | No      | 0

Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods | KM   | XM   | EM   | DBScan | HC   | LP
G_net   | 1.08 | 1.15 | 1.11 | 1.23   | 1.00 | 1.32

purposes such as resource allocation, distribution evaluation, or summarizing the geographical data into groups. The focus of this study was to design efficient methods for identifying optimal spatial groups of certain sizes and positions, using clustering algorithms or their equivalents, so as to obtain maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users with differing usage demands, distributing sensors that monitor the traffic volumes over a city, and planning security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial clustering algorithms were discussed and compared by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The adopted factors were shown to play a significant role in MAUT (multiattribute utility theory); the performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.

The spatial groups obtained by classic clustering algorithms have limitations, such as overlaps, which may cause resources to be wasted and even false grouping. To the best of the authors' knowledge, no study reported in the literature has used the linear programming (LP) method to discover spatial groups and thereby overcome this limitation of overlapping. Thus, in this research, we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.

For future extended study, we want to further enhance the algorithm, for example, by combining the LP method with existing spatial clustering algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. It would be good if, in the fusion algorithms to be developed, the advantages of one algorithm could carry over to the others.

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.
[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.
[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.
[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.
[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.
[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.
[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.
[8] J. Domingo, G. Ayala, and M. E. Diaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.
[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.
[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.
[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.
[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.
[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.
[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.
[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.
[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.
[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.
[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.
[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.
[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.
[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale, event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.
[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.
[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.
[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.



Nevertheless, the spatial groups obtained by classic algorithms have certain limitations: overlaps cannot be controlled, and the maximum coverage by the resultant groups is not guaranteed. Overlaps lead to resource waste and potentially resource mismatch. Beyond spatial clustering, this situation occurs in other fields of application, such as information retrieval (several themes for a single document), biological data (several metabolic functions for one gene), and martial purposes (discovering object-dense regions independently). However, there has been no study reported in the literature, as far as the authors are aware, that applies the LP method to discover spatial groups free of the limitations inherited from clustering algorithms. Thus, this research provides an alternative method for obtaining spatial groups with maximum coverage in a real environment. Maximum coverage in this context is defined as the greatest possible area of effect covered by the spatial groups, with no or minimal overlap among the groups.

3. Spatial Data Representation

Two main categories of spatial data representation exist: spatial data and attribute data. Spatial data means georeferenced data, such as maps, photographs, and satellite imageries. Though these representation techniques originated from GIS, the underlying coding formats apply equally to wireless sensor networks, as long as the sensors are distributed over a wide spatial area. Generally, spatial data represents geographic features by complete and relative locations, while attribute data describes the spatial features by characteristics that can be quantitative and/or qualitative in the real world; attribute data is often referred to as tabular data. In our experiments, we test both types of data models against different clustering algorithms for a thorough investigation.

3.1. Spatial Data Model. In the early days, spatial data was stored and represented in map format. There are three fundamental types of spatial data models for recording geographic data digitally: vector, raster, and image.

Figure 1 illustrates the encoding techniques of the two important spatial data models [19], raster and vector, over a sample aerial image of the Adriatic Sea and coast in Italy. The image type of encoding is very similar to raster data in terms of usability, but its internal format limits modeling and analysis of the data: images represent photographs or pictures of the landscape as a coarse matrix of pixel values.

3.2. Vector Data Model. The three aforementioned spatial data models are used to store geographic locations with spatial features in a dataset. The vector data model uses x, y coordinates to define the locations of features, marking points, lines, areas, or polygons; vector data thus define centers, edges, and outlines of features. A feature is characterized by linear segments using sequential points or vertices, where a vertex consists of a pair of x and y coordinates. The beginning or ending node is defined in each vertex with an arc segment. A single coordinate pair of vertices defines a feature point, and a group of coordinate pairs defines polygonal features. In vector representation, the storage of the vertices for each feature is important, as is the connectivity between features, that is, the sharing of common vertices where features connect.

By using same-size polygons, we divide a complete map into small units based on the character of our database, each represented as (x, y, v), where x and y form the coordinate pair of the referenced spatial position and v represents something of interest, simply called a "feature," which could be a military target, a critical resource, or an inhabitant clan, for example. The greater the v, the more valuable the feature. In spatial grouping for maximum coverage, we opt to include the features that amount to the highest total value. A sample of the vector format referencing a 2D spatial location is shown in Figure 2 [19].

3.3. Raster Data Model. Raster data models make use of a grid of squares to define where features are located. These squares, also called pixels or cells, are typically of uniform size.

From our dataset, we separate the whole image by imposing a grid on it, hence producing many individual features, with one feature corresponding to each cell. We use the raster data model to represent the dataset and store the features in the following encoding formats (see the sketch after this list):

(1) Raster data are stored as an ordered list of cell values in pairs (i, v), where i is the sequential index of a cell and v is the value of the i-th feature, for example, (1, 80), (2, 80), (3, 74), (4, 62), (5, 45), and so on, as shown in Figure 3.

(2) Raster data are stored as points (x, y, v), with x and y as position coordinates locating the corresponding spatial feature with value v, for example, (1, 1, 513), (1, 2, 514), (1, 3, 517), (2, 1, 512), (2, 2, 515), and so on, as shown in Figure 4. In this case, the value v refers to the center point of the grid cell. This encoding is useful for representing values measured at the center point of a cell, for example, a raster of elevations.

(3) During the experiments, the grid size is transformed for efficient operation, putting i² cells together as one unit representing one new grid cell, as shown in Figure 5.
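The two raster encodings and the block-merging transformation above are easy to express in code. The following Python/NumPy sketch uses a toy value grid (the library choice and variable names are ours, for illustration only):

    import numpy as np

    values = np.array([[80, 74, 62, 45],      # a toy value grid, as in Figure 3
                       [80, 74, 74, 62],
                       [74, 74, 62, 62],
                       [62, 62, 45, 45]])

    # Encoding 1: ordered list of (i, v) pairs, i being the sequential cell index.
    ordered_list = [(i + 1, v) for i, v in enumerate(values.ravel())]

    # Encoding 2: center-point triples (x, y, v).
    points = [(x + 1, y + 1, int(values[x, y]))
              for x in range(values.shape[0]) for y in range(values.shape[1])]

    # Grid-size transformation: merge each i-by-i block of cells into one unit (here i = 2).
    i = 2
    h, w = (values.shape[0] // i) * i, (values.shape[1] // i) * i
    blocks = values[:h, :w].reshape(h // i, i, w // i, i).sum(axis=(1, 3))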

In particular, the quadtree data structure is found to be a useful alternative encoding method for the raster data model. Raster embraces digital aerial photographs, imagery from satellites, digital pictures, and even scanned maps. Details on how different sorts of objects, such as points, lines, polygons, and terrain, are represented by the data models can be found in [19–21].

4. Proposed Methodology

The aim of the methodology is to determine a certain number of clusters and their corresponding locations from some


Figure 1: Representation of how a real-world spatial area is represented by the vector and raster encoding formats.


Figure 2: Vector format (the example point shown has x, y coordinates (9, 3)).


Figure 3: Raster format in an ordered list.

collected spatial data. In this process, different methods are tested, and the one that covers the most area as well as the highest feature values in the suggested clusters is chosen. The flow of this process, including the preprocessing of sensor data,


Figure 4: Raster data with center points.


Figure 5: Raster format with 2² and 3² grids.

data transformation, clustering, and finding the cluster center points, is shown in Figure 6.

In the case of a satellite image, or an image captured by a fighter jet or another surveillance camera, image processing is needed to


Figure 6: Workflow of the proposed methodology (load spatial image → gray image → skeleton extraction via a morphological operation in MATLAB (bwmorph), with Zhang's algorithm used for comparison → two-tone image → gridding/indexing into 2D spatial data → numerical dataset with normalization → spatial grouping by hierarchical, K-means, DBScan, and LP methods → colored output map).

extract the density information from the pictures. But in our case of a sensor network, we can safely assume that the data fed from a net of sensors carry the sensor IDs. The sensor IDs are known, and so are their positions; from the locations of the sensors and their IDs, we can relate the collected data to their corresponding locations in x-y coordinates (assuming the terrain is 2D). In order to reduce the huge amount of calculation and storage space, a grid is used to divide the whole map into smaller pieces. The grid indexing operation is repeated for a range of coarser layers, thereby providing data partitions at different resolutions; a similar technique, computed by Euclidean distance, is reported in [22]. The method of grid indexing thus separates data into cells based on their geographic locations.
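A minimal sketch of such grid indexing in Python is given below: sensor readings are binned into cells by integer division of their coordinates, and repeating the operation with different cell sizes yields the coarser and finer layers mentioned above. The function and variable names, and the sample readings, are hypothetical:

    from collections import defaultdict

    def grid_index(readings, cell_size):
        # Sum the sensed values per grid cell; readings are (x, y, v) triples.
        cells = defaultdict(float)
        for x, y, v in readings:
            cells[(int(x // cell_size), int(y // cell_size))] += v
        return cells

    readings = [(3.2, 7.9, 120.0), (3.6, 7.1, 80.0), (9.4, 1.5, 40.0)]  # illustrative
    coarse = grid_index(readings, cell_size=5.0)   # a coarser resolution layer
    fine = grid_index(readings, cell_size=1.0)     # a finer resolution layer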

To obtain good spatial groups for maximum coverage, together with the corresponding cluster center points under certain constraints, this research adopts several popular clustering methods and the linear programming method, using software programs such as XLMiner (http://www.solver.com/xlminer-data-mining), MATLAB (http://www.mathworks.com/products/matlab), and Weka (http://www.cs.waikato.ac.nz/ml/weka).

The core purpose of cluster analysis is to comprehend and to distinguish the extent of similarity or dissimilarity among the independently clustered objects. Five major clustering methods are considered: KM, EM, XM, HC, and DBScan.

K-means (KM), by MacQueen, 1967, is one of the simplest algorithms that solve the well-known clustering problem [23]. It is an easy and simple method that divides a dataset into a certain number of clusters, the main idea being that the number of clusters k is fixed a priori. The random choice of the initial locations of the centroids leads to varying results; a better choice is to place them as far away from each other as possible.

The KM algorithm aims at minimizing an objective function, in this case a squared error function:

J = Σ_{j=1}^{k} Σ_{i=1}^{n} ‖x_i^{(j)} − c_j‖²,    (1)

where j ranges from 1 to k, i ranges from 1 to n, and ‖x_i^{(j)} − c_j‖² is a chosen distance measure between a data point x_i^{(j)} and the cluster center c_j; it indicates the distance of the n data points from their respective cluster centers. The sum of squared Euclidean distances from the mean of each cluster is the usual measure, penalizing scattering in all directions within a cluster, which makes it a suitable objective for the KM algorithm. Clusters are often computed using a fast heuristic method, which generally produces good (but not necessarily optimal) solutions.
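For readers who wish to reproduce this step outside Weka or XLMiner, a minimal run of KM in Python with scikit-learn (our choice of library here, not the paper's toolset; the data below are random placeholders) reports the value of objective (1) via the model's inertia:

    import numpy as np
    from sklearn.cluster import KMeans

    data = np.random.rand(200, 3)            # placeholder (x, y, v) rows
    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(data)
    labels = km.labels_                      # cluster assignment per point
    centers = km.cluster_centers_            # the c_j of equation (1)
    print(km.inertia_)                       # sum of squared distances, i.e., J in (1)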

X-means (XM) [24] is a variant of KM that improves the structure part of the algorithm: splitting of the centers is attempted within each region, and the decision between the parent and the children of each center is made by comparing the two structures. Another improved variant of KM, called EM (expectation maximization), assigns to each point a probability distribution over the clusters. The number of clusters to set up is decided by EM using cross-validation.
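scikit-learn has no X-means implementation, but EM-style clustering can be sketched with a Gaussian mixture; selecting the component count by the BIC score below is a stand-in for the cross-validation mentioned above, not the paper's exact procedure:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    data = np.random.rand(200, 3)            # placeholder (x, y, v) rows
    # Fit mixtures of several sizes and keep the one with the lowest BIC.
    models = [GaussianMixture(n_components=k, random_state=0).fit(data)
              for k in range(2, 8)]
    best = min(models, key=lambda m: m.bic(data))
    labels = best.predict(data)              # hard assignment from soft probabilities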

Density-based algorithms regard clusters as dense areas of objects separated by less dense areas [25]. Because they are not limited to clusters of spherical shape, they can produce clusters of arbitrary shape. DBScan is a typical implementation of density-based algorithms, called density-based spatial clustering of applications with noise



Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.

[25]. The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density-connected to a particular center; any object that falls beyond a cluster is considered noise.
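A minimal DBScan run in Python with scikit-learn is sketched below; the density thresholds eps and min_samples are illustrative and would need tuning to the actual data:

    import numpy as np
    from sklearn.cluster import DBSCAN

    data = np.random.rand(200, 2)                    # placeholder (x, y) positions
    db = DBSCAN(eps=0.08, min_samples=4).fit(data)   # density-reachability parameters
    labels = db.labels_                              # label -1 marks noise outside all clusters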

Ward proposed a clustering method called hierarchical clustering (HC) in 1963 [27]. It seeks to form the partitions P_n, P_{n−1}, ..., P_1 in a way that minimizes the loss associated with each grouping. At each step, the analysis considers every possible pair of clusters and combines the two clusters whose merger gives the smallest increase in "information loss," which Ward defined in terms of an error sum-of-squares (ESS) criterion. The idea behind Ward's proposal can be described most simply with a small numerical example. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0). Treating the ten scores as a single group with mean 3.4, the information loss is ESS_one group = (2 − 3.4)² + (7 − 3.4)² + ⋯ + (0 − 3.4)² = 70.4. However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. Evaluating the ESS within each of these four groups gives four independent error sums of squares, each equal to zero. Overall, dividing the 10 objects into these 4 clusters incurs no loss of information:

ESS_four groups = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.    (2)
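The arithmetic of this example can be checked directly, and scipy's Ward linkage recovers the same four groups; a brief sketch (library choice and helper names are ours):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    scores = np.array([2., 7., 6., 6., 7., 2., 2., 0., 2., 0.])

    def ess(group):
        # error sum of squares around the group mean
        return float(np.sum((group - group.mean()) ** 2))

    print(ess(scores))                                   # 70.4 for the single group
    Z = linkage(scores.reshape(-1, 1), method='ward')    # Ward's minimum-variance merges
    labels = fcluster(Z, t=4, criterion='maxclust')      # cut the tree into four groups
    print(sum(ess(scores[labels == k]) for k in set(labels)))   # 0.0: no information loss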

The last method we adopt here is linear programming (LP), which consists of formulating and solving optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many options are possible in the answers. In spatial grouping over a large grid, many possible combinations of cluster positions exist. The problem here is to find a certain number of equal-size clusters over the area, while the chosen cluster centers must be sufficiently far apart from each other to avoid overlapping. As an example, shown in Figure 7, three clusters have to be assigned over a spatial area in such a way that they cover certain resources, and the assignment has to yield the maximum total value summed from the covered resources. In the example, the left diagram shows allocating three clusters over the deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, clusters the high-altitude parts of the area. The last example tries to cover the maximum number of human inhabitants, who are concentrated at the coves. Given the many possible ways of setting up these clusters, LP is used to formulate this allocation problem with the objective of maximizing the values of the covered resources.

Assuming that the resources can be dynamic, for example, animal herds or moving targets whose positions swarm and change over time, the optimization is a typical maximal flow problem (or max-flow problem). This type of network flow problem determines the maximum amount of flow that can occur over an arc, limited by some capacity restriction. Such a network might be used to model the flow of oil in a pipeline (where the amount of oil that can flow through a pipe per unit time is limited by the pipe's diameter); traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by lane counts and speed limits [28].

For our spatial clustering, we consider each cell of the grid as a node. Each node is defined as a tuple m containing the coordinates and the value of the resource held in the node, such that m_i = (x_i, y_i, z_i) represents the i-th node, where (x_i, y_i) is the position and z_i the value of the resource at that node. For the clusters, each node


(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)     Repeat (through all coordinates of y)
(4)         If (boundary constraints and overlapping constraints are satisfied) Then
(5)             S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)         End-if
(7)     End-loop
(8) End-loop
(9) If sizeof(S) >= K Then
(10)    Find the top K clusters maximizing the sum of z_i over each C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i) for all i in C_k
(11) Else
(12)    C(x_i, y_i, z_i) = S(x_i, y_i, z_i) for all i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.

can potentially be the center of a cluster, and each cluster has a fixed radius of length r. The LP model for our problem is stated mathematically as

Total Value = max Σ_{k=1}^{K} Σ_{m_i ∈ C_k} z_i,    (3)

maximized over the placement of the cluster centers within 0 ≤ x ≤ X and 0 ≤ y ≤ Y, subject to the boundary constraints 2r ≤ |x_i − x_j| and 2r ≤ |y_i − y_j| for all cluster centers i and j with i ≠ j, where X is the maximum width and Y the maximum length of the 2D spatial area, respectively, K is the maximum number of clusters, and C_k is the k-th cluster under consideration in the optimization.

In order to implement the computation depicted in (3), for each node we sum the resources of the group in the shape of a diamond (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each node in a combination is tested as the center of a cluster of radius r, and the resource values of the nodes in the potential clusters are stored in a temporary array buffer A(*, *, z_i). The results from those potential clusters that satisfy the boundary and nonoverlapping constraints are then copied to a candidate buffer S. Out of the clusters whose resource values are stored in S, the combination of K clusters with the greatest total resource value is selected, and its values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
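A small, brute-force Python rendering of this search is sketched below. It enumerates every combination of K candidate centers, keeps those meeting the stated separation constraints, and returns the combination covering the greatest value; the helper names are hypothetical, and the exhaustive enumeration is only practical for small grids:

    from itertools import combinations

    def total_value(grid, centers, r):
        # Sum the resource values covered by diamond-shaped clusters of radius r.
        covered = set()
        for cx, cy in centers:
            for x in range(len(grid)):
                for y in range(len(grid[0])):
                    if abs(x - cx) + abs(y - cy) <= r:   # diamond approximates a circle
                        covered.add((x, y))
        return sum(grid[x][y] for x, y in covered)

    def best_clusters(grid, k, r):
        # Exhaustively pick k centers satisfying 2r <= |dx| and 2r <= |dy| pairwise.
        cells = [(x, y) for x in range(len(grid)) for y in range(len(grid[0]))]
        feasible = (c for c in combinations(cells, k)
                    if all(abs(a[0] - b[0]) >= 2 * r and abs(a[1] - b[1]) >= 2 * r
                           for a, b in combinations(c, 2)))
        return max(feasible, key=lambda c: total_value(grid, c, r), default=None)

    grid = [[0, 2, 1], [3, 0, 0], [1, 4, 0]]             # toy resource grid
    print(best_clusters(grid, k=2, r=1))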

5. Experimental Results and Analysis

In this section, the performance of the proposed methodology is shown through numerical and visualized results covering all performance aspects of the various algorithms. A case study of road traffic is used in the experiment: the spatial area is a metropolitan traffic map with roads and streets spanning all over the place, and the resource value in this case is the concentration, or density, of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point

Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                     | Bwmorph function      | Thinning algorithm
                     | Dataset 1 | Dataset 2 | Dataset 1 | Dataset 2
Degree of thinning   | Incomplete            | Complete
Elapsed time (secs)  | 20        | 38        | 100       | 198
Complexity           | O(n)                  | O(n²)

of the roads, whereby a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.

5.1. Data Preprocessing. Two factual datasets are used in the experiments. The first dataset is a traffic volume map published by the Maricopa Association of Governments in 2008; its traffic volumes were derived from the national traffic recording devices, with seasonal variation factored into the volumes. The second dataset is the annual average daily traffic from the Baltimore County Traffic Volume Map of 2011, USA, prepared by the Maryland Department of Transportation and published on March 19, 2012; its traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.

After skeleton extraction, a two-tone image is obtained from each original map. Readers are referred to the respective websites hosting the traffic volume data associated with our two datasets: (a) the representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (http://phoenix.gov/streets/trafficvolume-map), and (b) the representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (http://www.marylandroads.com/Traffic_Volume_Maps/Traffic_Volume_Maps.pdf). The corresponding results of skeleton extraction


Table 2: Important statistics from the clustering and LP experiments.

Method | Cluster   | Number of cells covered | Minimum | Maximum  | Overlap
KM     | Cluster 1 | 428  | 0       | 3499327  | 0
KM     | Cluster 2 | 468  | 0       | 546896   | 0
KM     | Cluster 3 | 448  | 0       | 20503007 | 0
KM     | Cluster 4 | 614  | 0       | 6894667  | 0
KM     | Cluster 5 | 618  | 0       | 900908   | 0
XM     | Cluster 1 | 615  | 0       | 591265   | 0
XM     | Cluster 2 | 457  | 0       | 546896   | 0
XM     | Cluster 3 | 609  | 0       | 900908   | 0
XM     | Cluster 4 | 465  | 0       | 3499327  | 0
XM     | Cluster 5 | 430  | 0       | 20503007 | 0
EM     | Cluster 1 | 1223 | 0       | 2292     | 61817229
EM     | Cluster 2 | 7    | 141048  | 243705   | 313018
EM     | Cluster 3 | 81   | 0       | 3033733  | 131146577
EM     | Cluster 4 | 64   | 26752   | 546896   | 330881249
EM     | Cluster 5 | 1201 | 0       | 1300026  | 217950471
DB     | Cluster 1 | 13   | 23614   | 33146    | 327222911
DB     | Cluster 2 | 11   | 1686825 | 21001    | 363965818
DB     | Cluster 3 | 13   | 178888  | 2945283  | 196118393
DB     | Cluster 4 | 11   | 847733  | 211008   | 58940877
DB     | Cluster 5 | 2528 | 0       | 546896   | 20554176
HC     | Cluster 1 | 291  | 0       | 3499327  | 0
HC     | Cluster 2 | 191  | 0       | 20503007 | 96762283
HC     | Cluster 3 | 294  | 0       | 1590971  | 0
HC     | Cluster 4 | 224  | 0       | 189812   | 12673555
HC     | Cluster 5 | 243  | 0       | 546896   | 0
LP     | Cluster 1 | 221  | 0       | 3499327  | 0
LP     | Cluster 2 | 221  | 0       | 20503007 | 0
LP     | Cluster 3 | 221  | 0       | 1590971  | 0
LP     | Cluster 4 | 221  | 0       | 189812   | 0
LP     | Cluster 5 | 221  | 0       | 546896   | 0

Table 3: Comparison of running time (in seconds) for the first dataset.

Formats            | KM   | HC    | DBScan | XM   | EM   | LP
Vector database    | 3.27 | 12.52 | 23.24  | 2.78 | 9.30 | 1.83
Raster database    | 3.42 | 15.36 | 28.20  | 2.84 | 9.84 | 2.01
RasterP (16 grids) | 1.98 | 1.34  | 5.08   | 0.46 | 0.57 | 0.78
RasterP (25 grids) | 0.09 | 0.14  | 1.15   | 0.21 | 0.12 | 0.53

for dataset 1 are shown in Figure 8, where (a) is produced by a morphological operation method and (b) by the thinning algorithm. Likewise, the corresponding skeleton extraction results for the second dataset are shown in Figure 9, with (a) from the morphological operation method and (b) from the thinning algorithm. A comparison of the two methods on the two datasets is given in Table 1.

For each raw dataset, we first perform image preprocessing to obtain a numerical database.

The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing; the clustering by grid can then readily be obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer nested iteration in the program code.
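Outside MATLAB, this preprocessing step can be approximated in Python with scikit-image, whose skeletonize routine performs Zhang-style thinning on 2D binary images; the toy binary map below stands in for an actual binarized road map:

    import numpy as np
    from skimage.morphology import skeletonize

    two_tone = np.zeros((64, 64), dtype=bool)   # placeholder two-tone (binary) road map
    two_tone[30:34, 4:60] = True                # a thick horizontal "road"
    skeleton = skeletonize(two_tone)            # one-pixel-wide centerlines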

The placement of the grid on the image follows one principle: the mesh lines should not fall on concentrated positions of traffic flow. Since there is no endpoint, the midpoint between two adjacent values is taken as a demarcation point. Under this scheme, the traffic flow in each grid cell is calculated and stored digitally in an Excel file; this digital data for the traffic map serves as the initial data for the subsequent clustering process.

5.2. Comparison Results of KM and HC Clustering. In XLMiner, two methods, KM and HC, were used to perform clustering. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum number of iterations at 100; the initial centroids were chosen randomly at the start. Furthermore,


Table 4: Comparison of log-likelihood for the first dataset.

Formats            | KM        | HC        | DBScan    | XM        | EM
Vector database    | −12.41868 | −14.07265 | −13.28599 | −11.9533  | −12.49562
Raster database    | −13.42238 | −15.02863 | −13.78889 | −12.9632  | −13.39769
RasterP (16 grids) | −12.62264 | −14.02266 | −12.48583 | −12.39419 | −12.44993
RasterP (25 grids) | −12.41868 | −13.19417 | −11.22207 | −12.48201 | −11.62048


Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.


Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.

Table 5: Comparison of running time (in seconds) for the second dataset.

Formats            | KM   | HC    | DBScan | XM   | EM    | LP
Vector database    | 1.39 | 1.34  | 15.53  | 1.53 | 10.05 | 3.37
Raster database    | 2.41 | 14.78 | 18.34  | 2.17 | 8.23  | 1.96
RasterP (16 grids) | 0.47 | 8.01  | 12.74  | 0.45 | 3.77  | 1.44
RasterP (25 grids) | 0.35 | 6.20  | 10.98  | 0.36 | 2.96  | 1.18

the weights of the three attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)) can be varied (fine-tuned), subject to the weights summing to 1. We tested several variations in search of the best clustering results (see the sketch after this list): (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) weight of v is 0; (8) the same weights except when g_i(v_i = 0); and (9) weights of x and y both 0 except when g_i(v_i = 0).
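Such attribute weighting can be emulated in Python by rescaling the normalized columns before clustering: multiplying column j by sqrt(w_j) makes squared Euclidean distances, and hence KM's objective, weigh that column by w_j. The weights and the random placeholder data below are illustrative only:

    import numpy as np
    from sklearn.cluster import KMeans

    data = np.random.rand(300, 3)              # normalized (x, y, v) columns
    weights = np.array([0.3, 0.3, 0.4])        # e.g., case (2): weight of v is 40%
    # sqrt because squared distances scale with the square of the column scale
    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(data * np.sqrt(weights))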

In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted for raw numeric data, while the other two options, Jaccard's coefficient and the matching coefficient, are activated only when the data is binary.

Of the above nine cases, the results of cases (1) to (6) are similar within each method, and the result of case (9) is the worst, failing to accomplish any clustering. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.

Regarding the distribution of clusters in the KM results, more than half of the data points are clamped into one oversized cluster; the result of this method is therefore not helpful for further operation. With the HC method, the data are on average allocated into separate clusters, as Figure 10 shows.


Table 6: Comparison of log-likelihood for the second dataset.

Formats            | KM        | HC        | DBScan    | XM        | EM
Vector database    | −17.35412 | −19.62367 | −17.53576 | −17.21513 | −16.57263
Raster database    | −18.15926 | −20.12568 | −19.70756 | −18.15791 | −18.48209
RasterP (16 grids) | −15.51437 | −17.24736 | −16.37147 | −17.01283 | −15.66231
RasterP (25 grids) | −14.84761 | −16.63789 | −15.09146 | −16.67312 | −16.47823


Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%: the top half uses the KM clustering method and the bottom half the HC method. (b) Clustering results with setting case (3), where the weight of v is 50%: top half KM, bottom half HC. (c) Clustering results with setting case (7), where the weight of v is 0: top half KM, bottom half HC. (d) Clustering results with setting case (8), where all attributes share the same weight except g_i(v_i = 0): top half KM, bottom half HC.

Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size KM HC DBScan XM EM LP100 grid cells 006 007 105 219 321 0184600 grid cells 042 295 3989 273 1905 93710000 grid cells 262 4667 9755 297 3785 242180000 grid cells 1975 18961 684 647 19831 9083

The result in Figure 10(c) is the best, being the only one with distinct position attributes ($x$ and $y$). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; allocating critical resources to each cluster there may therefore waste resources. The degree of overlap is least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.

The clustering results of the second dataset obtained by the two methods, KM and HC, are shown in Figure 11.

From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than in the first dataset. There is also no overlap phenomenon in the KM results, a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as its clusters tend to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two clustering methods with respect to even cluster distribution and overlap avoidance.

5.3. Results of Grouping. In this part, we compare the colored maps of the Raster $(x, y, v)$ data model for the two datasets, using five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen at five.


Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.12436   0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.3043    0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469     0.711025

Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.

The result for the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result; the second part (ii) visualizes the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.
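The center computation itself is straightforward; a sketch (with a hypothetical helper and toy data, not taken from the paper) is:

```python
# Sketch: after clustering, each group's center is taken as the mean (x, y)
# of its member cells; groups are then drawn over the map at these centers.
import numpy as np

def group_centers(xy, labels):
    """Return {cluster id: mean (x, y)} for clustered grid cells."""
    return {k: xy[labels == k].mean(axis=0) for k in np.unique(labels)}

xy = np.array([[0, 0], [0, 1], [5, 5], [6, 5]], dtype=float)
labels = np.array([0, 0, 1, 1])
print(group_centers(xy, labels))   # {0: [0.0, 0.5], 1: [5.5, 5.0]}
```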

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure; the corresponding groups exhibit the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the cluster sizes are similar to each other, and there is no overlap in the clustering result; in the group results, however, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation also arises in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and the corresponding groups, (b) KM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.

With the same experiment setup and operating environment, the spatial clustering experiments are performed over the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) represents the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.

In Figures 13(a) and 13(e), it is noticeable that the clusters are imbalanced and that there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM, however, avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we remove the empty cells on the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap, and the clusters are balanced with each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while also keeping the groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
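For reference, a greatly simplified sketch of this selection step is given below. It follows the paper's grid model (diamond-shaped groups of a fixed radius r around candidate centers, no two groups overlapping) but replaces the full optimization with a greedy pass, so it is an approximation under our own assumptions rather than the actual LP implementation:

```python
# Greedy sketch of selecting K equal-size, non-overlapping diamond groups
# that cover as much traffic volume as possible; non-overlap is enforced by
# keeping centers more than 2r apart in L1 distance.
import numpy as np

def diamond_sum(v, cx, cy, r):
    """Total volume inside the diamond |x - cx| + |y - cy| <= r."""
    X, Y = np.indices(v.shape)
    return v[np.abs(X - cx) + np.abs(Y - cy) <= r].sum()

def pick_groups(v, k=5, r=3):
    cand = sorted(((diamond_sum(v, x, y, r), x, y)
                   for x in range(v.shape[0]) for y in range(v.shape[1])),
                  reverse=True)                       # best-covering centers first
    chosen = []
    for s, x, y in cand:
        if all(abs(x - cx) + abs(y - cy) > 2 * r for _, cx, cy in chosen):
            chosen.append((s, x, y))                  # keep only non-overlapping groups
        if len(chosen) == k:
            break
    return chosen

volumes = np.random.default_rng(1).integers(0, 100, size=(30, 30)).astype(float)
for s, x, y in pick_groups(volumes):
    print(f"center=({x},{y}), covered volume={s:.0f}")
```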

Visually comparing the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. In the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near one crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area; as seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those in the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected and shown in Table 2.


Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.

The numeric results in Table 3 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the number of cells covered by the clusters; the amount of overlap in HC is also the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.

6. Technical Analysis of Clustering Results

6.1. Experimental Evaluation Method. For the purpose of assessing the quality of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion, using the same software on the same computer. Balance measures the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness of fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster.


Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial groups by the LP method on dataset 2.


Coverage of a cluster means the proportion of traffic volumes covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of the traffic volumes covered by all the clusters minus the overlap, if any. The corresponding definitions are shown in the equations below:

$$\text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)},$$

$$\text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Traffic Volumes}},$$

$$\text{Total Coverage} = \sum_{i} \text{Coverage}(\text{cluster } i) - \text{Overlaps},$$

$$\text{Proportion of Cluster } i \text{ Size (Balance)} = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}}. \quad (4)$$
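These factors are simple to compute once the cluster memberships are known. The following sketch (hypothetical helper name and toy data, assuming clusters are given as lists of cell indices; not the authors' code) evaluates the density, coverage, and balance of each cluster plus the overlap-corrected total coverage of (4):

```python
# Sketch of the evaluation factors of (4) for one clustering result.
import numpy as np

def evaluate(volumes, members):
    """volumes: 1D array of per-cell traffic volumes.
    members: list of index arrays, one per cluster (clusters may overlap)."""
    total_v = volumes.sum()
    for k, idx in enumerate(members):
        density = volumes[idx].sum() / len(idx)    # average volume per grid cell
        coverage = volumes[idx].sum() / total_v    # proportion of all traffic volume
        balance = len(idx) / volumes.size          # proportion of all grid cells
        print(f"cluster {k}: density={density:.2f} "
              f"coverage={coverage:.4f} balance={balance:.2f}")
    counts = np.bincount(np.concatenate(members), minlength=volumes.size)
    overlap = (volumes * np.maximum(counts - 1, 0)).sum() / total_v  # double-counted volume
    covered = sum(volumes[idx].sum() for idx in members) / total_v
    print(f"total coverage = {covered - overlap:.4f}")

volumes = np.array([80., 74., 62., 45., 34., 39., 56., 30.])
evaluate(volumes, [np.array([0, 1, 2]), np.array([2, 3, 4]), np.array([5, 6, 7])])
```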

6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of the first dataset on which to perform the clustering algorithms. Vector $(n, v)$ represents sequence $n$ and traffic volume $v$; Raster $(x, y, v)$ represents coordinates $(x, y)$ and traffic volume $v$; RasterP (16 grids) means every four neighboring cells of the grid are merged into a single unit; and RasterP (25 grids) means every five neighboring cells are merged into one. In the latter two formats, the data are laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated from the grid; we selected grid sizes of 16 and 25 for the two formats. The original datasets are then encoded in the four data formats, and the four formatted datasets are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.
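The RasterP formats can be produced by simple block aggregation. The sketch below reflects our own reading, under the assumption that "merged into a single unit" means summing square blocks of cells (4 × 4 = 16 or 5 × 5 = 25 cells per unit); the paper does not spell out the block shape:

```python
# Sketch of RasterP-style coarsening, assuming b x b blocks (4 x 4 for
# "16 grids", 5 x 5 for "25 grids") are merged by summing their volumes.
import numpy as np

def rasterp(v, b):
    """Merge b x b neighborhoods of raster v (shape divisible by b)."""
    h, w = v.shape
    return v.reshape(h // b, b, w // b, b).sum(axis=(1, 3))

v = np.arange(100.0).reshape(10, 10)   # toy 10 x 10 raster of volumes
print(rasterp(v, 5).shape)             # (2, 2): each cell stands for 25 cells
```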

According to Table 3, KM spent the least running time on all four kinds of data, and the run time on the RasterP (25 grids) dataset is the shortest. Contrariwise, clustering the vector dataset with the DBScan method took the longest running time. Among the clustering methods, KM spent the least time on each dataset and DBScan the most.

In Table 4, we evaluate the log-likelihood of the clusters found by each method, a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best, while clustering RasterP (25 grids) using DBScan is the worst.

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset by expanding the data map via duplication. Running time trends are thereby produced; the results are shown in Table 7, and the corresponding trend lines in Figure 14.
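The duplication itself can be as simple as tiling the grid; a short sketch of how such enlarged variants might be produced (our assumption of the exact procedure) is:

```python
# Sketch: enlarge a data map by duplication (tiling) for stress testing.
import numpy as np

base = np.arange(100.0).reshape(10, 10)   # 100 grid cells
big = np.tile(base, (10, 10))             # 100 x 100 = 10000 grid cells
print(base.size, big.size)                # 100 10000
```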

According to Table 5, KM spent the shortest running time on the four different formats of data, and the time on the RasterP (25 grids) dataset is the shortest, which is expected because it abstracts every 25 cells into one.

Figure 14: Comparison of running time (in seconds) for different sizes of dataset (K-means, Hierarchical, DBScan, XMean, EM, and LP, each with an exponential trend line).

On the other hand, clustering the Raster dataset with the DBScan method took the most running time. Across the six methods, KM spent the shortest time on each dataset and DBScan generally the longest.

In Table 6, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method gives the best value, while clustering RasterP (25 grids) using KM gives the worst.

In Table 7, we can see that the slowest method is DBScan and the quickest is KM. In terms of time trend, DBScan's time consumption grows in larger magnitude than the other methods', while the time trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: once the dataset size exceeds the amount at the intersection, the EM method becomes a better choice than HC.

The following charts and tables present the other technical indicators, such as the coverage, density, and balance of each cluster, for the two datasets.

From Figure 15, we can see that one cluster of DBScan takes the biggest coverage among all clusters produced by the six methods on the first dataset, while for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is apparently larger than in the first dataset (Tables 8 and 9). This means that the second dataset, with its even data distribution, is well suited for achieving spatial groups with the six methods. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.


Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700

Figure 15: (a) Coverage of each cluster (clusters 0–4 and total coverage) by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.

Figure 16: (a) Density of each cluster (clusters 0–4 and total density) by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.


Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a), we can see that one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, but the LP method obtains the largest total density, drawn evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset; DBScan has the advantage of merging scattered data into dense groups as long as the data are well scattered.


Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method can achieve absolutely balanced spatial groups.

6.3. Discussion of $G_{net}$. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined.


Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447

They are used to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell numbers among the clusters. We assign each factor a proportional weight $\omega$ to adjust the evaluation result $G_{net}$; the $\omega$ values are to be tuned by the users depending on their interests. For example, if a very wide coverage is of priority and the others are of less concern, $\omega_c$ can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights $\omega$ for those factors can be larger than the others. Overall, $G_{net}$, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered:

$$G_l = \left|\frac{\text{Log-likelihood}}{\text{Time}}\right|, \quad (5)$$

$$G_b = \frac{\text{Difference of Balance}}{\text{Time}}, \quad (6)$$

$$G_d = \frac{\text{Density}}{\text{Time}}, \quad (7)$$

$$G_c = \frac{\text{Coverage}}{\text{Time}}, \quad (8)$$

$$G_o = \frac{\text{Overlap}}{\text{Time}}, \quad (9)$$

$$G_{net} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o, \quad (10)$$

$$\text{subject to } \omega_l + \omega_b + \omega_d + \omega_c + \omega_o = 1. \quad (11)$$
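A direct transcription of (5)–(11) into code is given below (a sketch with assumed equal weights and with an overlap of 0 encoding "No"; the printed raw score is illustrative only and is not normalized the way Table 13 is):

```python
# Sketch of the net indicator G_net of (10): each indicator of (5)-(9) is a
# measured factor divided by running time, and the weights omega must sum
# to 1, as required by constraint (11).
def g_net(loglik, balance_diff, density, coverage, overlap, time,
          w=(0.2, 0.2, 0.2, 0.2, 0.2)):
    assert abs(sum(w) - 1.0) < 1e-9            # constraint (11)
    g_l = abs(loglik / time)                   # (5)
    g_b = balance_diff / time                  # (6)
    g_d = density / time                       # (7)
    g_c = coverage / time                      # (8)
    g_o = overlap / time                       # (9)
    wl, wb, wd, wc, wo = w
    return wl * g_l + wb * g_b + wd * g_d + wc * g_c + wo * g_o

# Raw (unnormalized) score for KM's row of Table 12, with overlap "No" -> 0:
print(g_net(loglik=-17.35, balance_diff=190, density=3896873,
            coverage=0.595751, overlap=0, time=0.41))
```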

From the results of the spatial grouping experiments in the previous sections, we obtain some statistical information on each group based on the second dataset, expressed as the range of indicators depicted in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.

In Table 12, the KM method has the best running time and no overlap, while the XM, DBScan, and HC methods demonstrate their respective advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance among its clusters. In order to further verify the correctness of the above analysis, the performance indicators $G_l$, $G_b$, $G_d$, $G_c$, and $G_o$ are computed to obtain the net performance value $G_{net}$ for each method, assuming equal weights. For the sake of easy comparison, $G_{net}$ is normalized by first setting the lowest $G_{net}$ among the six methods as base value 1; the $G_{net}$ of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.

According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This has been tested across different datasets, different formats, and different dataset sizes. However, for density and log-likelihood the result is not so consistent, as LP can be outperformed by DBScan at times. Finally, by the net result of $G_{net}$, LP is the better choice under the overall consideration of the six performance factors. The weights, which imply priorities or preferences over the performance aspects, should be chosen at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters.


Table 12: Performance indicators of the six methods based on dataset 2.

Method  Coverage  Density  Time (s)  Log-likelihood  Overlap  Diff. of balance
KM      0.595751  3896873  0.41      −17.35          No       190
XM      0.533037  3486653  0.67      −17.22          No       185
EM      0.507794  6819714  1.23      −16.57          Yes      1216
DBScan  0.461531  8230647  15.67     −17.54          Yes      2517
HC      0.677124  5981504  14.78     −20.13          Yes      103
LP      0.711025  5440447  7.76      N/A             No       0

Table 13: Comparison of the different clustering and LP methods by the $G_{net}$ indicator.

Methods    KM    XM    EM    DBScan  HC    LP
$G_{net}$  1.08  1.15  1.11  1.23    1.00  1.32

Such spatial clusters serve purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify optimal spatial groups that have certain sizes and positions, using clustering algorithms or their equivalent, for obtaining the maximum total coverage. Examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial clustering algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The adopted factors were shown to play a significant role in MAUT (multiattribute utility theory). The performance under the chosen factors and weights may vary, as the factors can be arbitrarily chosen by users.

The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, there has been no study reported in the literature, as far as the authors are aware, that uses the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new LP method to obtain spatial groups that yield maximum coverage and completely avoid overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.

For future extended study, we want to further enhance the algorithm, such as by combining the LP method with existing spatial grouping algorithms to form new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It would be good if the advantages of one algorithm could be carried over to the others in the new fusion algorithms to be developed.

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.

[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.

[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.

[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.

[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.

[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.

[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.

[8] J. Domingo, G. Ayala, and M. E. Diaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.

[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.

[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.

[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.

[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.

[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.


[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.

[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.

[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.

[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.

[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.

[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.

[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.

[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale, event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.

[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.

[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.

[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.

[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.

[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.


4 International Journal of Distributed Sensor Networks

Real world Vector Rasterimage

Figure 1 Representation of how a real-world spatial area is represented by vector and raster encoding formats

13121110

987654321

0 1 2 3 4 5 6 7 8 9 10 11 12 13Width

Hei

ght

Colu

mn

Row

x y coordinates are 9 3

Figure 2 Vector format

80 74 62 45 45 34 39 56

80 74 74 62 45 34 39 56

74 74 62 62 45 34 39 39

62 62 45 45 34 34 34 39

45 45 45 34 34 30 34 39

Figure 3 Raster format in ordered list

collected spatial data In this process different methods aretested for choosing the one which covers themost area as wellas the highest feature values from the suggested clustersTheflow of this process including preprocessing of sensor data

515 519 521 523

523

523517 528 527

512 510 520

518511 512 516

514

517

511

510

512 516 517 520513

Figure 4 Raster data with center point

80

80

74

62

45

74

74

74

62

45

62

74

62

45

45

45

62

62

45

34

45

45

45

34

34

34

34

34

34

30

39

39

39

34

34

56

56

39

39

39

Figure 5 Raster format with 22 and 32 grids

data transformation clustering and finding cluster center-points is shown in Figure 6

In case of a satellite image or image captured by fighter-jetor other surveillance camera image processing is needed to

International Journal of Distributed Sensor Networks 5

Load spatial image

RGB image

Gray imageSkeleton extraction

Morphological operation in MATLAB (Bwmorph)

Zhangrsquos algorithmare used forcomparison

Two-tone image

Indexed grid image

2D special data

Method comparison

Griddingindexing image

Numerical dataset (with normalization)

Spatial grouping

Hierarchical K-means LPDBScan

Color map

Output

2 algorithms

middot middot middot

Preprocessing of imageData transformation

GroupingDisplay

Figure 6 Workflow of proposed methodology

extract the density information from pictures But in our caseof sensor network we can safely assume that the data fed fromanet of sensorswould have the sensor ID attachedThe sensorIDs are known so are their positions From the locations ofthe sensors and their sensor ID we could possibly relate thedata that was collected to their corresponding locations inthe 119909-119910 format of coordinates (assume the terrain is of 2D)In order to reduce the huge amount of calculation and storagespace a grid was used to divide the whole map into smallerpieces The grid indexing operation is repeated for a range ofdifferent coarse layers thereby providing different resolutionsof data partitions Similar technique is reported in [22] whichis computed by Euclidian distance Obviously the methodof grid indexing helps separate data into cells based on theirgeographic locations

To obtain a better result of spatial groups for maximumcoverage and its corresponding cluster center point with cer-tain constrains the research adopts several popular cluster-ing methods and linear programming method by using soft-ware programs such as XLMiner (httpwwwsolvercomxlminer-data-mining) MATLAB (httpwwwmathworkscomproductsmatlab) and Weka (httpwwwcswaikatoacnzmlweka)

The core purpose of cluster analysis is to comprehend andto distinguish the extent of similarity or dissimilarity amountof the independent clustered objects There are five majormethods of clusteringmdashKM EM XM HC and DBScan

119870-means (KM) byMacQueen 1967 is one of the simplestalgorithms that solve thewell-known clustering problem [23]It is an easy and simple method to divide a dataset into acertain number of clusters initially assuming that the numberof clusters is 119896 fixed a priori for each cluster which is themainidea The random choice of the initial location of centroids

leads to various results A better choice is to place them asmuch far away from each other as possible

The KM algorithm aims at minimizing an objective fun-ction In this case a squared error function is as follows

119895 = sum

forall119894

sum

forall119895

10038171003817100381710038171003817119909119894(119895) minus 119888

119895

10038171003817100381710038171003817

2

(1)

where 119895 ranges from 1 to 119896 119894 range from 1 to 119899 and119909119894(119895) minus 119888

1198952 is a chosen distance measure between a data

point 119909119894(119895) and the cluster center 119888

119895 which is an indicator of

the distance of the 119899 data points from their respective clustercenters The sum of distances or sum of squared Euclideandistances from the mean of each cluster is a quite normal orusual measure for causing scattering in all directions in thecluster in order to test the suitability of the KM algorithmClusters are often computed using a fast heuristic methodwhich generally produces good (but not necessarily optimal)solutions

X-Means [24] is an optimal method of KM whichimproves structure part in the algorithm Division of thecenters is attempted in its region It makes decision betweenthe root and children of each center to doing the comparisonbetween the two structures Another improved variant ofKM called EM which execrates maximization makes anassignment on a probability distribution to each further pointwhich represents the probability Howmany clusters to be setup are to be decided by EM using cross-validation

Density-based algorithms regard clusters as dense areasof objects that are separated by less dense areas [25] Becausethey have no limit to look for clusters with spherical shapethey can produce clusters with arbitrary shapes DBScan isa typical implementation of density-based algorithms calleddensity-based spatial clustering of applications with noise

6 International Journal of Distributed Sensor Networks

(a) (b) (c)

Figure 7 Illustration of possible ways of assigning clusters for maximum (a) fish population (b) altitude of terrain and (c) human inhabitantpopulation

[25] The notions of density reachability and density con-nectivity are used as performance indicators for the qualityof clustering [26] A cluster is composed of the group ofobjects in a dataset that are density connected to a particularcenter Any object that falls beyond a cluster is considered asnoise

Ward proposed a clustering method called hierarchicalclustering (HC) in 1963 [27] It tries to find how to formsomething to divide 119875

119899 119875119899minus1

1198751in a way that reduces

the relationship with each group In each step analysisstep it considered every possible cluster pair in group andcombined the two clusters with a very close joining of resultsin ldquoinformation lossrdquo which is given definition by Wardaround ESS (an error sum-of-squares criterion)The idea thatsupports Wardrsquos proposal can be described most simply bythinking of a little single data Take ten objects with scores asan example (2 7 6 6 7 2 2 0 2 0) The loss of informationwould be achieved by calculating ESS with a mean of 34which takes into account the ten scores as a unit as followsESS One group = (2 minus 34)

2+ (7 minus 34)

2+ sdot sdot sdot + (0 minus 34)

2=

4728 However those 10 objects can also be separated intofour groups according to their scores 0 0 0 2 2 2 26 6 and 7 7 Finally for evaluation of the ESS as a sum ofsquares we can obtain four independent error sums of eachsquare Overall the result that divides the 10 objects into 4clusters has no loss of information as follows

ESS One group = ESS group1 + ESS group2

+ ESS group3 + ESS group4 = 0

(2)

The last method we adopted here is linear programming(LP) which contains instituting and producing an answerto optimization problems with linear objective functionsand linear constraints This powerful tool can be used inmany fields especially where many options are possible inthe answers In spatial grouping over a large grid manypossible combinations of positioning the clusters exist Theproblem here is to find a certain number of clusters of

equal size over the area meanwhile the chosen centers ofthe clusters must allow sufficient distance apart from eachother so as to avoid overlapping As an example shownin Figure 7 three clusters would have to be assigned overa spatial area in a way that they would have to covercertain resources The assignment of the clusters howeverwould have to yield a maximum total value summed fromcovered resources In the example the left diagram showsallocating three clusters over the deepwater assuming that theresources are fish hence the maximum harvest The secondexample in the middle of Figure 7 is clustering the highaltitude over the area The last example is trying to coverthe maximum human inhabitants which are concentratedat the coves Given many possible ways of setting up theseclusters LP is used to formulate this allocation problemwith an objective of maximizing the values of the coveredresources

Assuming that the resources could be dynamic forexample animal herds or moving targets whose positionsmay swarm and change over time the optimization is atypical maximal flow problem (or max flow problem) Theoptimization is a type of network flow problem in whichthe goal is to determine the maximum amount of flowthat can occur over arc whish is limited by some capacityrestriction This type of network might be used to modelthe flow of oil in pipeline (in which the amount of oil thatcan flow through a pipe in a unit of time is limited by thediameter of the pipe) Traffic engineers also use this type ofnetwork to determine the maximum number of cars that cantravel through a collection of streets with different capacitiesimposed by the number of lanes in the streets and speed limits[28]

For our spatial clustering we consider each cell of the gridas a node each node is defined as a tuple119898 that contains thecoordinates and the value of the resource that is held in thenode such that 119898(119909

119894 119910119894 119911119894) represents an 119894th node in which

119909119894 119910119894represent the position and 119911

119894represents the value of

resource in the node respectively For the clusters each node

International Journal of Distributed Sensor Networks 7

(1) Load the grid-based spatial information into array 119860(119909 119910 119911) 119860 is a three dimensional array(2) Repeat (through all coordinates of 119909)(3) Repeat (through all coordinates of 119910)(4) If (boundary constraints and overlapping constraints are satisfied) Then(5) 119878(119909

119894 119910119894 119911119894) = 119860(119909

119894 119910119894 119911119894)

(6) End-if(7) End-loop(8) End-loop(9) If size of (119878) ge 119870

(10) Find top 119870 clusters where maxsum119911119894⨁119862119896 copy 119878(119909

119894 119910119894 119911119894) to new array 119862(119909

119894 119910119894 119911119894) forall119894 isin 119862

119896

(11) Else-if(12) 119862(119909

119894 119910119894 119911119894) = 119878(119909

119894 119910119894 119911119894) forall119894

(13) End-if

Pseudocode 1 Pseudocode of the proposed LP model for spatial clustering

can potentially be a center of a cluster and the cluster hasa fixed radius of length 119903 The LP model for our problem ismathematically shown as follows

Total value = ⋃

selected clusters ⟨119862119896|119896=1sdotsdotsdot119870⟩sum

119898119894isin119862119896

119898119894(lowast lowast 119911

119894)

= argmax119883119884

sum

0le119909119894le119883

0le119910119894le119884

119870

sum

119896=1

119911119897ni 119898119897(119909119894 119910119895 119911) oplus 119888

119896

(3)

Subject to the boundary constraints of 2r le |119909119894minus 119909119895| and 2r

le |119909119894minus 119909119895| for all 119894 and 119895 but 119894 = 119895 where119883 is the maximum

width and 119884 is the maximum length of the 2D spatial arearespectively 119896 isin 119870 is the maximum number of clusters and119888119896is the 119896th cluster under consideration in the optimizationIn order to implement the computation as depicted in

(3) for each node we sum each group resources in a shapeof diamond (which geometrically approximates a circle) Byiterating through every combination of119870 nodes in the grid ofsize 119883 by 119884 each current node in the combinations is beingtested by considering it as the center of a cluster that has aradius of r hence storing the resource values of the nodesfrom the potential clusters into a temporary array buffer119860(lowast lowast 119911

119894) The results from those potential clusters which

do satisfy the boundary and nonoverlapping constraints arethen copied to a candidate buffer Out of the clusters whoseresource values are stored in the candidate buffer 119878 thecombination of 119870 clusters that has the great total resourcevalue is selected and their values are placed in the final buffer119862 The corresponding pseudocode is shown in Pseudocode 1

5 Experimental Results and Analysis

In this section the performance of the proposed methodol-ogy is shown by presenting both numerical and visualizedresults for all performance aspects over various algorithms Acase study of road traffic is used in the experimentThe spatialarea is a metropolitan traffic map with roads and streetsspanning all over the place The resource value in this case isthe concentration or density of vehicle traffic flows Sensorsare assumed to have been deployed in every appropriate point

Table 1 Comparison between Bwmorph function and thinningalgorithm

Bwmorph function Thinning algorithmDataset 1 Dataset 2 Dataset 1 Dataset 2

Degree ofthinning Incomplete Complete

Elapsed time(secs) 20 38 100 198

Complexity 119874(119899) 119874(1198992)

of the roads thereby a typical traffic volume is each of thesepoints is known The optimization of spatial clustering inthis case can be thought of as optimal resource allocationfor example cost-effective police patrols gas stations orenvironment-pollution controls are needed among thosedense traffic spots

51 Data Preprocessing Two different factual datasets areused for experiments The first dataset is published byMaricopa Association of Governments in 2008 which isa traffic volume map Traffic volumes were derived fromthe national traffic recording devices Seasonal variation isfactored into the volumes The second dataset is an annualaverage daily traffic of Baltimore County Traffic VolumeMapin 2011 in USA prepared by the Maryland Department ofTransportation and published by March 19 2012 The trafficcount estimates are derived by taking 48-hourmachine countdata and applying factors frompermanent count stationsThetraffic counts represent the resource values in a general sense

After using skeleton extraction a two-tone image wasobtained from the original map Readers are referred to therespective websites where they can see the traffic volume datathat are associated with our two datasets (a) Representativetraffic volume map of dataset 1mdashTraffic Volume Map ofPhoenix AZUSA (httpphoenixgovstreetstrafficvolume-map) (b) Representative traffic volume map of dataset2mdashTraffic Volume Map of Baltimore MD USA (httpwwwmarylandroadscomTraffic Volume MapsTraffic VolumeMapspdf) And the corresponding result skeleton extraction

8 International Journal of Distributed Sensor Networks

Table 2 Important statistics from the clustering and LP experiments

Method Cluster number Number of cells covered Minimum Maximum Overlap

KM

Cluster 1 428 0 3499327 0Cluster 2 468 0 546896 0Cluster 3 448 0 20503007 0Cluster 4 614 0 6894667 0Cluster 5 618 0 900908 0

XM

Cluster 1 615 0 591265 0Cluster 2 457 0 546896 0Cluster 3 609 0 900908 0Cluster 4 465 0 3499327 0Cluster 5 430 0 20503007 0

EM

Cluster 1 1223 0 2292 61817229Cluster 2 7 141048 243705 313018Cluster 3 81 0 3033733 131146577Cluster 4 64 26752 546896 330881249Cluster 5 1201 0 1300026 217950471

DB

Cluster 1 13 23614 33146 327222911Cluster 2 11 1686825 21001 363965818Cluster 3 13 178888 2945283 196118393Cluster 4 11 847733 211008 58940877Cluster 5 2528 0 546896 20554176

HC

Cluster 1 291 0 3499327 0Cluster 2 191 0 20503007 96762283Cluster 3 294 0 1590971 0Cluster 4 224 0 189812 12673555Cluster 5 243 0 546896 0

LP

Cluster 1 221 0 3499327 0Cluster 2 221 0 20503007 0Cluster 3 221 0 1590971 0Cluster 4 221 0 189812 0Cluster 5 221 0 546896 0

Table 3 Comparison for running time of the first dataset

Formats KM HC DBscan XM EM LPVector database 327 1252 2324 278 930 183Raster database 342 1536 2820 284 984 201RasterP (16 grids) 198 134 508 046 057 078RasterP (25 grids) 009 014 115 021 012 053

in dataset 1 is shown in Figure 8 where (a) adopted a kind ofmorphological operation method and (b) adopted thinningalgorithm respectively Likewise the corresponding resultskeleton extraction in the second dataset is shown inFigure 9 where (a) adopted a kind of morphologicaloperation method and (b) adopted thinning algorithmrespectively The comparison result of the two datasets isshown in Table 1

For the raw dataset we firstly perform the image prepro-cessing over it to obtain numerical database

The results of the skeleton extraction as shown in Figures8(b) and 9(b) are more clearly and useful for the following

processing Subsequently the clustering by grid can bereadily obtained from the preprocessed imagesThe extent ofimage thinning is better and more complete by the thinningalgorithm than the Bwmorph function in MATLAB But theelapsed time is longer due to a two-layer iteration nestingprocedure in the program code

The choice of placing a grid on the image follows oneprinciple mesh segmentation is not trying to fall on a con-centrated position of traffic flow Since there is no endpointthe midpoint of the two adjacent values was considered ademarcation point Under this assumption the traffic flow ineach grid is calculated and stored digitally in an Excel file Adigital data for the trafficmap serves as the initial data for thesubsequent clustering process

52 Comparison Result of KM and HC Clustering InXLMiner two methods were used to perform clustering KMand HC In order to compare the two methods for the twodatasets input variables were normalized and the numberof clusters is set at five and maximum iterations at 100 Theinitial centroids are chosen randomly at start Furthermore

International Journal of Distributed Sensor Networks 9

Table 4 Comparison for log-likelihood of first dataset

Formats KM HC DBScan XM EMVector database minus1241868 minus1407265 minus1328599 minus119533 minus1249562Raster database minus1342238 minus1502863 minus1378889 minus129632 minus1339769RasterP (16 grids) 1262264 minus1402266 minus1248583 minus1239419 minus1244993RasterP (25 grids) minus1241868 minus1319417 minus1122207 minus1248201 minus1162048

(a) (b)

Figure 8 (a) Result of skeleton extraction in dataset 1 using Bwmorph function (b) Result of skeleton extraction in dataset 1 using thinningalgorithm

(a) (b)

Figure 9 (a) Result of skeleton extraction in dataset 2 using Bwmorph function (b) Result of skeleton extraction in dataset 2 using thinningalgorithm

Table 5 Comparison for running time of the second dataset

Formats KM HC DBScan XM EM LPVector database 139 134 1553 153 1005 337Raster database 241 1478 1834 217 823 196RasterP (16 grids) 047 801 1274 045 377 144RasterP (25 grids) 035 620 1098 036 296 118

the weights for the corresponding three attributes (119909 119910 V)for each grid (119892

119894= (119909119894 119910119894 V119894)) based on defining weight of

119909 and 119910 could be varied (fine-tuned) and the sum of weightsmust be equal to 1 We tested several variations searching forthe best clustering results (1) weight of V is 20 (2) weightof V is 40 (3) weight of V is 50 (4) weight of V is 60 (5)weight of V is 80 (6) all of them have same weight at 333

(7) weight of V is 0 (8) same weight except when 119892119894(V119894= 0)

and (9) weights of 119909 and 119910 are both 0 except when 119892119894(V119894= 0)

In HC method normalization of the input data waschosen Another option available is similarity measure Itadopts Euclidean distance to measure raw numeric dataMeanwhile the other two options Jaccardrsquos coefficients andmatching coefficient are activated only when the data isbinary

For the above nine cases results of cases (1) to (6) aresimilar in their separate methods And result of (9) is theworst which does not accomplish any clustering Results ofcases (2) (3) (7) and (8) are demonstrated in Figure 10

For the distribution of clusters in the result of KMclustering method more than half of data points are clampedinto one oversized cluster The result of this method istherefore not helpful for further operation For HC methoddata on average are allocated into separate clustersThe result

10 International Journal of Distributed Sensor Networks

Table 6 Comparison for log-likelihood of second dataset

Formats KM HC DBScan XM EMVector database minus1735412 minus1962367 minus1753576 minus1721513 minus1657263Raster database minus1815926 minus2012568 minus1970756 minus1815791 minus1848209RasterP (16 grids) minus1551437 minus1724736 minus1637147 minus1701283 minus1566231RasterP (25 grids) minus1484761 minus1663789 minus1509146 minus1667312 minus1647823

1

2

5

4

41

3

32

5

(a)

1

2 3

4

5

4 51

3

2

(b)

5

5

4

1

4

1 3

3

2

2

5

(c)

2

2

4

3

3

1 5

5

4

1

(d)

Figure 10 (a) Clustering results for the first dataset with setting case (2) where weight of V is 40 top half uses KM clustering methodand bottom half uses HC method (b) Clustering results for the first dataset with setting case (3) where weight of V is 50 top half uses KMclustering method and bottom half uses HC method (c) Clustering results for the first dataset with setting case (7) where weight of V is 0top half uses KM clustering method and bottom half uses HC method (d) Clustering results for the first dataset with setting case (3) whereall share the same weight except 119892

119894(V119894= 0) top half uses KM clustering method and bottom half uses HC method

Table 7 Comparison of running time (in seconds) of four differentsizes of dataset

Dataset size KM HC DBScan XM EM LP100 grid cells 006 007 105 219 321 0184600 grid cells 042 295 3989 273 1905 93710000 grid cells 262 4667 9755 297 3785 242180000 grid cells 1975 18961 684 647 19831 9083

in Figure 10(c) is the best showing only the one with distinctposition attributes (119909 and 119910) The other three results (Figures10(a) 10(b) and 10(d)) are stained with cluster overlapsTherefore allocation of critical resource for example in eachcluster may result in a waste of resources The degree ofoverlap is the least in the result of Figure 10(b) If only locationis being considered the result of Figure 10(c) is the bestchoice Otherwise the result in Figure 10(b) is better than theother two for the sake of cluster distribution

The clustering results of the second dataset performanceby using the two methods KM and HC are shown inFigure 11

From the results of the cluster distribution of the seconddataset obtained by both clustering methods the size of eachcluster is more or less similar which is better than that ofthe first dataset And there is no overlap phenomenon inthe KM results This is a promising feature of KM methodfor spatial clustering However there is little overlap in theresult of HC method as the clusters seem to take irregularshapes Above all for the second dataset KM is a better choicefor consideration of even cluster distribution and overlapavoidance by using both clustering methods

53 Results of Grouping In this part we compare the coloredmap of Raster (119909 119910 V) data model in two datasets usingfive clustering methods in Weka and the LP method Thecommon requirement is no overlap for each of the resultingmaps The number of cluster is arbitrarily chosen at five The

International Journal of Distributed Sensor Networks 11

Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.124360  0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.304300  0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469000  0.711025


Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.

The result for the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result; the second part (ii) visualizes the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.
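The center computation and group visualization described above can be sketched as follows. This is an illustrative reconstruction, assuming matplotlib; the group is drawn as a fixed-radius circle around each cluster's mean position, and the radius value is a placeholder.

```python
# Sketch: compute cluster centers and draw a fixed-radius group around each.
import numpy as np
import matplotlib.pyplot as plt

def plot_groups(xy, labels, radius=0.1, k=5):
    ax = plt.gca()
    for c in range(k):
        members = xy[labels == c]
        ax.scatter(members[:, 0], members[:, 1], s=8)
        center = members.mean(axis=0)             # computed cluster center
        ax.add_patch(plt.Circle(center, radius, fill=False))
    ax.set_aspect("equal")
    plt.show()

rng = np.random.default_rng(1)
xy = rng.random((300, 2))                         # grid-cell coordinates
labels = rng.integers(0, 5, size=300)             # labels from any clusterer
plot_groups(xy, labels)
```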

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure; the corresponding groups exhibit the overlap phenomenon, too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters seem similar to each other, and there is no overlap in the clustering result; for the group result, however, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application, such as information retrieval (several thematics for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and the corresponding groups, KM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.

With the same experiment setup and operating environment, the spatial clustering experiments are performed over the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.

In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and that there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM, however, avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, after we remove the empty cells at the boundary to reduce the size of the dataset, the clustering result is perfect: there is no overlap, and the clusters are balanced with each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping all groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
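For intuition, the selection problem that the LP method solves here, namely placing K fixed-radius, non-overlapping groups so that the covered traffic volume is maximal, can also be approximated greedily. The sketch below only illustrates the objective and the non-overlap constraint (enforced as a minimum center distance of 2r in the L1 metric, which gives diamond-shaped groups); it is not the paper's LP formulation, which searches the candidate combinations exhaustively.

```python
# Greedy sketch of the grouping objective: pick k diamond-shaped groups of
# radius r, maximizing covered volume, with no two centers closer than 2r.
import numpy as np

def greedy_groups(volume, k=5, r=3):
    X, Y = volume.shape
    ii, jj = np.ogrid[0:X, 0:Y]
    # coverage value of a candidate group centered at each cell (L1 ball)
    score = np.empty_like(volume, dtype=float)
    for i in range(X):
        for j in range(Y):
            score[i, j] = volume[(abs(ii - i) + abs(jj - j)) <= r].sum()
    centers = []
    while len(centers) < k and np.isfinite(score).any():
        i, j = np.unravel_index(np.argmax(score), score.shape)
        centers.append((i, j))
        # forbid any further center within 2r (non-overlap constraint)
        score[(abs(ii - i) + abs(jj - j)) < 2 * r] = -np.inf
    return centers

volume = np.random.default_rng(3).random((40, 40))   # toy volume per cell
print(greedy_groups(volume))
```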

By visually comparing the clustering results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those in the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information for dataset 2 is collected and shown in Table 2. The numeric results in


Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.

Table 3 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the numbers of cells covered by the clusters; the amount of overlap in HC is also the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.

6. Technical Analysis of Clustering Results

6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups produced by clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion using the same software on the same computer. Balance measures the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of the traffic volumes that are covered by the grid cells within the cluster over the whole dataset.


Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial groups by the LP method on dataset 2.


Meanwhile, total coverage is the sum of the traffic volumes that are covered by all the clusters, minus the overlap, if any. The corresponding definitions are shown in the equations below:

Density(cluster i) = Σ TrafficVolumes(cluster i) / GridCellNumber(cluster i),

Coverage(cluster i) = Σ TrafficVolumes(cluster i) / Σ TrafficVolumes,

Total Coverage = Σ_i Coverage(cluster i) − Overlaps,

Proportion of Cluster(i) Size (Balance) = GridCellNumber(cluster i) / Σ GridCellNumber.    (4)
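A small sketch of how these factors can be computed from a per-cell cluster assignment is given below. The array names and encodings are our assumptions: a label of -1 marks an unclustered cell, and a per-cell membership count stands in for overlap.

```python
# Sketch of the evaluation factors in (4): density, coverage, balance, and
# total coverage, computed from one cluster label per grid cell.
import numpy as np

def evaluate(volumes, labels, member_counts, k=5):
    total_volume = volumes.sum()
    total_cells = len(volumes)
    for c in range(k):
        in_c = labels == c
        density = volumes[in_c].sum() / max(in_c.sum(), 1)   # volume per cell
        coverage = volumes[in_c].sum() / total_volume        # share of volume
        balance = in_c.sum() / total_cells                   # share of cells
        print(f"cluster {c}: density={density:.4f} "
              f"coverage={coverage:.4f} balance={balance:.4f}")
    overlap = volumes[member_counts > 1].sum() / total_volume
    covered = volumes[labels >= 0].sum() / total_volume
    print("total coverage:", covered - overlap)

rng = np.random.default_rng(2)
volumes = rng.random(100)                  # traffic volume per grid cell
labels = rng.integers(-1, 5, size=100)     # -1 marks an unclustered cell
member_counts = (labels >= 0).astype(int)  # toy data: no overlaps
evaluate(volumes, labels, member_counts)
```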

6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of datasets on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means that every four neighboring cells over a grid are merged into a single unit; and RasterP (25 grids) means that every five neighboring cells over a grid are merged into one. In the latter two formats, the data information is laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated from the grid. We selected grids of sizes 16 and 25 for the two formats. The original datasets are then encoded in the four different data formatting types, and the formatted data are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.
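The RasterP coarsening can be illustrated as a block aggregation over the raster grid. The exact block size behind "16 grids" and "25 grids" is ambiguous in the text, so the parameter k in the sketch below is a placeholder.

```python
# Sketch of RasterP coarsening: merge each k-by-k block of neighboring grid
# cells into one unit by summing their traffic volumes.
import numpy as np

def rasterp(grid, k):
    X, Y = grid.shape
    Xc, Yc = X - X % k, Y - Y % k          # crop so blocks divide evenly
    blocks = grid[:Xc, :Yc].reshape(Xc // k, k, Yc // k, k)
    return blocks.sum(axis=(1, 3))         # one aggregated value per block

grid = np.arange(64.0).reshape(8, 8)
print(rasterp(grid, 4).shape)              # (2, 2): each 4x4 block merged
```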

According to Table 3, KM spent the least running time on all four kinds of data, and the runtime on the RasterP (25 grids) dataset is the shortest. Contrariwise, clustering the vector dataset with the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the different datasets and DBScan the longest.

In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method gives the best value, while clustering RasterP (25 grids) using DBScan gives the worst.

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset to larger sizes by expanding the data map via duplication. Running time trends are thereby produced; the result is shown in Table 7, and the corresponding trend lines are shown in Figure 14.

According to Table 5, KM spent the shortest running time across the four different formats of data, and the time on the RasterP (25 grids) dataset is the shortest, which is expected because it abstracts every 25 cells into one.


Figure 14: Comparison of running time (in seconds) for different sizes of dataset, with exponential trend lines fitted for each method (K-means, Hierarchical, DBScan, XMean, EM, and LP).

On the other hand, clustering the Raster dataset with the DBScan method consumed the most running time. Across the six methods, KM spent the shortest time on the different datasets, and DBScan generally spent the longest.

In Table 6, we can see that the log-likelihood values of the six different methods are quite similar. Among them, clustering the Raster dataset using the HC method gives the best value, while clustering RasterP (25 grids) using KM gives the worst.

In Table 7, we can see that the slowest method is DBScan and the quickest is KM. In terms of time trend, DBScan's time consumption increases with a larger magnitude than the other methods', whereas the trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
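The trend lines of Figure 14 are exponential fits. One way to produce such a fit is linear regression on the logarithm of the running time, sketched below using the DBScan column of Table 7 as sample input.

```python
# Sketch: fit an exponential running-time model t = a * exp(b * n) by
# least-squares regression on log(t), then extrapolate.
import numpy as np

sizes = np.array([100, 4600, 10000, 80000], dtype=float)
times = np.array([1.05, 39.89, 97.55, 684.0])     # DBScan column of Table 7

b, log_a = np.polyfit(sizes, np.log(times), 1)    # linear fit on log scale
predict = lambda n: np.exp(log_a) * np.exp(b * n)
print(predict(100000))                            # extrapolated running time
```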

The following charts and tables present the other technical indicators, namely coverage, density, and balance of each cluster, for the two datasets.

From Figure 15, we can see that one cluster of DBScan dominates the biggest coverage among all clusters resulting from the six methods for the first dataset, while for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is apparently larger than in the first dataset (Tables 8 and 9). This means that the second dataset, owing to its even data distribution, is well suited for achieving spatial groups with the six methods. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.


Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700


Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.


Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.


Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a), we can see that one cluster of EM occupies the biggest density among all clusters of the six methods in the first dataset, but the LP method obtains the largest total density, evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer in the first dataset, but DBScan achieves the best results in the second dataset; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.


Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.

6.3. Discussion of G_net. For all the six evaluation factors, each of them can be an individual measure to decide whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined


Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447

in order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference of balance is contributed by the difference of grid cell numbers among the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests; for example, if a very wide coverage is the priority and the others are of less concern, ω_c can take a relatively very large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be set larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered:

G_l = |Likelihood / Time|,    (5)

G_b = Difference of Balance / Time,    (6)

G_d = Density / Time,    (7)

G_c = Coverage / Time,    (8)

G_o = Overlap / Time,    (9)

G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o,    (10)

subject to the constraint ω_l + ω_b + ω_d + ω_c + ω_o = 1.    (11)

From the results of the spatial grouping experiments in the previous sections, we obtain statistical information on each group based on the second dataset, expressed through the indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.

In Table 12, the KM method has the best running time and no overlap, and the same holds for XM; DBScan and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between the clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed, and the net performance value G_net is obtained for each method assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
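The computation of G_net and its normalization can be sketched as follows. The numbers below are toy values in arbitrary units rather than the paper's exact Table 12 figures, and equal weights are assumed.

```python
# Sketch of indicators (5)-(10): each factor divided by running time,
# weighted, summed into G_net, then normalized so the weakest method is 1.
import numpy as np

def g_net(stats, w):
    g = (w["l"] * np.abs(stats["log_likelihood"]) / stats["time"]
         + w["b"] * stats["diff_balance"] / stats["time"]
         + w["d"] * stats["density"] / stats["time"]
         + w["c"] * stats["coverage"] / stats["time"]
         + w["o"] * stats["overlap"] / stats["time"])
    return g / g.min()      # normalization as in Table 13

stats = {  # toy values in arbitrary units, not the paper's exact figures
    "log_likelihood": np.array([-1735.0, -1722.0, -1657.0]),
    "diff_balance":   np.array([190.0, 185.0, 1216.0]),
    "density":        np.array([390.0, 349.0, 682.0]),
    "coverage":       np.array([0.596, 0.533, 0.508]),
    "overlap":        np.array([0.0, 0.0, 1.0]),
    "time":           np.array([0.41, 0.67, 1.23]),
}
print(g_net(stats, {k: 0.2 for k in "lbdco"}))   # equal weights summing to 1
```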

According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This has been tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the result is not so consistent, as LP is outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which imply priorities or preferences on the performance aspects, should be chosen at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently contain spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters


Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density  Time   Log-likelihood  Overlap  Diff. of balance
KM       0.595751   3896873  0.41   -1735           No       190
XM       0.533037   3486653  0.67   -1722           No       185
EM       0.507794   6819714  1.23   -1657           Yes      1216
DBScan   0.461531   8230647  15.67  -1754           Yes      2517
HC       0.677124   5981504  14.78  -2013           Yes      103
LP       0.711025   5440447  7.76   N/A             No       0

Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32

for purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods that identify such optimal spatial groups of certain sizes and positions, using clustering algorithms or their equivalent, for obtaining the maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users with differing usage demands, distributed sensors that monitor the traffic volumes over a city, and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient for achieving the optimal maximum coverage. Five classic spatial clustering algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.

The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, to the best of the authors' knowledge, no study has been reported in the literature that uses the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.

For future extended study, we want to further enhance the algorithm, such as by combining the LP method with existing spatial grouping algorithms to achieve new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It would be good if the advantages of one algorithm could be carried over to the others in the new fusion algorithms to be developed.

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.
[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.
[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.
[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.
[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.
[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.
[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.
[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.
[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.
[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.
[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.
[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.
[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.
[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.
[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.
[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.
[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.
[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.
[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.
[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.
[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale, event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.
[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.
[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.
[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.


For the above nine cases results of cases (1) to (6) aresimilar in their separate methods And result of (9) is theworst which does not accomplish any clustering Results ofcases (2) (3) (7) and (8) are demonstrated in Figure 10

For the distribution of clusters in the result of KMclustering method more than half of data points are clampedinto one oversized cluster The result of this method istherefore not helpful for further operation For HC methoddata on average are allocated into separate clustersThe result

10 International Journal of Distributed Sensor Networks

Table 6 Comparison for log-likelihood of second dataset

Formats KM HC DBScan XM EMVector database minus1735412 minus1962367 minus1753576 minus1721513 minus1657263Raster database minus1815926 minus2012568 minus1970756 minus1815791 minus1848209RasterP (16 grids) minus1551437 minus1724736 minus1637147 minus1701283 minus1566231RasterP (25 grids) minus1484761 minus1663789 minus1509146 minus1667312 minus1647823

1

2

5

4

41

3

32

5

(a)

1

2 3

4

5

4 51

3

2

(b)

5

5

4

1

4

1 3

3

2

2

5

(c)

2

2

4

3

3

1 5

5

4

1

(d)

Figure 10 (a) Clustering results for the first dataset with setting case (2) where weight of V is 40 top half uses KM clustering methodand bottom half uses HC method (b) Clustering results for the first dataset with setting case (3) where weight of V is 50 top half uses KMclustering method and bottom half uses HC method (c) Clustering results for the first dataset with setting case (7) where weight of V is 0top half uses KM clustering method and bottom half uses HC method (d) Clustering results for the first dataset with setting case (3) whereall share the same weight except 119892

119894(V119894= 0) top half uses KM clustering method and bottom half uses HC method

Table 7 Comparison of running time (in seconds) of four differentsizes of dataset

Dataset size KM HC DBScan XM EM LP100 grid cells 006 007 105 219 321 0184600 grid cells 042 295 3989 273 1905 93710000 grid cells 262 4667 9755 297 3785 242180000 grid cells 1975 18961 684 647 19831 9083

in Figure 10(c) is the best showing only the one with distinctposition attributes (119909 and 119910) The other three results (Figures10(a) 10(b) and 10(d)) are stained with cluster overlapsTherefore allocation of critical resource for example in eachcluster may result in a waste of resources The degree ofoverlap is the least in the result of Figure 10(b) If only locationis being considered the result of Figure 10(c) is the bestchoice Otherwise the result in Figure 10(b) is better than theother two for the sake of cluster distribution

The clustering results of the second dataset performanceby using the two methods KM and HC are shown inFigure 11

From the results of the cluster distribution of the seconddataset obtained by both clustering methods the size of eachcluster is more or less similar which is better than that ofthe first dataset And there is no overlap phenomenon inthe KM results This is a promising feature of KM methodfor spatial clustering However there is little overlap in theresult of HC method as the clusters seem to take irregularshapes Above all for the second dataset KM is a better choicefor consideration of even cluster distribution and overlapavoidance by using both clustering methods

53 Results of Grouping In this part we compare the coloredmap of Raster (119909 119910 V) data model in two datasets usingfive clustering methods in Weka and the LP method Thecommon requirement is no overlap for each of the resultingmaps The number of cluster is arbitrarily chosen at five The

International Journal of Distributed Sensor Networks 11

Table 8 Numeric results of coverage of each cluster by using the six methods for dataset 1

Cov-db1 KM EM DBScan XM HC LPCluster 0 0029436 0003786 0017902 0075178 0013153 0028985Cluster 1 0301538 0269602 0208078 0049761 0026016 0377034Cluster 2 0215277 0001627 0158439 0084049 012436 0080099Cluster 3 0046788 0096221 0079177 0209390 0001172 0217204Cluster 4 0002712 0161799 0044197 0043152 03043 0007704Total coverage 0595751 0533036 0507793 0461531 0469 0711025

41 5

3 2

(a)

4

13

5

2

(b)

Figure 11 (a) Clustering results for the second dataset by usingKMmethod (b) Clustering results for the second dataset by usingHCmethod

result of first dataset is shown in Figure 12 The first part (i)of Figure 12 shows the spatial clustering result the secondpart (ii) visualizes the corresponding spatial groups by using(a) EM method (b) KM method (c) HC method (d) XMmethod and (e) DBScan method The centers of the clustersare computed after clustering is done and then the groupsare visualized over the clustering results according to thecomputed centers

In Figure 12 for the results of (a) and (e) the sizes ofclusters are quite uneven more than half of dataset fall intoone cluster Thus this result reveals a fact that the techniquecannot organize a dataset into homogeneous andor well-separated groups with respect to a distance or equivalentlya similarity measureThe corresponding groups have overlapphenomenon too For the result of (c) the sizes of the clustersare uneven too For the result of (b) and (d) the sizes of clusterseem to be similar to each other There is also no overlapin the clustering result but for group result the groups in(d) have far more overlaps than those in (b) Overlap meanssome part or the cluster gets in the way of another onewhichmeans that there is superposition between two ormoredifferent clusters Again itmay cause resourcewaste and evenfalse allocation This situation occurs in important fields ofapplications such as information retrieval (several thematicfor a single document) and biological data (several metabolicfunctions for one gene) For this reason (b) is better than (d)According to the above analysis for the result of clusteringand corresponding groups (d) XM is so far the best choiceof clustering algorithm as evidenced by the colored mapsthereafter

With the same experiment setup and operating environ-ment the spatial clustering experiments are performed overthe second dataset The results of second dataset are shown

in Figure 13 where (i) represents the spatial clustering resultand (ii) represents the corresponding spatial group by using(a) EM method (b) KM method (c) HC method (d) XMmethod and (e) DBScan method

In Figures 13(a) and 13(e) it is noticed that the clustersare imbalanced and there are overlaps in the correspondingspatial groups using the method of (a) EM and (e) DBScanThe results of (b) KM and (d) XM however avoid theshortcomings of (a) and (e) though they still have slightoverlaps For (c) HC we remove the empty cells in theboundary to reduce the size of dataset the clustering resultis perfect There is no overlap and clusters are balancedbetween each other But there is still overlap in the spatialgroups Thus LP method is adopted to solve this problemand in possession of same size of groups The result of LPmethod yields perfectly balanced groups without any overlapas shown in Figure 13(f)

By visually comparing the results for the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. Occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those in the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected, and it is shown in Table 2. The numeric results in


Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.

Table 2 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the numbers of cells covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.

6. Technical Analysis of Clustering Results

6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion, using the same software on the same computer. Balance measures the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests for goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of traffic volumes that are covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of


Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial group of the LP method on dataset 2.


the traffic volumes that are covered by all the clusters, minus the overlap, if any. The corresponding definitions are shown in the equations below.

$$\text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)}$$

$$\text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Traffic Volumes}}$$

$$\text{Total Coverage} = \sum_{i}\text{Coverage}(\text{cluster } i) - \text{Overlaps}$$

$$\text{Proportion of Cluster } i \text{ Size (Balance)} = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}} \tag{4}$$
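As an illustration only (the paper does not provide code), the sketch below computes these per-cluster indicators from lists of grid-cell traffic volumes; the overlap argument is assumed to be given as a proportion of the total volume.

```python
# Minimal sketch of the metrics in (4); not the authors' implementation.
# `clusters` is a list of clusters, each a list of grid-cell traffic volumes.

def metrics(clusters, overlap=0.0):
    total_volume = sum(sum(c) for c in clusters)
    total_cells = sum(len(c) for c in clusters)
    report = []
    for i, cells in enumerate(clusters):
        report.append({
            "cluster": i,
            "density": sum(cells) / len(cells),     # volume per grid cell
            "coverage": sum(cells) / total_volume,  # share of all traffic
            "balance": len(cells) / total_cells,    # share of all grid cells
        })
    total_coverage = sum(r["coverage"] for r in report) - overlap
    return report, total_coverage

# Toy usage with three hypothetical clusters of traffic volumes:
report, total_cov = metrics([[120, 80, 40], [300, 260], [60, 40, 20, 20]])
```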

6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of datasets on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighboring cells over a grid are merged into a single unit; and RasterP (25 grids) means every five neighboring cells over a grid are merged as one. In the latter two formats, the data information is laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated from the grid. We selected grids of sizes 16 and 25 for the two formats. The original datasets are then encoded by the four different data formatting types. The four formatted datasets are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively.
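The block-merging behind the RasterP formats can be sketched as follows; summing each k-by-k neighborhood is an assumption here, since the paper does not state the aggregation function.

```python
import numpy as np

# Sketch of the RasterP encoding: merge each k-by-k neighborhood of grid
# cells into one unit (k=4 for "16 grids", k=5 for "25 grids"). The use of
# a sum as the aggregate is an assumption for illustration.

def rasterp(grid, k):
    h, w = (grid.shape[0] // k) * k, (grid.shape[1] // k) * k
    g = grid[:h, :w]                              # trim ragged edges
    return g.reshape(h // k, k, w // k, k).sum(axis=(1, 3))

coarse = rasterp(np.random.rand(40, 60), 5)       # yields an 8 x 12 merged map
```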

According to Table 3, KM spent the least running time on the four different kinds of data, and the run time on the RasterP (25 grids) dataset is the fastest. Contrariwise, clustering the Vector dataset using the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the different datasets, and DBScan took the longest.

In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main metric for quantitatively assessing the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best, while clustering RasterP (25 grids) using DBScan is the worst.

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset by expanding the data map via duplication. Running time trends are thereby produced; the result is shown in Table 7, and the corresponding trend lines are shown in Figure 14.

According to Table 5, KM spent the shortest running time on the four different formats of data, and the time on the RasterP (25 grids) dataset is the fastest, which is expected because it abstracts every 25 cells into one. On

[Figure 14 plots running time (in seconds, y-axis, 0 to 900) against dataset size (number of grid cells, x-axis, 0 to 100000) for K-means, Hierarchical, DBScan, XMean, EM, and LP, each with an exponential (Exp) trend line.]

Figure 14: Comparison of running time (in seconds) for different sizes of dataset.

the other hand, clustering the Raster dataset using the DBScan method took the most running time. Across the six methods, KM spent the shortest time on the different datasets, and DBScan generally spent the longest.

In Table 6, we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best, while clustering RasterP (25 grids) using KM is the worst.

In Table 7, we can see that the slowest is DBScan and the quickest is the KM method. In terms of time trend, DBScan increases in time consumption at a larger magnitude than the other methods, while the trend lines of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM. It means that when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
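The exponential (Exp) trend lines in Figure 14 are what a log-linear least-squares fit produces; a sketch under that assumption, using the DBScan column of Table 7:

```python
import numpy as np

# Fit log(time) linearly against dataset size, giving time ~ a * exp(b * n).
sizes = np.array([100, 4600, 10000, 80000], dtype=float)   # grid cells (Table 7)
times = np.array([1.05, 39.89, 97.55, 684.0])              # DBScan column of Table 7

b, log_a = np.polyfit(sizes, np.log(times), 1)             # slope, intercept
predict = lambda n: np.exp(log_a + b * n)
extrapolated = predict(100000)   # projected DBScan running time at 100k cells
```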

The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.

From Figure 15, we can see that one cluster of DBScan dominates the biggest coverage among all clusters produced by the six methods on the first dataset. For the second dataset, however, the LP method yields the biggest-coverage cluster. Generally, the individual coverage of each cluster in the second dataset is apparently larger than those resulting from the first dataset (Tables 8 and 9). This means that the second dataset is suitable for achieving spatial groups with the six methods, owing to its even data distribution. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.


Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2          KM        EM        DBScan    XM        HC        LP
Cluster 0        0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1        0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2        0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3        0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4        0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage   0.550000  0.573301  0.470000  0.583900  0.469000  0.599700


Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.


Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.


[Figure 17 shows six pie charts of cluster-size proportions (%) for dataset 1, over clusters 1-5: KM 4/51/36/8/1; XM 1/50/1/18/30; EM 6/22/24/30/18; DBScan 24/24/17/20/15; HC 18/17/22/19/25; LP 20/20/20/20/20.]

Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a), we can see that one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, while the LP method draws its total density evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than that of the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best results on the second dataset. DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.


[Figure 18 shows six pie charts of cluster-size proportions (%) for dataset 2; the recoverable values are KM 17/18/17/24/24, XM 24/18/24/18/17, HC 23/15/24/18/20, and LP 20/20/20/20/20.]

Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely absolute balance for the spatial groups.

6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in


Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM        DBScan    XM       HC       LP
Cluster 0      5258648  0.080823  4426289   3431892  2713810  1677869
Cluster 1      1161390  2329182   0.994949  1375497  3501739  1296230
Cluster 2      7186556  2545750   0.807500  1218667  2728017  9703279
Cluster 3      2572683  1232386   1062069   5171040  4265905  9034426
Cluster 4      5969350  142054    0.170455  1510576  4088438  1239180
Total density  1204343  1400359   4729787   1146972  1030703  6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447

order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell numbers between the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if a very wide coverage is of priority and the others are of less concern, ω_c can take a relatively very large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, which is the sum of all factor weights multiplied by the corresponding performance indicators, is a net indicator signifying how good a clustering process is, considering all the performance attributes.

$$G_l = \left|\frac{\text{Likelihood}}{\text{Time}}\right| \tag{5}$$

$$G_b = \frac{\text{Difference of Balance}}{\text{Time}} \tag{6}$$

$$G_d = \frac{\text{Density}}{\text{Time}} \tag{7}$$

$$G_c = \frac{\text{Coverage}}{\text{Time}} \tag{8}$$

$$G_o = \frac{\text{Overlap}}{\text{Time}} \tag{9}$$

$$G_{net} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o \tag{10}$$

$$\text{Constraint: } \omega_l + \omega_d + \omega_b + \omega_c + \omega_o = 1 \tag{11}$$
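A direct transcription of (5)-(10) into code might look as follows; the equal-weight default mirrors the setting used for Table 13 and is otherwise an assumption.

```python
# Sketch of the net performance indicator G_net from (5)-(11);
# illustrative only, not the authors' implementation.

def g_net(time, log_likelihood, balance_diff, density, coverage, overlap,
          w=(0.2, 0.2, 0.2, 0.2, 0.2)):
    w_l, w_b, w_d, w_c, w_o = w
    assert abs(sum(w) - 1.0) < 1e-9              # constraint (11)
    g_l = abs(log_likelihood / time)             # (5)
    g_b = balance_diff / time                    # (6)
    g_d = density / time                         # (7)
    g_c = coverage / time                        # (8)
    g_o = overlap / time                         # (9)
    return w_l * g_l + w_b * g_b + w_d * g_d + w_c * g_c + w_o * g_o  # (10)

# For a Table-13-style comparison, normalize by the smallest score:
scores = {"KM": 5.1, "LP": 6.3}                  # placeholder G_net values
base = min(scores.values())
normalized = {m: s / base for m, s in scores.items()}
```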

From the results of spatial grouping as experimented in the previous sections, we obtain statistical information on each group based on the second dataset, expressed as the range of indicators in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.

In Table 12, the KM method has the best run time and no overlap. XM, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance with the other clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as the base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.

According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This is tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the result is not so consistent, as LP would be outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under the overall consideration of the six performance factors. The weights, which imply priorities or preferences on the performance aspects, should be chosen at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently have spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for


Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage  Density   Time (s)  Log-likelihood  Overlap  Diff. of balance
KM       0.595751  3896873   0.41      −17.35          No       190
XM       0.533037  3486653   0.67      −17.22          No       185
EM       0.507794  6819714   1.23      −16.57          Yes      1216
DBScan   0.461531  8230647   15.67     −17.54          Yes      2517
HC       0.677124  5981504   14.78     −20.13          Yes      103
LP       0.711025  5440447   7.76      N/A             No       0

Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32

purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups that have certain sizes and positions, using clustering algorithms or their equivalent, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users each of whom may have different demand in usage, distributed sensors that monitor the traffic volumes over a city, and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors. Weights were also formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.

The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, there has been no study reported in the literature, as far as the authors are aware, that uses the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups yielding maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.

For future extended study, we want to further enhance the algorithm, such as by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It will be desirable if the fusion algorithms to be developed let the advantages of one algorithm carry over to the others.

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.

[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.

[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.

[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.

[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.

[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.

[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.

[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.

[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.

[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.

[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.

[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.

[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.

[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.

[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.

[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.

[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.

[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.

[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.

[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.

[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.

[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.

[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.

[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.

[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.

[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.




Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.

[25]. The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density connected to a particular center. Any object that falls beyond a cluster is considered noise.
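For illustration, a density-based run in the spirit described above can be reproduced with scikit-learn's DBSCAN as a stand-in for the Weka implementation used in the paper; the eps and min_samples values below are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Density-based clustering on (x, y) positions of grid records; eps is the
# neighborhood radius and min_samples the density threshold. Points that are
# not density connected to any cluster receive the label -1 (noise).
points = np.random.rand(500, 3)                   # columns: x, y, traffic volume
labels = DBSCAN(eps=0.08, min_samples=5).fit_predict(points[:, :2])
noise = points[labels == -1]                      # objects outside every cluster
```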

Ward proposed a clustering method called hierarchical clustering (HC) in 1963 [27]. It seeks a way to divide the data into partitions P_n, P_{n-1}, ..., P_1 that minimizes the loss associated with each grouping. In each analysis step, it considers every possible cluster pair in the group and combines the two clusters whose merger yields the smallest "information loss", which Ward defined in terms of ESS (an error sum-of-squares criterion). The idea that supports Ward's proposal can be described most simply with a small set of data. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0) as an example. The loss of information incurred by treating the ten scores as one unit, with mean 3.4, is obtained by calculating the ESS as follows: ESS (one group) = (2 − 3.4)² + (7 − 3.4)² + ⋅⋅⋅ + (0 − 3.4)² = 70.4. However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. Finally, evaluating the ESS of this partition as a sum of four independent error sums of squares, the result of dividing the 10 objects into 4 clusters entails no loss of information:

$$\text{ESS}_{\text{Four groups}} = \text{ESS}_{\text{group1}} + \text{ESS}_{\text{group2}} + \text{ESS}_{\text{group3}} + \text{ESS}_{\text{group4}} = 0 \tag{2}$$
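A quick numerical check of this example (a sketch, not from the paper):

```python
# Error sum of squares (ESS) of a group of scores about its mean.
def ess(scores):
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores)

data = [2, 7, 6, 6, 7, 2, 2, 0, 2, 0]
print(ess(data))                                            # 70.4 for one group
print(sum(ess(g) for g in ([0, 0], [2, 2, 2, 2], [6, 6], [7, 7])))  # 0.0
```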

The last method we adopted here is linear programming (LP), which consists of formulating and solving optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many options are possible in the answers. In spatial grouping over a large grid, many possible combinations of positioning the clusters exist. The problem here is to find a certain number of clusters of equal size over the area; meanwhile, the chosen centers of the clusters must be a sufficient distance apart from each other so as to avoid overlapping. As an example, shown in Figure 7, three clusters would have to be assigned over a spatial area in such a way that they cover certain resources, and the assignment of the clusters would have to yield a maximum total value summed from the covered resources. In the example, the left diagram shows allocating three clusters over deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, is clustering the high-altitude parts of the area. The last example is trying to cover the maximum number of human inhabitants, who are concentrated at the coves. Given many possible ways of setting up these clusters, LP is used to formulate this allocation problem with the objective of maximizing the values of the covered resources.

Assuming that the resources could be dynamic, for example, animal herds or moving targets whose positions may swarm and change over time, the optimization is a typical maximal flow problem (or max-flow problem). The optimization is a type of network flow problem in which the goal is to determine the maximum amount of flow that can occur over an arc that is limited by some capacity restriction. This type of network might be used to model the flow of oil in a pipeline (in which the amount of oil that can flow through a pipe in a unit of time is limited by the diameter of the pipe). Traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by the number of lanes in the streets and speed limits [28].

For our spatial clustering, we consider each cell of the grid as a node; each node is defined as a tuple m that contains the coordinates and the value of the resource held in the node, such that m(x_i, y_i, z_i) represents the i-th node, in which (x_i, y_i) represents the position and z_i represents the value of the resource in the node, respectively. For the clusters, each node


(1)  Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2)  Repeat (through all coordinates of x)
(3)    Repeat (through all coordinates of y)
(4)      If (boundary constraints and overlapping constraints are satisfied) Then
(5)        S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)      End-if
(7)    End-loop
(8)  End-loop
(9)  If sizeof(S) >= K
(10)   Find top K clusters where max Σ z_i ⊕ C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i) ∀i ∈ C_k
(11) Else-if
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i) ∀i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.

can potentially be the center of a cluster, and each cluster has a fixed radius of length r. The LP model for our problem is mathematically shown as follows:

$$\text{Total value} = \bigcup_{\text{selected clusters } \langle C_k \mid k = 1,\dots,K \rangle}\;\sum_{m_i \in C_k} m_i(*, *, z_i) = \operatorname*{arg\,max}_{X,\,Y} \sum_{\substack{0 \le x_i \le X \\ 0 \le y_j \le Y}}\;\sum_{k=1}^{K} z_l \ni m_l(x_i, y_j, z) \oplus c_k \tag{3}$$

subject to the boundary constraints 2r ≤ |x_i − x_j| and 2r ≤ |y_i − y_j| for all i and j with i ≠ j, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively; K is the maximum number of clusters, and c_k is the k-th cluster under consideration in the optimization.

In order to implement the computation depicted in (3), for each node we sum the group's resources in the shape of a diamond (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each current node in the combination is tested by considering it as the center of a cluster of radius r, storing the resource values of the nodes from the potential clusters into a temporary array buffer A(∗, ∗, z_i). The results from those potential clusters that satisfy the boundary and nonoverlapping constraints are then copied to a candidate buffer S. Out of the clusters whose resource values are stored in the candidate buffer S, the combination of K clusters that has the greatest total resource value is selected, and their values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
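Under the stated assumptions (diamond-shaped clusters of radius r, the pairwise center-separation constraint above, and exhaustive search over center combinations, which is only feasible for small grids), Pseudocode 1 might be realized as the following sketch; all names are illustrative.

```python
import itertools
import numpy as np

def diamond_value(z, cx, cy, r):
    """Sum of resource values inside the diamond (Manhattan ball) at (cx, cy)."""
    h, w = z.shape
    return sum(z[x, y]
               for x in range(max(0, cx - r), min(h, cx + r + 1))
               for y in range(max(0, cy - r), min(w, cy + r + 1))
               if abs(x - cx) + abs(y - cy) <= r)

def best_k_clusters(z, K, r):
    h, w = z.shape
    # precompute the diamond sum for every candidate center
    val = {(x, y): diamond_value(z, x, y, r) for x in range(h) for y in range(w)}
    best, best_val = None, -1.0
    for combo in itertools.combinations(val, K):
        # the paper's separation constraint: 2r <= |dx| and 2r <= |dy| per pair
        if any(abs(a[0] - b[0]) < 2 * r or abs(a[1] - b[1]) < 2 * r
               for a, b in itertools.combinations(combo, 2)):
            continue
        v = sum(val[c] for c in combo)
        if v > best_val:
            best, best_val = combo, v
    return best, best_val

centers, value = best_k_clusters(np.random.rand(10, 10), K=3, r=2)
```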

5. Experimental Results and Analysis

In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place. The resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point

Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                      Bwmorph function        Thinning algorithm
                      Dataset 1   Dataset 2   Dataset 1   Dataset 2
Degree of thinning    Incomplete              Complete
Elapsed time (secs)   20          38          100         198
Complexity            O(n)                    O(n²)

of the roads; thereby, a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.

5.1. Data Preprocessing. Two different factual datasets are used for the experiments. The first dataset, published by the Maricopa Association of Governments in 2008, is a traffic volume map. The traffic volumes were derived from the national traffic recording devices; seasonal variation is factored into the volumes. The second dataset is the annual average daily traffic of the Baltimore County Traffic Volume Map in 2011 in the USA, prepared by the Maryland Department of Transportation and published on March 19, 2012. The traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.

After applying skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites where they can see the traffic volume data that are associated with our two datasets: (a) representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (httpphoenixgovstreetstrafficvolume-map); (b) representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (httpwwwmarylandroadscomTraffic Volume MapsTraffic Volume Mapspdf). The corresponding result of skeleton extraction


Table 2: Important statistics from the clustering and LP experiments.

Method  Cluster    Cells covered  Minimum  Maximum   Overlap
KM      Cluster 1  428            0        3499327   0
KM      Cluster 2  468            0        546896    0
KM      Cluster 3  448            0        20503007  0
KM      Cluster 4  614            0        6894667   0
KM      Cluster 5  618            0        900908    0
XM      Cluster 1  615            0        591265    0
XM      Cluster 2  457            0        546896    0
XM      Cluster 3  609            0        900908    0
XM      Cluster 4  465            0        3499327   0
XM      Cluster 5  430            0        20503007  0
EM      Cluster 1  1223           0        2292      61817229
EM      Cluster 2  7              141048   243705    313018
EM      Cluster 3  81             0        3033733   131146577
EM      Cluster 4  64             26752    546896    330881249
EM      Cluster 5  1201           0        1300026   217950471
DBScan  Cluster 1  13             23614    33146     327222911
DBScan  Cluster 2  11             1686825  21001     363965818
DBScan  Cluster 3  13             178888   2945283   196118393
DBScan  Cluster 4  11             847733   211008    58940877
DBScan  Cluster 5  2528           0        546896    20554176
HC      Cluster 1  291            0        3499327   0
HC      Cluster 2  191            0        20503007  96762283
HC      Cluster 3  294            0        1590971   0
HC      Cluster 4  224            0        189812    12673555
HC      Cluster 5  243            0        546896    0
LP      Cluster 1  221            0        3499327   0
LP      Cluster 2  221            0        20503007  0
LP      Cluster 3  221            0        1590971   0
LP      Cluster 4  221            0        189812    0
LP      Cluster 5  221            0        546896    0

Table 3: Comparison of running time (in seconds) for the first dataset.

Formats             KM    HC     DBScan  XM    EM    LP
Vector database     3.27  12.52  23.24   2.78  9.30  1.83
Raster database     3.42  15.36  28.20   2.84  9.84  2.01
RasterP (16 grids)  1.98  1.34   5.08    0.46  0.57  0.78
RasterP (25 grids)  0.09  0.14   1.15    0.21  0.12  0.53

for dataset 1 is shown in Figure 8, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm, respectively. Likewise, the corresponding result of skeleton extraction for the second dataset is shown in Figure 9, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm, respectively. The comparison result for the two datasets is shown in Table 1.

For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.

The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing. The clustering by grid can then be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.
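The skeleton-extraction step can be approximated with scikit-image's skeletonize, a stand-in for MATLAB's Bwmorph and the custom thinning routine used in the paper; the binary input below is a placeholder, not the actual map.

```python
import numpy as np
from skimage.morphology import skeletonize

# Reduce a binary (two-tone) road image to a one-pixel-wide skeleton.
binary_map = np.random.rand(200, 300) > 0.6   # placeholder for the road image
skeleton = skeletonize(binary_map)            # boolean array of skeleton pixels
```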

The choice of placing a grid on the image follows one principle: the mesh segmentation should not fall on a concentrated position of traffic flow. Since there is no endpoint, the midpoint between two adjacent values is considered a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. This digital data for the traffic map serves as the initial data for the subsequent clustering process.

5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,


Table 4: Comparison of log-likelihood for the first dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     −12.41868  −14.07265  −13.28599  −11.95330  −12.49562
Raster database     −13.42238  −15.02863  −13.78889  −12.96320  −13.39769
RasterP (16 grids)  −12.62264  −14.02266  −12.48583  −12.39419  −12.44993
RasterP (25 grids)  −12.41868  −13.19417  −11.22207  −12.48201  −11.62048


Figure 8: (a) Result of skeleton extraction for dataset 1 using the Bwmorph function. (b) Result of skeleton extraction for dataset 1 using the thinning algorithm.


Figure 9: (a) Result of skeleton extraction for dataset 2 using the Bwmorph function. (b) Result of skeleton extraction for dataset 2 using the thinning algorithm.

Table 5: Comparison of running time (in seconds) for the second dataset.

Formats             KM    HC     DBScan  XM    EM     LP
Vector database     1.39  1.34   15.53   1.53  10.05  3.37
Raster database     2.41  14.78  18.34   2.17  8.23   1.96
RasterP (16 grids)  0.47  8.01   12.74   0.45  3.77   1.44
RasterP (25 grids)  0.35  6.20   10.98   0.36  2.96   1.18

the weights of the three attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)) could be varied (fine-tuned), with the constraint that the weights sum to 1. We tested several variations in search of the best clustering results: (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) weight of v is 0; (8) same weights except when g_i(v_i = 0); and (9) weights of x and y are both 0 except when g_i(v_i = 0). A sketch of this weighting scheme follows below.
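The following sketch assumes scikit-learn as a stand-in for XLMiner: scaling the normalized columns by the square roots of the weights makes squared Euclidean distance weight each attribute proportionally.

```python
import numpy as np
from sklearn.cluster import KMeans

# Weighted K-means via feature scaling: with columns scaled by sqrt(w),
# squared Euclidean distance becomes sum_j w_j * (x_j - y_j)^2.
g = np.random.rand(500, 3)                    # normalized columns: x, y, v
w = np.array([0.25, 0.25, 0.50])              # e.g. case (3): weight of v is 50%
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(g * np.sqrt(w))
```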

In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted to measure raw numeric data, while the other two options, Jaccard's coefficient and the matching coefficient, are activated only when the data are binary.

For the above nine cases, the results of cases (1) to (6) are similar within each method, and the result of (9) is the worst, accomplishing no clustering at all. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.

In the distribution of clusters resulting from the KM clustering method, more than half of the data points are clamped into one oversized cluster. The result of this method is therefore not helpful for further operation. With the HC method, the data are on average allocated into separate clusters. The result


Table 6: Comparison of log-likelihood for the second dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     −17.35412  −19.62367  −17.53576  −17.21513  −16.57263
Raster database     −18.15926  −20.12568  −19.70756  −18.15791  −18.48209
RasterP (16 grids)  −15.51437  −17.24736  −16.37147  −17.01283  −15.66231
RasterP (25 grids)  −14.84761  −16.63789  −15.09146  −16.67312  −16.47823


Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%; the top half uses the KM clustering method and the bottom half uses the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%; top half KM, bottom half HC. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0; top half KM, bottom half HC. (d) Clustering results for the first dataset with setting case (8), where all attributes share the same weight except g_i(v_i = 0); top half KM, bottom half HC.

Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size      KM     HC      DBScan  XM    EM      LP
100 grid cells    0.06   0.07    1.05    2.19  3.21    0.18
4600 grid cells   0.42   2.95    39.89   2.73  19.05   9.37
10000 grid cells  2.62   46.67   97.55   2.97  37.85   24.21
80000 grid cells  19.75  189.61  684     6.47  198.31  90.83

in Figure 10(c) is the best, being the only one with distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; therefore, allocating a critical resource to each cluster, for example, may result in a waste of resources. The degree of overlap is least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.

The clustering results of the second dataset obtained by the two methods, KM and HC, are shown in Figure 11.

From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than for the first dataset. There is no overlap phenomenon in the KM results, which is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two clustering methods in consideration of even cluster distribution and overlap avoidance.


that the second dataset has an even data distribution that issuitable for achieving spatial groups with high density Andin terms of total density EM is the best performer in the firstdataset but DBScan achieves the best results in the seconddataset DBScan has an advantage of merging scattered datainto density groups as long as the data are well scattered

International Journal of Distributed Sensor Networks 17

17

18

17

24

24

Balance test on dataset 2

(a) KM

24

18

24

18

17

Balance test on dataset 2

(b) XM

47

032

47

Balance test on dataset 2

(c) EM

1010

98

Balance test on dataset 2

(d) DBScan

23

15

24

18

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(e) HC

20

20

20

20

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(f) LP

Figure 18 Proportions of Cluster Sizes (Balance) of dataset 2 in by using (a) KM (b) XM (c) EM (d) DBScan (e) HC (f) LP

The last evaluation factor is balance the results areshown in Figures 17 and 18 For both datasets only LPmethod can achieve absolute balance for spatial groups com-pletely

63 Discussion of G119899119890119905 For all the six evaluation factors each

of them can be an individual measure to decide whethera method is good or not in certain aspect In general thefollowing indicators (from (5) to (11)) have been defined in

18 International Journal of Distributed Sensor Networks

Table 10 Numeric results of density of each cluster by using the six methods for dataset 1

Density KM EM DBScan XM HC LPCluster 0 5258648 0080823 4426289 3431892 2713810 1677869Cluster 1 1161390 2329182 0994949 1375497 3501739 1296230Cluster 2 7186556 2545750 0807500 1218667 2728017 9703279Cluster 3 2572683 1232386 1062069 5171040 4265905 9034426Cluster 4 5969350 142054 0170455 1510576 4088438 1239180Total density 1204343 1400359 4729787 1146972 1030703 6087049

Table 11 Numeric results of density of each cluster by using the six methods for dataset 2

Density KM XM EM DBScan HC LPCluster 0 1925445 2476642081 396813638 1972394643 5323785326 331318Cluster 1 1972395 1763496208 1502698729 1972394643 2140482869 166788Cluster 2 1408149 106489095 1629795665 1437189548 1823821619 8097989Cluster 3 3060449 6293956697 2015105986 1636350955 79912225 2474492Cluster 4 1773937 1058346213 1275299493 1212317249 6856982634 156958Total density 3896873 3486653421 6819713511 8230647036 5981503534 5440447

order to evaluate which method is an appropriate choicewhen it comes to different datasets and different usersrsquorequirements Among them the difference in balance iscontributed by the difference of grid cell number in eachcluster Meanwhile we assign each of them a proportionalweight 120596 to adjust the evaluation result 119866net The 120596 value isto be tuned by the users depending on their interests Forexample if a verywide coverage is of priority and others are ofless concern 119866

119888can take a relatively very large value or even

1 If users consider that some attributes are more importantthe corresponding weights 120596 for some factors can be largerthan the others Overall 119866net which is the sum of all factorsmultiplied by the corresponding performance indicators is anet indicator signifying how good a clustering process is byconsidering all the performance attributes

119866119897=

10038161003816100381610038161003816100381610038161003816

LikelihoodTime

10038161003816100381610038161003816100381610038161003816

(5)

119866119887=Difference of Balance

Time (6)

119866119889=DensityTime

(7)

119866119888=CoverageTime

(8)

119866119900=OverlapTime

(9)

119866net = 120596119897119866119897+ 120596119889119866119887+ 120596119889lowast 119866119889+ 120596119888119866119888+ 120596119900119866119900 (10)

Constraint 120596119897+ 120596119889+ 120596119887+ 120596119888+ 120596119900= 1 (11)

From the results of spatial grouping as experimented inthe previous sections we obtain some statistic informationon each group based on the second dataset as a range ofindicators depicted from (5) to (11) They are shown in

Table 12 which allows us to easily compare various methodsand performance aspects

In Table 12 KM method has the best run time and nooverlap For XMmethod DBScan and HC demonstrate theiradvantage in density and log-likelihood Nevertheless LPmethod is superior in three aspects coverage no overlapand zero difference of balance with other clusters In orderto further verify the correctness of the above analysis theperformance indicators 119866

119897 119866119887 119866119889 119866119888 and 119866

119900are computed

for obtaining the net performance values119866net assuming equalweights for each method For the sake of easy comparison119866net is normalized by first setting the lowest 119866net amongthe six methods as base value 1 then the 119866net for the othermethods is scaled up accordingly The comparison result isshown in Table 13

According to the experiment results conducted so farLP seems to be the best candidate in almost all the aspectssuch as coverage and balance This is tested across differentdatasets different formats and different sizes of datasetHowever for density and log-likelihood the result is not soconsistent as LP would be outperformed byDBScan at timesFinally by the net result of 119866net LP is a better choice underthe overall consideration of the six performance factorsThe choice of weights which imply priorities or preferenceson the performance aspects should be chosen by the userrsquosdiscretion

7 Conclusion and Future Works

Ubiquitous sensor network generated data that inherentlyhave spatial information When they are viewed afar thelocalizations of the data form some densities spatially dis-tributed over a terrain and the collected data from thesensors indicate how important the values are in their localproximity Given this information the users of the sensornetwork may subsequently want to form spatial clusters for

International Journal of Distributed Sensor Networks 19

Table 12 Performance indicators of the six methods based on dataset 2

Method Coverage Density Time Log-likelihood Overlap Diff of balanceKM 0595751 3896873 041 minus1735 No 190XM 0533037 3486653 067 minus1722 No 185EM 0507794 6819714 123 minus1657 Yes 1216DBScan 0461531 8230647 1567 minus1754 Yes 2517HC 0677124 5981504 1478 minus2013 Yes 103LP 0711025 5440447 776 NA No 0

Table 13 Comparison of different clustering and LP methods by119866net indicator

Methods KM XM EM DBScan HC LP119866net 108 115 111 123 100 132

purposes such as resource allocation distribution evalua-tions or summing up the geographical data into groups Thefocus of this study was to design efficient methods to identifysuch optimal spatial groups that have certain sizes andpositions using clustering algorithms or the equivalent forobtaining maximum total coverage in total Some examplesinclude but are not limited to setting up mobile phonebase stations among an even distribution of mobile phoneusers each may have different demand in usage distributedsensors that monitor the traffic volumes over a city andsecurity patrols in an exhibition where the asset values tobe protected vary and are distributed over a large area Thestudy also investigated whether spatial groups identified byusing different methods are sufficiently efficient for achievingoptimal maximum coverage Five classic spatial groupingalgorithms are discussed and compared in this study by usingdata mining software programsThe identified spatial groupswith different values of data resources were then assessedvia six performance factors Weights were also formulated asfactor coefficients The factors adopted were shown to playa significant role in MAUT (multiattribute utilities theory)The performance under proper factors and weights may varyas the factors could be arbitrarily chosen by users

The spatial groups obtained by classic clustering algo-rithms have some limits such as overlaps It may causeresource being wasted and even false grouping Howeverthere has been no study reported in the literature that theauthors are aware of using linear programming (LP) methodto discover spatial groups and to overcome this limit ofoverlappingThus in this research we implemented this newmethod (LP) to obtain spatial groups for yielding maximumcoverage and completely avoiding overlap A rigorous evalu-ation was used to assess the grouping results by consideringmultiple attributes

For future extended study we want to further enhancethe algorithm such as combining LP method with existingspatial group algorithms to achieve new hybrid algorithmSome clustering algorithms (eg KM) are known to convergequickly and LP though not the quickest it is efficient infinding the optimal groupings without any overlap It will be

good if the advantages from one algorithm to ride over theothers in the new fusion algorithms are to be developed

References

[1] G J Pottie and W J Kaiser ldquoWireless integrated network sen-sorsrdquo Communications of the ACM vol 43 no 5 pp 51ndash582000

[2] K H Eom M C Kim S J Lee and C W Lee ldquoThe vegetablefreshness monitoring system using RFID with oxygen andcarbon dioxide sensorrdquo International Journal of DistributedSensor Networks vol 2012 Article ID 472986 6 pages 2012

[3] G Manes G Collodi R Fusco L Gelpi and A Manes ldquoAwireless sensor network for precise volatile organic compoundmonitoringrdquo International Journal of Distributed Sensor Net-works vol 2012 Article ID 820716 13 pages 2012

[4] Y-G Ha H Kim and Y-C Byun ldquoEnergy-efficient fire mon-itoring over cluster-based wireless sensor networksrdquo Interna-tional Journal of Distributed Sensor Networks vol 2012 ArticleID 460754 11 pages 2012

[5] A Wahid and D Kim ldquoAn energy efficient localization-freerouting protocol for underwater wireless sensor networksrdquoInternational Journal of Distributed Sensor Networks vol 2012Article ID 307246 11 pages 2012

[6] T N Tran R Wehrens and L M C Buydens ldquoSpaRef a clus-tering algorithm for multispectral imagesrdquo Analytica Chimi-ca Acta vol 490 no 1-2 pp 303ndash312 2003

[7] G Ayala I Epifanio A Simo and V Zapater ldquoClusteringof spatial point patternsrdquo Computational Statistics and DataAnalysis vol 50 no 4 pp 1016ndash1032 2006

[8] J Domingo G Ayala and M E Dıaz ldquoMorphometric analysisof human corneal endothelium by means of spatial point pat-ternsrdquo International Journal of Pattern Recognition and ArtificialIntelligence vol 16 no 2 pp 127ndash143 2002

[9] E Demir C Aykanat and B Barla Cambazoglu ldquoClusteringspatial networks for aggregate query processing a hypergraphapproachrdquo Information Systems vol 33 no 1 pp 1ndash17 2008

[10] T Hu and S Y Sung ldquoA hybrid EM approach to spatial clus-teringrdquo Computational Statistics and Data Analysis vol 50 no5 pp 1188ndash1205 2006

[11] G Lin ldquoComparing spatial clustering tests based on rare tocommon spatial eventsrdquo Computers Environment and UrbanSystems vol 28 no 6 pp 691ndash699 2004

[12] M Ester and H-P Kriegel ldquoClustering for mining in largespatial databases [Special Issue on Data Mining]rdquo KI-Journalvol 1 pp 332ndash338 1998

[13] J Han M Kamber and A K H Tung ldquoSpatial clusteringmethods in data mining a surveyrdquo Tech Rep ComputerScience Simon Fraster University 2000

20 International Journal of Distributed Sensor Networks

[14] H-D Yang and F-Q Deng ldquoThe study on immune spatialclustering model based on obstaclerdquo in Proceedings of theInternational Conference on Machine Learning and Cyberneticsvol 2 pp 1214ndash1219 August 2004

[15] T-S Chen T-H Tsai Y-T Chen et al ldquoA combined K-meansand hierarchical clusteringmethod for improving the clusteringefficiency of microarrayrdquo in Proceedings of the InternationalSymposium on Intelligent Signal Processing and CommunicationSystems (ISPACS rsquo05) pp 405ndash408 HongKong China Decem-ber 2005

[16] M Srinivas and C K Mohan ldquoEfficient clustering approachusing incremental and hierarchical clustering methodsrdquo inProceedings of the International Joint Conference on NeuralNetworks (IJCNN rsquo10) pp 1ndash7 July 2010

[17] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[18] A Hinneburg and D A Keim ldquoAn efficient approach to clus-tering in large multimedia databases with noiserdquo in Proceedingsof the International Conference Knowledge Discovery and DataMining pp 58ndash65 1998

[19] K Elangovan GIS Fundamentals Applications and Implemen-tations 2006

[20] S Chawla and S Shekhar ldquoModeling spatial dependencies formining geospatial data an introductionrdquo Geographic DataMining and Knowledge Discovery vol 75 no 6 pp 112ndash1201999

[21] M-Y Cheng and G-L Chang ldquoAutomating utility route designand planning throughGISrdquoAutomation in Construction vol 10no 4 pp 507ndash516 2001

[22] Q Cao B Bouqata P D Mackenzie D Messier and J J SalvoldquoA grid-based clusteringmethod formining frequent trips fromlarge-scale event-based telematics datasetsrdquo in Proceedingsof the IEEE International Conference on Systems Man andCybernetics (SMC rsquo09) pp 2996ndash3001 San Antonio Tex USAOctober 2009

[23] K Krishna and M N Murty ldquoGenetic K-means algorithmrdquoIEEE Transactions on Systems Man and Cybernetics B vol 29no 3 pp 433ndash439 1999

[24] D Pelleg and A W Moore ldquoX-means extending KM withefficient estimation of the number of clustersrdquo in Proceedingsof the 70th International Conference on Machine Learning pp727ndash734 2000

[25] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining pp 226ndash231 1996

[26] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963

[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

International Journal of Distributed Sensor Networks 7

(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)   Repeat (through all coordinates of y)
(4)     If (boundary constraints and overlapping constraints are satisfied) Then
(5)       S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)     End-if
(7)   End-loop
(8) End-loop
(9) If size-of(S) >= K
(10)   Find the top K clusters where max Σ z_i ⊕ C_k; copy S(x_i, y_i, z_i) to a new array C(x_i, y_i, z_i) for all i ∈ C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i) for all i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.

can potentially be a center of a cluster, and each cluster has a fixed radius of length r. The LP model for our problem is expressed mathematically as follows:

$$\text{Total value}=\bigcup_{\text{selected clusters }\langle C_k \mid k=1,\dots,K\rangle}\ \sum_{m_i\in C_k} m_i(\ast,\ast,z_i)=\arg\max_{X,Y}\sum_{\substack{0\le x_i\le X\\ 0\le y_j\le Y}}\ \sum_{k=1}^{K} z_l \ni m_l(x_i,y_j,z)\oplus c_k \tag{3}$$

subject to the boundary constraints $2r\le|x_i-x_j|$ and $2r\le|y_i-y_j|$ for all $i$ and $j$, $i\ne j$, where $X$ is the maximum width and $Y$ is the maximum length of the 2D spatial area, respectively, $K$ is the maximum number of clusters, and $c_k$ is the $k$th cluster under consideration in the optimization.

In order to implement the computation depicted in (3), for each node we sum the resources of each group within a diamond shape (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each node in the current combination is tested by considering it as the center of a cluster of radius r, storing the resource values of the nodes from the potential clusters into a temporary array buffer $A(\ast,\ast,z_i)$. The results from those potential clusters that satisfy the boundary and nonoverlapping constraints are then copied to a candidate buffer. Out of the clusters whose resource values are stored in the candidate buffer S, the combination of K clusters that has the greatest total resource value is selected, and its values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
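To make the enumeration concrete, below is a minimal brute-force sketch in Python of the search that Pseudocode 1 and (3) describe. It is an illustrative reimplementation, not the solver used in the paper: the helper names (diamond_sum, best_k_groups), the toy grid, and the use of Manhattan distance between centers as the non-overlap test (which is exact for diamond-shaped groups) are all assumptions.

```python
import itertools
import numpy as np

def diamond_sum(vol, cx, cy, r):
    """Sum of cell values within Manhattan distance r of (cx, cy):
    the diamond that approximates a circle of radius r."""
    total = 0.0
    for dx in range(-r, r + 1):
        for dy in range(-r + abs(dx), r - abs(dx) + 1):
            x, y = cx + dx, cy + dy
            if 0 <= x < vol.shape[0] and 0 <= y < vol.shape[1]:
                total += vol[x, y]
    return total

def best_k_groups(vol, k, r):
    """Exhaustively pick k diamond groups of radius r maximizing the total
    covered value, rejecting any combination whose diamonds share a cell."""
    centers = [(x, y) for x in range((vol.shape[0])) for y in range(vol.shape[1])]
    scores = {c: diamond_sum(vol, *c, r) for c in centers}
    best, best_val = None, -1.0
    for combo in itertools.combinations(centers, k):
        # two diamonds overlap iff the Manhattan distance between centers <= 2r
        if any(abs(a[0] - b[0]) + abs(a[1] - b[1]) < 2 * r + 1
               for a, b in itertools.combinations(combo, 2)):
            continue
        val = sum(scores[c] for c in combo)
        if val > best_val:
            best, best_val = combo, val
    return best, best_val

rng = np.random.default_rng(0)
grid = rng.random((8, 8))              # toy traffic-volume grid
print(best_k_groups(grid, k=3, r=2))   # centers of the 3 best groups and their value
```

The exhaustive scan over combinations grows combinatorially with K and the grid size, which is consistent with the LP running times observed in the experiments below sitting between the fastest and slowest clustering methods.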

5. Experimental Results and Analysis

In this section, the performance of the proposed methodology is evaluated by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place; the resource value in this case is the concentration, or density, of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point of the roads, whereby a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.

Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                       Bwmorph function          Thinning algorithm
                       Dataset 1   Dataset 2     Dataset 1   Dataset 2
Degree of thinning     Incomplete                Complete
Elapsed time (secs)    20          38            100         198
Complexity             O(n)                      O(n^2)

5.1. Data Preprocessing. Two real-world datasets are used in the experiments. The first dataset is a traffic volume map published by the Maricopa Association of Governments in 2008; the traffic volumes were derived from the national traffic recording devices, with seasonal variation factored into the volumes. The second dataset is the annual average daily traffic of the Baltimore County Traffic Volume Map in 2011, USA, prepared by the Maryland Department of Transportation and published on March 19, 2012; its traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.

After applying skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites where they can see the traffic volume data associated with our two datasets: (a) the representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (http://phoenix.gov/streets/trafficvolumemap); (b) the representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (http://www.marylandroads.com/Traffic_Volume_Maps/Traffic_Volume_Maps.pdf).

Table 2: Important statistics from the clustering and LP experiments.

Method   Cluster     Number of cells covered   Minimum   Maximum    Overlap
KM       Cluster 1   428                       0         3499327    0
         Cluster 2   468                       0         546896     0
         Cluster 3   448                       0         20503007   0
         Cluster 4   614                       0         6894667    0
         Cluster 5   618                       0         900908     0
XM       Cluster 1   615                       0         591265     0
         Cluster 2   457                       0         546896     0
         Cluster 3   609                       0         900908     0
         Cluster 4   465                       0         3499327    0
         Cluster 5   430                       0         20503007   0
EM       Cluster 1   1223                      0         2292       61817229
         Cluster 2   7                         141048    243705     313018
         Cluster 3   81                        0         3033733    131146577
         Cluster 4   64                        26752     546896     330881249
         Cluster 5   1201                      0         1300026    217950471
DBScan   Cluster 1   13                        23614     33146      327222911
         Cluster 2   11                        1686825   21001      363965818
         Cluster 3   13                        178888    2945283    196118393
         Cluster 4   11                        847733    211008     58940877
         Cluster 5   2528                      0         546896     20554176
HC       Cluster 1   291                       0         3499327    0
         Cluster 2   191                       0         20503007   96762283
         Cluster 3   294                       0         1590971    0
         Cluster 4   224                       0         189812     12673555
         Cluster 5   243                       0         546896     0
LP       Cluster 1   221                       0         3499327    0
         Cluster 2   221                       0         20503007   0
         Cluster 3   221                       0         1590971    0
         Cluster 4   221                       0         189812     0
         Cluster 5   221                       0         546896     0

Table 3: Comparison of running time (in seconds) for the first dataset.

Formats              KM     HC      DBScan   XM     EM     LP
Vector database      3.27   12.52   23.24    2.78   9.30   1.83
Raster database      3.42   15.36   28.20    2.84   9.84   2.01
RasterP (16 grids)   1.98   1.34    5.08     0.46   0.57   0.78
RasterP (25 grids)   0.09   0.14    1.15     0.21   0.12   0.53

The corresponding result of skeleton extraction for dataset 1 is shown in Figure 8, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm. Likewise, the corresponding result of skeleton extraction for the second dataset is shown in Figure 9, using the same two methods. The comparison of the two approaches over the two datasets is given in Table 1.

For each raw dataset, we first perform this image preprocessing over it to obtain a numerical database.

The results of the skeleton extraction shown in Figures 8(b) and 9(b) are clearer and more useful for the subsequent processing, and the clustering by grid can be readily obtained from the preprocessed images. The thinning algorithm thins the image more completely than the Bwmorph function in MATLAB, but its elapsed time is longer due to a two-layer nested iteration in the program code.

The choice of placing a grid on the image follows one principle: a mesh boundary should not fall on a concentrated position of traffic flow. Since there is no natural endpoint, the midpoint between two adjacent values was taken as a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file; this digital data for the traffic map serves as the initial data for the subsequent clustering process.
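For readers who want to reproduce this preprocessing, here is a minimal Python sketch that uses scikit-image's skeletonize in place of the MATLAB Bwmorph/thinning step. The function name digitize_traffic_map, the mean-threshold rule for the two-tone conversion, and the assumption that per-pixel traffic volumes are available as an array aligned with the image are illustrative, not the paper's exact toolchain.

```python
import numpy as np
from skimage.morphology import skeletonize  # stand-in for MATLAB's bwmorph thinning

def digitize_traffic_map(gray, volumes, cell):
    """Threshold a grayscale road map to two tones, thin it to a one-pixel
    skeleton, then sum the per-pixel traffic volumes in each cell x cell square."""
    binary = gray < gray.mean()       # roads assumed darker than the background
    skel = skeletonize(binary)
    rows, cols = skel.shape[0] // cell, skel.shape[1] // cell
    grid = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = skel[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            vols = volumes[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            grid[i, j] = vols[block].sum()  # volume on skeleton pixels only
    return grid

# Toy usage with a synthetic 100 x 100 map; real inputs come from the scanned maps.
rng = np.random.default_rng(0)
gray = rng.random((100, 100))
volumes = rng.random((100, 100)) * 1000
grid = digitize_traffic_map(gray, volumes, cell=10)
```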

5.2. Comparison Results of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum number of iterations at 100. The initial centroids were chosen randomly at the start.

Table 4: Comparison of log-likelihood for the first dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -12.41868   -14.07265   -13.28599   -11.9533    -12.49562
Raster database      -13.42238   -15.02863   -13.78889   -12.9632    -13.39769
RasterP (16 grids)   -12.62264   -14.02266   -12.48583   -12.39419   -12.44993
RasterP (25 grids)   -12.41868   -13.19417   -11.22207   -12.48201   -11.62048

Figure 8: Result of skeleton extraction for dataset 1 using (a) the Bwmorph function and (b) the thinning algorithm.

Figure 9: Result of skeleton extraction for dataset 2 using (a) the Bwmorph function and (b) the thinning algorithm.

Table 5: Comparison of running time (in seconds) for the second dataset.

Formats              KM     HC      DBScan   XM     EM      LP
Vector database      1.39   1.34    15.53    1.53   10.05   3.37
Raster database      2.41   14.78   18.34    2.17   8.23    1.96
RasterP (16 grids)   0.47   8.01    12.74    0.45   3.77    1.44
RasterP (25 grids)   0.35   6.20    10.98    0.36   2.96    1.18

Furthermore, the weights for the three attributes (x, y, v) of each grid cell g_i = (x_i, y_i, v_i) could be varied (fine-tuned), under the constraint that the weights must sum to 1. We tested several variations in search of the best clustering results: (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) weight of v is 0; (8) equal weights, except for cells with g_i(v_i = 0); and (9) weights of x and y both 0, except for cells with g_i(v_i = 0). A sketch of how such attribute weighting can be realized is given below.
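Since the paper's runs were done in XLMiner, the following is only an analogous Python sketch of weighting the (x, y, v) attributes before K-means: each normalized column is scaled by its weight, so that distances, and hence the clustering, reflect the chosen emphasis. The helper name weighted_kmeans, the toy data, and the example weights are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def weighted_kmeans(cells, w_xy, w_v, k=5, iters=100, seed=0):
    """Cluster grid cells (x, y, v) with per-attribute weights. Columns are
    min-max normalized, then scaled, so a weight controls an attribute's
    influence on the Euclidean distance used by K-means."""
    X = np.asarray(cells, dtype=float)
    X = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)  # normalize to [0, 1]
    X *= np.array([w_xy, w_xy, w_v])                       # weight x, y, and v
    km = KMeans(n_clusters=k, max_iter=iters, n_init=10, random_state=seed)
    return km.fit_predict(X)

# Case (2) above: traffic volume v carries 40% of the weight, x and y 30% each.
rng = np.random.default_rng(1)
cells = np.column_stack([rng.integers(0, 50, 200),   # x coordinates
                         rng.integers(0, 50, 200),   # y coordinates
                         rng.random(200) * 100])     # v, toy traffic volumes
labels = weighted_kmeans(cells, w_xy=0.3, w_v=0.4)
```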

In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted for raw numeric data, while the other two options, Jaccard's coefficient and the matching coefficient, are activated only when the data are binary.

For the above nine cases, the results of cases (1) to (6) are similar within each method, and the result of case (9) is the worst, accomplishing no useful clustering at all. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.

In the cluster distribution produced by the KM clustering method, more than half of the data points are clamped into one oversized cluster, so the result of this method is not helpful for further operation. With the HC method, the data on average are allocated into separate clusters.

Table 6: Comparison of log-likelihood for the second dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -17.35412   -19.62367   -17.53576   -17.21513   -16.57263
Raster database      -18.15926   -20.12568   -19.70756   -18.15791   -18.48209
RasterP (16 grids)   -15.51437   -17.24736   -16.37147   -17.01283   -15.66231
RasterP (25 grids)   -14.84761   -16.63789   -15.09146   -16.67312   -16.47823

Figure 10: Clustering results for the first dataset: (a) setting case (2), where the weight of v is 40%; (b) setting case (3), where the weight of v is 50%; (c) setting case (7), where the weight of v is 0; (d) setting case (8), where all attributes share the same weight except cells with g_i(v_i = 0). In each part, the top half uses the KM clustering method and the bottom half uses the HC method.

Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size       KM      HC       DBScan   XM     EM       LP
100 grid cells     0.06    0.07     1.05     2.19   3.21     0.18
4600 grid cells    0.42    2.95     39.89    2.73   19.05    9.37
10000 grid cells   2.62    46.67    97.55    2.97   37.85    24.21
80000 grid cells   19.75   189.61   684      6.47   198.31   90.83

The result in Figure 10(c) is the best, being the only one with distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps, so allocating critical resources to each such cluster may result in a waste of resources. The degree of overlap is least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.

The clustering results of the second dataset obtained by the two methods, KM and HC, are shown in Figure 11.

From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than for the first dataset. There is also no overlap phenomenon in the KM results, a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as its clusters tend to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two for even cluster distribution and overlap avoidance.

5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is that there be no overlap in each of the resulting maps, and the number of clusters is fixed at five.

Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.029436   0.003786   0.017902   0.075178   0.013153   0.028985
Cluster 1        0.301538   0.269602   0.208078   0.049761   0.026016   0.377034
Cluster 2        0.215277   0.001627   0.158439   0.084049   0.124360   0.080099
Cluster 3        0.046788   0.096221   0.079177   0.209390   0.001172   0.217204
Cluster 4        0.002712   0.161799   0.044197   0.043152   0.304300   0.007704
Total coverage   0.595751   0.533036   0.507793   0.461531   0.469000   0.711025

Figure 11: Clustering results for the second dataset by using (a) the KM method and (b) the HC method.

The result for the first dataset is shown in Figure 12. The first part (i) of each panel of Figure 12 shows the spatial clustering result, and the second part (ii) visualizes the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This reveals that these techniques cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure, and the corresponding groups suffer from overlap too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters are similar to each other and there is no overlap in the clustering result; among the group results, however, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation; this situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and the corresponding groups, KM in (b) is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.

With the same experimental setup and operating environment, the spatial clustering experiments were performed over the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.

In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using (a) EM and (e) DBScan. The results of (b) KM and (d) XM avoid these shortcomings, though they still have slight overlaps. For (c) HC, we remove the empty cells at the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap and the clusters are balanced, but there is still overlap in the corresponding spatial groups. Thus the LP method is adopted to solve this problem while keeping the groups the same size; the LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).

By visually comparing the results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (the city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area; as seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart than those in the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information about dataset 2 is collected and shown in Table 2.

Figure 12: Spatial clustering on dataset 1 and the corresponding spatial groups by using (a) EM, (b) KM, (c) HC, (d) XM, and (e) DBScan. In each part, (i) shows the spatial clustering result and (ii) the spatial groups derived from it.

The numeric results in Table 2 support the qualitative analysis by visual inspection in the previous section. Taking the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the numbers of cells covered by their clusters, and the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same and the clusters are totally free from overlap.

6. Technical Analysis of Clustering Results

6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion. Balance measures the sizes of the groups: if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality; the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, called the log-likelihood, and a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of the traffic volumes covered by the grid cells within the cluster over the whole dataset, and the total coverage is the sum of the traffic volumes covered by all the clusters, minus the overlap, if any.

Figure 13: Spatial clustering on dataset 2 and the corresponding spatial groups by using (a) EM, (b) KM, (c) HC, (d) XM, and (e) DBScan; (f) spatial groups from the LP method. In each of (a)-(e), (i) shows the spatial clustering result and (ii) the spatial groups derived from it.

The corresponding definitions are shown in the equations below:

$$\text{Density}(\text{cluster } i)=\frac{\sum\text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)},$$

$$\text{Coverage}(\text{cluster } i)=\frac{\sum\text{Traffic Volumes}(\text{cluster } i)}{\sum\text{Traffic Volumes}},$$

$$\text{Total Coverage}=\sum_{i}\text{Coverage}(\text{cluster } i)-\text{Overlaps},$$

$$\text{Proportion of Cluster } i \text{ Size (Balance)}=\frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum\text{Grid Cell Number}}. \tag{4}$$
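As a concrete reading of (4), the sketch below computes the per-cluster density, coverage, and balance from a flat array of grid-cell traffic volumes and their cluster labels. The function name group_metrics is an assumption, and it presumes nonoverlapping clusters, so that the total coverage is simply the sum of the per-cluster coverages.

```python
import numpy as np

def group_metrics(volumes, labels):
    """Per-cluster density, coverage, and balance, following (4), for grid
    cells with traffic volumes `volumes` and cluster assignments `labels`."""
    volumes = np.asarray(volumes, dtype=float)
    labels = np.asarray(labels)
    total_v, total_n = volumes.sum(), len(volumes)
    out = {}
    for c in np.unique(labels):
        v = volumes[labels == c].sum()     # traffic volume inside cluster c
        n = int((labels == c).sum())       # grid cells inside cluster c
        out[c] = {"density": v / n,        # volume per grid cell in the cluster
                  "coverage": v / total_v, # share of all traffic volume covered
                  "balance": n / total_n}  # share of all grid cells
    return out

# Toy usage: 10 cells assigned to 2 clusters.
metrics = group_metrics([5, 3, 0, 8, 2, 7, 1, 4, 6, 9],
                        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
```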

6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we selected four different formats of the datasets on which to run the clustering algorithms. For the first dataset, Vector (n, v) represents a sequence number n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means that every 16-cell neighborhood over the grid is merged into a single unit; and RasterP (25 grids) means that every 25-cell neighborhood is merged into one. In the latter two formats, the data are laid directly on a grid, and noise such as outlier values is eliminated from the grid; we selected merged neighborhoods of sizes 16 and 25 for these two formats. The original datasets are then encoded in the four data formats, and the formatted data are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.

According to Table 3, KM spent the least running time on all four kinds of data, and the RasterP (25 grids) format was the fastest for it. Contrariwise, clustering the Vector dataset with the DBScan method took the longest running time. Among the clustering methods, KM spent the least time on every dataset and DBScan took the longest.

In Table 4 we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method gives the best value, while clustering RasterP (25 grids) using DBScan gives the worst.

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we also enlarged the dataset by expanding the data map via duplication. The resulting running-time trends are shown in Table 7, and the corresponding trend lines in Figure 14.

According to Table 5, KM spent the shortest running time for all four formats of data, and the RasterP (25 grids) format was again the fastest, which is expected because it abstracts every 25 cells into one.

Figure 14: Comparison of running time (in seconds) for different sizes of dataset, with exponential (Exp) trend lines fitted for K-means, Hierarchical, DBScan, XMean, EM, and LP.

On the other hand, clustering the Raster dataset with the DBScan method took the most running time. Across the six methods, KM spent the shortest time on the different datasets, and DBScan generally spent the longest.

In Table 6 we can see that the log-likelihood values of the six methods are again quite similar. Among them, clustering the Raster dataset using the HC method gives the best value, while clustering RasterP (25 grids) using KM gives the worst.

In Table 7 we can see that the slowest method is DBScan and the quickest is KM. In terms of time trends, DBScan's time consumption increases in far larger magnitude than the other methods, whereas the trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: once the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
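The crossover point can be estimated by fitting the same kind of exponential trend lines as in Figure 14. The sketch below does this in Python with the Table 7 timings (using the decimal placements reconstructed above, so the numbers are indicative only): a least-squares line through log(time) gives t ≈ a·exp(b·n), and equating the HC and EM fits yields the intersection.

```python
import numpy as np

sizes = np.array([100, 4600, 10000, 80000])      # grid cells, as in Table 7
t_hc = np.array([0.07, 2.95, 46.67, 189.61])     # HC times in seconds
t_em = np.array([3.21, 19.05, 37.85, 198.31])    # EM times in seconds

def exp_fit(n, t):
    """Fit t = a * exp(b * n) by linear regression on log(t),
    as spreadsheet Exp() trend lines do."""
    b, log_a = np.polyfit(n, np.log(t), 1)
    return np.exp(log_a), b

a_hc, b_hc = exp_fit(sizes, t_hc)
a_em, b_em = exp_fit(sizes, t_em)
# Trend lines cross where a_hc * exp(b_hc * n) = a_em * exp(b_em * n).
n_cross = np.log(a_em / a_hc) / (b_hc - b_em)
print(f"EM becomes cheaper than HC beyond roughly {n_cross:.0f} grid cells")
```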

The following charts and tables present the other technical indicators, namely, the coverage, density, and balance of each cluster, for the two datasets.

From Figure 15 we can see that, for the first dataset, one cluster of DBScan dominates the coverage among all the clusters produced by the six methods, whereas for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is apparently larger than in the first dataset (Tables 8 and 9). This means that the second dataset, thanks to its even data distribution, is well suited for forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.

Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.042721   0.001777   0.450720   0.022150   0.013153   0.165305
Cluster 1        0.094175   0.086211   0.008018   0.010064   0.026016   0.127705
Cluster 2        0.328026   0.032893   0.010517   0.126953   0.124360   0.095597
Cluster 3        0.022797   0.351221   0.000501   0.311761   0.001172   0.089008
Cluster 4        0.062281   0.101199   0.000244   0.112973   0.304300   0.122085
Total coverage   0.550000   0.573301   0.470000   0.583900   0.469000   0.599700

Figure 15: Coverage of each cluster by using the six methods for (a) dataset 1 and (b) dataset 2.

Figure 16: Density of each cluster by using the six methods for (a) dataset 1 and (b) dataset 2.

Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a) we can see that, for the first dataset, one cluster of EM has the biggest density among all the clusters of the six methods, whereas the LP method obtains the largest total density, spread evenly across all of its clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.

Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method can achieve completely absolute balance among the spatial groups.

6.3. Discussion of $G_{\text{net}}$. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in order to evaluate which method is an appropriate choice for different datasets and different users' requirements.

Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density         KM        EM        DBScan    XM        HC        LP
Cluster 0       5258648   0080823   4426289   3431892   2713810   1677869
Cluster 1       1161390   2329182   0994949   1375497   3501739   1296230
Cluster 2       7186556   2545750   0807500   1218667   2728017   9703279
Cluster 3       2572683   1232386   1062069   5171040   4265905   9034426
Cluster 4       5969350   142054    0170455   1510576   4088438   1239180
Total density   1204343   1400359   4729787   1146972   1030703   6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density         KM        XM           EM           DBScan       HC           LP
Cluster 0       1925445   2476642081   396813638    1972394643   5323785326   331318
Cluster 1       1972395   1763496208   1502698729   1972394643   2140482869   166788
Cluster 2       1408149   106489095    1629795665   1437189548   1823821619   8097989
Cluster 3       3060449   6293956697   2015105986   1636350955   79912225     2474492
Cluster 4       1773937   1058346213   1275299493   1212317249   6856982634   156958
Total density   3896873   3486653421   6819713511   8230647036   5981503534   5440447

Among them, the difference in balance is contributed by the difference in grid cell numbers between the clusters. Meanwhile, we assign each factor a proportional weight $\omega$ to adjust the evaluation result $G_{\text{net}}$; the $\omega$ values are to be tuned by the users depending on their interests. For example, if a very wide coverage is the priority and the other aspects are of less concern, $\omega_c$ can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights $\omega$ for those factors can be made larger than the others. Overall, $G_{\text{net}}$, which is the sum of all the weights multiplied by the corresponding performance indicators, is a net indicator signifying how good a clustering process is when all the performance attributes are considered:

$$G_l=\left|\frac{\text{Likelihood}}{\text{Time}}\right|, \tag{5}$$

$$G_b=\frac{\text{Difference of Balance}}{\text{Time}}, \tag{6}$$

$$G_d=\frac{\text{Density}}{\text{Time}}, \tag{7}$$

$$G_c=\frac{\text{Coverage}}{\text{Time}}, \tag{8}$$

$$G_o=\frac{\text{Overlap}}{\text{Time}}, \tag{9}$$

$$G_{\text{net}}=\omega_l G_l+\omega_b G_b+\omega_d G_d+\omega_c G_c+\omega_o G_o, \tag{10}$$

$$\text{subject to the constraint } \omega_l+\omega_b+\omega_d+\omega_c+\omega_o=1. \tag{11}$$

From the results of the spatial grouping experiments in the previous sections, we obtain statistical information on each group based on the second dataset, expressed through the indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods across the performance aspects.

In Table 12, the KM method has the best run time and no overlap, while XM, DBScan, and HC demonstrate their respective advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. In order to further verify this analysis, the performance indicators $G_l$, $G_b$, $G_d$, $G_c$, and $G_o$ are computed to obtain the net performance value $G_{\text{net}}$ of each method, assuming equal weights. For the sake of easy comparison, $G_{\text{net}}$ is normalized by first setting the lowest $G_{\text{net}}$ among the six methods as the base value 1; the $G_{\text{net}}$ of the other methods is then scaled accordingly. The comparison result is shown in Table 13.
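A small sketch of this scoring step is given below. The function name and the indicator values passed in are illustrative placeholders rather than the paper's exact inputs; in particular, overlap and difference of balance must first be quantified as numbers, and how that is done affects the scores.

```python
def g_net(likelihood, balance_diff, density, coverage, overlap, time, w):
    """Net score of (10) from the per-method indicators of (5)-(9).
    w = (wl, wb, wd, wc, wo) are the weights and must sum to 1, as in (11)."""
    wl, wb, wd, wc, wo = w
    return (wl * abs(likelihood / time) + wb * balance_diff / time +
            wd * density / time + wc * coverage / time + wo * overlap / time)

# Equal weights; the indicator tuples are illustrative stand-ins for Table 12 rows.
equal = (0.2,) * 5
raw = {"KM": g_net(-17.35, 19.0, 389.7, 0.596, 0.0, 0.41, equal),
       "LP": g_net(0.0, 0.0, 544.0, 0.711, 0.0, 7.76, equal)}
base = min(raw.values())                       # lowest score becomes the base value 1
normalized = {m: v / base for m, v in raw.items()}
```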

According to the experimental results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance, and this holds across different datasets, different formats, and different sizes of dataset. For density and log-likelihood, however, the result is not as consistent, as LP is outperformed by DBScan at times. Finally, by the net result of $G_{\text{net}}$, LP is the better choice under an overall consideration of the six performance factors. The weights, which imply priorities or preferences among the performance aspects, should be chosen at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups.

Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density   Time (s)   Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3896873   0.41       -17.35           No        190
XM       0.533037   3486653   0.67       -17.22           No        185
EM       0.507794   6819714   1.23       -16.57           Yes       1216
DBScan   0.461531   8230647   15.67      -17.54           Yes       2517
HC       0.677124   5981504   14.78      -20.13           Yes       103
LP       0.711025   5440447   7.76       N/A              No        0

Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32

The focus of this study was to design efficient methods for identifying optimal spatial groups of certain sizes and positions, using clustering algorithms or their equivalent, so as to obtain the maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users who may each have a different usage demand, distributed sensors that monitor the traffic volumes over a city, and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient for achieving the optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory); the performance under the chosen factors and weights may vary, as the factors can be arbitrarily chosen by users.

The spatial groups obtained by the classic clustering algorithms have some limitations, such as overlaps, which may cause resources to be wasted and even false grouping. To the best of the authors' knowledge, no study reported in the literature has used a linear programming (LP) method to discover spatial groups and overcome this limitation of overlapping. Thus, in this research we implemented this new LP method to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation considering multiple attributes was used to assess the grouping results.

For future extended study we want to further enhancethe algorithm such as combining LP method with existingspatial group algorithms to achieve new hybrid algorithmSome clustering algorithms (eg KM) are known to convergequickly and LP though not the quickest it is efficient infinding the optimal groupings without any overlap It will be

good if the advantages from one algorithm to ride over theothers in the new fusion algorithms are to be developed

References

[1] G J Pottie and W J Kaiser ldquoWireless integrated network sen-sorsrdquo Communications of the ACM vol 43 no 5 pp 51ndash582000

[2] K H Eom M C Kim S J Lee and C W Lee ldquoThe vegetablefreshness monitoring system using RFID with oxygen andcarbon dioxide sensorrdquo International Journal of DistributedSensor Networks vol 2012 Article ID 472986 6 pages 2012

[3] G Manes G Collodi R Fusco L Gelpi and A Manes ldquoAwireless sensor network for precise volatile organic compoundmonitoringrdquo International Journal of Distributed Sensor Net-works vol 2012 Article ID 820716 13 pages 2012

[4] Y-G Ha H Kim and Y-C Byun ldquoEnergy-efficient fire mon-itoring over cluster-based wireless sensor networksrdquo Interna-tional Journal of Distributed Sensor Networks vol 2012 ArticleID 460754 11 pages 2012

[5] A Wahid and D Kim ldquoAn energy efficient localization-freerouting protocol for underwater wireless sensor networksrdquoInternational Journal of Distributed Sensor Networks vol 2012Article ID 307246 11 pages 2012

[6] T N Tran R Wehrens and L M C Buydens ldquoSpaRef a clus-tering algorithm for multispectral imagesrdquo Analytica Chimi-ca Acta vol 490 no 1-2 pp 303ndash312 2003

[7] G Ayala I Epifanio A Simo and V Zapater ldquoClusteringof spatial point patternsrdquo Computational Statistics and DataAnalysis vol 50 no 4 pp 1016ndash1032 2006

[8] J Domingo G Ayala and M E Dıaz ldquoMorphometric analysisof human corneal endothelium by means of spatial point pat-ternsrdquo International Journal of Pattern Recognition and ArtificialIntelligence vol 16 no 2 pp 127ndash143 2002

[9] E Demir C Aykanat and B Barla Cambazoglu ldquoClusteringspatial networks for aggregate query processing a hypergraphapproachrdquo Information Systems vol 33 no 1 pp 1ndash17 2008

[10] T Hu and S Y Sung ldquoA hybrid EM approach to spatial clus-teringrdquo Computational Statistics and Data Analysis vol 50 no5 pp 1188ndash1205 2006

[11] G Lin ldquoComparing spatial clustering tests based on rare tocommon spatial eventsrdquo Computers Environment and UrbanSystems vol 28 no 6 pp 691ndash699 2004

[12] M Ester and H-P Kriegel ldquoClustering for mining in largespatial databases [Special Issue on Data Mining]rdquo KI-Journalvol 1 pp 332ndash338 1998

[13] J Han M Kamber and A K H Tung ldquoSpatial clusteringmethods in data mining a surveyrdquo Tech Rep ComputerScience Simon Fraster University 2000

20 International Journal of Distributed Sensor Networks

[14] H-D Yang and F-Q Deng ldquoThe study on immune spatialclustering model based on obstaclerdquo in Proceedings of theInternational Conference on Machine Learning and Cyberneticsvol 2 pp 1214ndash1219 August 2004

[15] T-S Chen T-H Tsai Y-T Chen et al ldquoA combined K-meansand hierarchical clusteringmethod for improving the clusteringefficiency of microarrayrdquo in Proceedings of the InternationalSymposium on Intelligent Signal Processing and CommunicationSystems (ISPACS rsquo05) pp 405ndash408 HongKong China Decem-ber 2005

[16] M Srinivas and C K Mohan ldquoEfficient clustering approachusing incremental and hierarchical clustering methodsrdquo inProceedings of the International Joint Conference on NeuralNetworks (IJCNN rsquo10) pp 1ndash7 July 2010

[17] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[18] A Hinneburg and D A Keim ldquoAn efficient approach to clus-tering in large multimedia databases with noiserdquo in Proceedingsof the International Conference Knowledge Discovery and DataMining pp 58ndash65 1998

[19] K Elangovan GIS Fundamentals Applications and Implemen-tations 2006

[20] S Chawla and S Shekhar ldquoModeling spatial dependencies formining geospatial data an introductionrdquo Geographic DataMining and Knowledge Discovery vol 75 no 6 pp 112ndash1201999

[21] M-Y Cheng and G-L Chang ldquoAutomating utility route designand planning throughGISrdquoAutomation in Construction vol 10no 4 pp 507ndash516 2001

[22] Q Cao B Bouqata P D Mackenzie D Messier and J J SalvoldquoA grid-based clusteringmethod formining frequent trips fromlarge-scale event-based telematics datasetsrdquo in Proceedingsof the IEEE International Conference on Systems Man andCybernetics (SMC rsquo09) pp 2996ndash3001 San Antonio Tex USAOctober 2009

[23] K Krishna and M N Murty ldquoGenetic K-means algorithmrdquoIEEE Transactions on Systems Man and Cybernetics B vol 29no 3 pp 433ndash439 1999

[24] D Pelleg and A W Moore ldquoX-means extending KM withefficient estimation of the number of clustersrdquo in Proceedingsof the 70th International Conference on Machine Learning pp727ndash734 2000

[25] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining pp 226ndash231 1996

[26] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963

[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of


Table 2: Important statistics from the clustering and LP experiments.

Method   Cluster    Number of cells covered  Minimum  Maximum   Overlap
KM       Cluster 1  428                      0        3499327   0
KM       Cluster 2  468                      0        546896    0
KM       Cluster 3  448                      0        20503007  0
KM       Cluster 4  614                      0        6894667   0
KM       Cluster 5  618                      0        900908    0
XM       Cluster 1  615                      0        591265    0
XM       Cluster 2  457                      0        546896    0
XM       Cluster 3  609                      0        900908    0
XM       Cluster 4  465                      0        3499327   0
XM       Cluster 5  430                      0        20503007  0
EM       Cluster 1  1223                     0        2292      61817229
EM       Cluster 2  7                        141048   243705    313018
EM       Cluster 3  81                       0        3033733   131146577
EM       Cluster 4  64                       26752    546896    330881249
EM       Cluster 5  1201                     0        1300026   217950471
DBScan   Cluster 1  13                       23614    33146     327222911
DBScan   Cluster 2  11                       1686825  21001     363965818
DBScan   Cluster 3  13                       178888   2945283   196118393
DBScan   Cluster 4  11                       847733   211008    58940877
DBScan   Cluster 5  2528                     0        546896    20554176
HC       Cluster 1  291                      0        3499327   0
HC       Cluster 2  191                      0        20503007  96762283
HC       Cluster 3  294                      0        1590971   0
HC       Cluster 4  224                      0        189812    12673555
HC       Cluster 5  243                      0        546896    0
LP       Cluster 1  221                      0        3499327   0
LP       Cluster 2  221                      0        20503007  0
LP       Cluster 3  221                      0        1590971   0
LP       Cluster 4  221                      0        189812    0
LP       Cluster 5  221                      0        546896    0

Table 3: Comparison of running time (in seconds) for the first dataset.

Formats             KM    HC     DBScan  XM    EM    LP
Vector database     3.27  12.52  23.24   2.78  9.30  1.83
Raster database     3.42  15.36  28.20   2.84  9.84  2.01
RasterP (16 grids)  1.98  1.34   5.08    0.46  0.57  0.78
RasterP (25 grids)  0.09  0.14   1.15    0.21  0.12  0.53

in dataset 1 is shown in Figure 8, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm, respectively. Likewise, the corresponding result of skeleton extraction on the second dataset is shown in Figure 9, where again (a) adopts the morphological operation method and (b) the thinning algorithm. The comparison result for the two datasets is shown in Table 1.

For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.

The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing. The clustering by grid can then be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer nested iteration in the program code.
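As an illustrative sketch of this preprocessing step (the paper used MATLAB; here scikit-image stands in for the Bwmorph function and the custom thinning routine, and the input file name and binarization threshold are assumptions):

```python
# Sketch of the skeleton-extraction step with scikit-image standing in for
# MATLAB's bwmorph and the paper's custom thinning routine.
from skimage.io import imread
from skimage.morphology import skeletonize, thin

# Load the traffic map and binarize it (threshold value is illustrative).
image = imread("traffic_map.png", as_gray=True)   # hypothetical file name
binary = image > 0.5

# Morphological skeleton (comparable to bwmorph(img, 'skel', Inf)).
skeleton = skeletonize(binary)

# Iterative thinning (comparable to the two-layer thinning loop in the text).
thinned = thin(binary)
```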

The placement of the grid over the image follows one principle: mesh boundaries should not fall on positions where traffic flow is concentrated. Since there is no natural endpoint, the midpoint between two adjacent values is taken as a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file; this digital traffic map serves as the initial data for the subsequent clustering process.
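A minimal sketch of this gridding step, assuming the preprocessed map is available as a 2-D array and using an illustrative cell size (writing the Excel file requires the openpyxl package):

```python
# Aggregate per-pixel traffic flow into grid cells and export the digital
# traffic map; file names and the cell size are assumptions.
import numpy as np
import pandas as pd

flow = np.load("traffic_flow.npy")   # 2-D array of per-pixel flow values
cell = 20                            # grid cell edge length in pixels (assumed)

h = (flow.shape[0] // cell) * cell
w = (flow.shape[1] // cell) * cell
blocks = flow[:h, :w].reshape(h // cell, cell, w // cell, cell)
grid = blocks.sum(axis=(1, 3))       # total traffic volume per grid cell

# One row per cell: (x, y, v), the format used for clustering later.
rows = [(x, y, grid[y, x])
        for y in range(grid.shape[0]) for x in range(grid.shape[1])]
pd.DataFrame(rows, columns=["x", "y", "v"]).to_excel("traffic_grid.xlsx", index=False)
```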

5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum number of iterations at 100. The initial centroids are chosen randomly at the start.


Table 4: Comparison of log-likelihood for the first dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     -12.41868  -14.07265  -13.28599  -11.9533   -12.49562
Raster database     -13.42238  -15.02863  -13.78889  -12.9632   -13.39769
RasterP (16 grids)  -12.62264  -14.02266  -12.48583  -12.39419  -12.44993
RasterP (25 grids)  -12.41868  -13.19417  -11.22207  -12.48201  -11.62048

Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.

Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.

Table 5: Comparison of running time (in seconds) for the second dataset.

Formats             KM    HC     DBScan  XM    EM     LP
Vector database     1.39  1.34   15.53   1.53  10.05  3.37
Raster database     2.41  14.78  18.34   2.17  8.23   1.96
RasterP (16 grids)  0.47  8.01   12.74   0.45  3.77   1.44
RasterP (25 grids)  0.35  6.20   10.98   0.36  2.96   1.18

Furthermore, the weights for the three attributes (x, y, v) of each grid cell g_i = (x_i, y_i, v_i) could be varied (fine-tuned), on the condition that the weights sum to 1. We tested several variations in search of the best clustering results: (1) the weight of v is 20%; (2) the weight of v is 40%; (3) the weight of v is 50%; (4) the weight of v is 60%; (5) the weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) the weight of v is 0; (8) the same weights except when g_i(v_i = 0); and (9) the weights of x and y are both 0 except when g_i(v_i = 0). (A sketch of this weighting pipeline is given after the next paragraph.)

In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted to measure raw numeric data, while the other two options, Jaccard's coefficients and the matching coefficient, are activated only when the data are binary.
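For illustration, the following sketch reproduces the normalization-and-weighting pipeline with scikit-learn standing in for XLMiner; the weights follow case (2) above, and the input file name is an assumption:

```python
# Min-max normalize the grid attributes (x, y, v), scale each attribute by
# its weight, then cluster with KM and Ward-linkage HC as in the experiments.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

data = np.loadtxt("traffic_grid.csv", delimiter=",")   # columns: x, y, v
span = np.ptp(data, axis=0)                            # per-column value range
norm = (data - data.min(axis=0)) / np.where(span == 0, 1, span)
weighted = norm * np.array([0.3, 0.3, 0.4])            # case (2): weights sum to 1

km_labels = KMeans(n_clusters=5, max_iter=100, n_init=10).fit_predict(weighted)
hc_labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(weighted)
```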

For the above nine cases, the results of cases (1) to (6) are similar within their respective methods, and the result of case (9) is the worst, failing to accomplish any clustering. The results of cases (2), (3), (7), and (8) are shown in Figure 10.

In the distribution of clusters produced by the KM clustering method, more than half of the data points are clamped into one oversized cluster; this result is therefore not helpful for further operation. With the HC method, the data are on average allocated into separate clusters.


Table 6: Comparison of log-likelihood for the second dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     -17.35412  -19.62367  -17.53576  -17.21513  -16.57263
Raster database     -18.15926  -20.12568  -19.70756  -18.15791  -18.48209
RasterP (16 grids)  -15.51437  -17.24736  -16.37147  -17.01283  -15.66231
RasterP (25 grids)  -14.84761  -16.63789  -15.09146  -16.67312  -16.47823


Figure 10: Clustering results for the first dataset; in each panel, the top half uses the KM clustering method and the bottom half uses the HC method. (a) Setting case (2), where the weight of v is 40%. (b) Setting case (3), where the weight of v is 50%. (c) Setting case (7), where the weight of v is 0. (d) Setting case (8), where all attributes share the same weight except when g_i(v_i = 0).

Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size      KM     HC      DBScan  XM    EM      LP
100 grid cells    0.06   0.07    1.05    2.19  3.21    0.18
4600 grid cells   0.42   2.95    39.89   2.73  19.05   9.37
10000 grid cells  2.62   46.67   97.55   2.97  37.85   24.21
80000 grid cells  19.75  189.61  684     6.47  198.31  90.83

The result in Figure 10(c) is the best, although it is the only one relying on the distinct position attributes (x and y) alone. The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; allocating critical resources per cluster there may therefore waste resources. The degree of overlap is the least in Figure 10(b). If only location is being considered, the result in Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.

The clustering results of the second dataset obtained by the two methods, KM and HC, are shown in Figure 11.

From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than for the first dataset. There is no overlap phenomenon in the KM results, which is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Overall, for the second dataset, KM is the better choice of the two in consideration of even cluster distribution and overlap avoidance.

5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen at five.


Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.12436   0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.3043    0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469     0.711025

Figure 11: Clustering results for the second dataset by using (a) the KM method and (b) the HC method.

The result of the first dataset is shown in Figure 12. Part (i) of Figure 12 shows the spatial clustering result, and part (ii) visualizes the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups exhibit the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the cluster sizes are similar to one another, and there is no overlap in the clustering result; for the group results, however, the groups in (d) have far more overlaps than those in (b). Overlap means that part of one cluster gets in the way of another, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and the corresponding groups, KM in (b) is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.

With the same experimental setup and operating environment, the spatial clustering experiments are performed on the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.

In Figures 13(a) and 13(e), it is noticeable that the clusters are imbalanced and that there are overlaps in the corresponding spatial groups produced by (a) EM and (e) DBScan. The results of (b) KM and (d) XM avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we removed the empty cells on the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap, and the clusters are balanced against each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f). (A sketch of such an LP formulation follows.)
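The following is a simplified sketch of casting balanced, overlap-free grouping as a linear program. It is our own formulation under stated assumptions, not necessarily the authors' exact model: each of the k groups must contain exactly `size` cells, no cell may belong to two groups, and the covered traffic volume is maximized; the geometric shape of the groups is ignored for brevity.

```python
# Balanced, overlap-free grouping as an LP relaxation with SciPy.
import numpy as np
from scipy.optimize import linprog

v = np.loadtxt("cell_volumes.csv")   # traffic volume per grid cell (assumed file)
n, k = len(v), 5
size = n // k                        # equal group size (cf. 221 cells in Table 2)

# Variable x[g * n + c] in [0, 1]: membership of cell c in group g.
c = -np.tile(v, k)                   # linprog minimizes, so negate to maximize

A_ub = np.zeros((n, n * k))          # each cell joins at most one group
for cell in range(n):
    A_ub[cell, cell::n] = 1.0
b_ub = np.ones(n)

A_eq = np.zeros((k, n * k))          # each group holds exactly `size` cells
for g in range(k):
    A_eq[g, g * n:(g + 1) * n] = 1.0
b_eq = np.full(k, float(size))

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
membership = res.x.reshape(k, n)     # rows: groups; columns: cells
```

Rounding the relaxed memberships (or switching to an integer solver) recovers a hard assignment; the paper's own LP model may add further constraints, for example on group geometry.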

By visually comparing the clustering results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second. The overlaps are likely due to the data distribution and the balance in sizes between clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area; as seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart than those in the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected and shown in Table 2.


Figure 12: Spatial clustering on dataset 1 (i) and the corresponding spatial groups (ii) by using (a) EM, (b) KM, (c) HC, (d) XM, and (e) DBScan.

The numeric results in Table 2 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the numbers of cells covered by the clusters; the amount of overlap in HC is also the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.

6. Technical Analysis of Clustering Results

6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and we assume the number of groups to be five for all six methods. Running time is the time taken to run each method to completion. Balance measures the sizes of the groups; if perfectly balanced, every group has the same size. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster is the proportion of the traffic volumes covered by the grid cells of the cluster over the whole dataset.


Figure 13: Spatial clustering on dataset 2 (i) and the corresponding spatial groups (ii) by using (a) EM, (b) KM, (c) HC, (d) XM, and (e) DBScan; (f) spatial groups from the LP method on dataset 2.


Meanwhile, total coverage is the sum of the traffic volumes covered by all the clusters minus the overlap, if any. The corresponding definitions are given in the equations below:

$$\text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)},$$
$$\text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Traffic Volumes}},$$
$$\text{Total Coverage} = \sum_{i} \text{Coverage}(\text{cluster } i) - \text{Overlaps},$$
$$\text{Balance}(\text{cluster } i) = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}}. \tag{4}$$
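A minimal sketch of computing these factors from a flattened grid, under the assumption that one cluster label is given per cell (with -1 for cells outside every cluster) and that any overlapping volume is supplied separately:

```python
# Evaluation factors of (4) computed from per-cell volumes and cluster labels.
import numpy as np

def evaluate_groups(v, labels, overlap_volume=0.0):
    v, labels = np.asarray(v, dtype=float), np.asarray(labels)
    total_volume, total_cells = v.sum(), len(v)
    report = {}
    for k in np.unique(labels[labels >= 0]):
        in_k = labels == k
        report[int(k)] = {
            "density": v[in_k].sum() / in_k.sum(),     # average volume per cell
            "coverage": v[in_k].sum() / total_volume,  # share of all traffic
            "balance": in_k.sum() / total_cells,       # share of all grid cells
        }
    covered = sum(c["coverage"] for c in report.values())
    # Overlapping volume, if any, is counted once and subtracted.
    total_coverage = covered - overlap_volume / total_volume
    return report, total_coverage
```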

6.2. Comparison Experimental Results. After conducting a number of experiment runs, we selected four different formats of the first dataset on which to perform the clustering algorithms. Vector (n, v) represents a sequence number n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means that every four neighboring cells of the grid are merged into a single unit; and RasterP (25 grids) means that every five neighboring cells are merged into one. In the latter two formats, the data are laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated from the grid; we selected grid sizes of 16 and 25 for these two formats. The original datasets are then encoded in the four formatting types, and the four formatted datasets are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively. (A sketch of the RasterP merging step follows.)
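The merging step behind the RasterP formats can be sketched as a block aggregation; the 4 × 4 and 5 × 5 block sizes are our reading of "16 grids" and "25 grids", and the input file name is an assumption:

```python
# Derive RasterP formats by summing traffic volumes over blocks of cells.
import numpy as np

def rasterp(grid, block):
    h = (grid.shape[0] // block) * block
    w = (grid.shape[1] // block) * block
    trimmed = grid[:h, :w]   # drop ragged edge cells (also trims some noise)
    return trimmed.reshape(h // block, block, w // block, block).sum(axis=(1, 3))

grid = np.load("traffic_grid.npy")   # 2-D array of traffic volume per cell
rasterp16 = rasterp(grid, 4)         # RasterP (16 grids), assuming 4 x 4 blocks
rasterp25 = rasterp(grid, 5)         # RasterP (25 grids), assuming 5 x 5 blocks
```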

According to Table 3, KM spent the least running time across the four kinds of data, with the RasterP (25 grids) dataset being the fastest. Conversely, clustering the vector dataset with the DBScan method took the longest running time. Among the clustering methods, KM spent the least time on the different datasets and DBScan the longest.

In Table 4, we evaluate the log-likelihood of the clusters found by each method, a main metric for quantitatively ensuring the quality of the clusters. From this table we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best, while clustering RasterP (25 grids) using DBScan is the worst.

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we also elongate the dataset to larger sizes by expanding the data map via duplication. Running time trends are thereby produced; the results are shown in Table 7, and the corresponding trend lines in Figure 14.

According to Table 5, KM spent the shortest running time for the four different formats of data, and the RasterP (25 grids) dataset is the fastest, which is expected because it abstracts every 25 cells into one.

Figure 14: Comparison of running time (in seconds) for different sizes of dataset; the legend covers K-means, Hierarchical, DBScan, XMean, EM, and LP, each with a fitted exponential trend line.

On the other hand, clustering the Raster dataset with the DBScan method took the most running time. Across the six methods, KM generally spent the shortest time on the different datasets and DBScan the longest.

In Table 6, we can see that the log-likelihood values of the five methods are again quite similar. Among them, clustering the Raster dataset using the HC method is the best, while clustering RasterP (25 grids) using KM is the worst.

In Table 7, the slowest is DBScan and the quickest is the KM method. In terms of time trend, DBScan's time consumption grows in larger magnitude than that of the other methods, while the trend lines of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
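The trend lines of Figure 14 are exponential fits. A minimal sketch of producing such a fit from the Table 7 timings and locating the HC/EM crossover, using log-linear least squares as our choice of fitting method:

```python
# Fit t = exp(b) * exp(a * size) to each method and solve for the crossover.
import numpy as np

sizes = np.array([100, 4600, 10000, 80000], dtype=float)
hc = np.array([0.07, 2.95, 46.67, 189.61])    # HC times from Table 7
em = np.array([3.21, 19.05, 37.85, 198.31])   # EM times from Table 7

# Least-squares fit of log(t) = a * size + b.
a_hc, b_hc = np.polyfit(sizes, np.log(hc), 1)
a_em, b_em = np.polyfit(sizes, np.log(em), 1)

# The fitted curves intersect where the fitted log-times are equal.
crossover = (b_em - b_hc) / (a_hc - a_em)
print(f"Fitted HC and EM trend lines cross at roughly {crossover:.0f} grid cells")
```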

The following charts and tables present the other technical indicators, namely the coverage, density, and balance of each cluster, for the two datasets.

From Figure 15, one cluster of DBScan dominates the biggest coverage among all clusters produced by the six methods on the first dataset, but for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is appreciably larger than in the first dataset (Tables 8 and 9); this means that the second dataset, with its even data distribution, is well suited to forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.


Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700

Figure 15: Coverage of each cluster (clusters 0-4 and total coverage) by using the six methods for (a) dataset 1 and (b) dataset 2.

Figure 16: Density of each cluster (clusters 0-4 and total density) by using the six methods for (a) dataset 1 and (b) dataset 2.


Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a), one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, but the LP method obtains the largest total density evenly from all its clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset; DBScan has an advantage in merging scattered data into dense groups as long as the data are well scattered.

Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.

6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators, given in (5) to (11) below, have been defined.


Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM        EM        DBScan    XM        HC        LP
Cluster 0      5.258648  0.080823  4.426289  3.431892  2.713810  1.677869
Cluster 1      1.161390  2.329182  0.994949  1.375497  3.501739  1.296230
Cluster 2      7.186556  2.545750  0.807500  1.218667  2.728017  9.703279
Cluster 3      2.572683  1.232386  1.062069  5.171040  4.265905  9.034426
Cluster 4      5.969350  1.42054   0.170455  1.510576  4.088438  1.239180
Total density  12.04343  14.00359  4.729787  11.46972  10.30703  6.087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM        XM           EM           DBScan       HC           LP
Cluster 0      1925.445  2476.642081  3968.13638   1972.394643  5323.785326  3313.18
Cluster 1      1972.395  1763.496208  1502.698729  1972.394643  2140.482869  1667.88
Cluster 2      1408.149  1064.89095   1629.795665  1437.189548  1823.821619  8097.989
Cluster 3      3060.449  6293.956697  2015.105986  1636.350955  799.12225    2474.492
Cluster 4      1773.937  1058.346213  1275.299493  1212.317249  6856.982634  1569.58
Total density  3896.873  3486.653421  6819.713511  8230.647036  5981.503534  5440.447

These indicators evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell numbers across the clusters. Meanwhile, we assign each indicator a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the rest are of less concern, ω_c can take a relatively large value, or even 1; if users consider some attributes more important, the corresponding weights ω can be made larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered:

$$G_l = \left|\frac{\text{Likelihood}}{\text{Time}}\right|, \tag{5}$$
$$G_b = \frac{\text{Difference of Balance}}{\text{Time}}, \tag{6}$$
$$G_d = \frac{\text{Density}}{\text{Time}}, \tag{7}$$
$$G_c = \frac{\text{Coverage}}{\text{Time}}, \tag{8}$$
$$G_o = \frac{\text{Overlap}}{\text{Time}}, \tag{9}$$
$$G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o, \tag{10}$$
$$\text{subject to} \quad \omega_l + \omega_b + \omega_d + \omega_c + \omega_o = 1. \tag{11}$$

From the results of the spatial grouping experiments in the previous sections, we obtain statistics on each group based on the second dataset, expressed as the indicators defined in (5) to (11). They are shown in Table 12, which allows us to compare the various methods and performance aspects easily.

In Table 12, the KM method has the best running time and no overlap, and the XM method also avoids overlap; DBScan and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net values of the other methods are then scaled up accordingly. The comparison result is shown in Table 13.
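To make the computation concrete, the following minimal sketch evaluates (5)-(11) on the Table 12 indicators with equal weights. The 0/1 coding of overlap, the treatment of LP's unavailable log-likelihood as 0, and the unit scaling are our assumptions, so the absolute scores will not reproduce Table 13 exactly; only the mechanics are illustrated.

```python
# Compute G_net per (5)-(11) from the Table 12 indicators, then normalize
# so that the lowest score becomes the base value 1.
import numpy as np

# method: (coverage, density, time, log_likelihood, overlap, diff_balance)
table12 = {
    "KM":     (0.595751, 3896.873, 0.41,  -17.35, 0, 190),
    "XM":     (0.533037, 3486.653, 0.67,  -17.22, 0, 185),
    "EM":     (0.507794, 6819.714, 1.23,  -16.57, 1, 1216),
    "DBScan": (0.461531, 8230.647, 15.67, -17.54, 1, 2517),
    "HC":     (0.677124, 5981.504, 14.78, -20.13, 1, 103),
    "LP":     (0.711025, 5440.447, 7.76,  0.0,    0, 0),   # LL reported as N/A
}
w = np.full(5, 0.2)   # equal weights, summing to 1 as required by (11)

def g_net(cov, den, t, ll, ov, bal, weights=w):
    g = np.array([abs(ll / t), bal / t, den / t, cov / t, ov / t])
    return float(weights @ g)

scores = {m: g_net(*vals) for m, vals in table12.items()}
base = min(scores.values())
normalized = {m: s / base for m, s in scores.items()}   # lowest scaled to 1
```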

According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This has been tested across different datasets, different formats, and different sizes of dataset. For density and log-likelihood, however, the result is not so consistent, as LP is outperformed by DBScan at times. Finally, by the net result G_net, LP is the better choice under the overall consideration of the six performance factors. The weights, which imply priorities or preferences on the performance aspects, should be chosen at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the collected values indicate how important the data are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters.


Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage  Density   Time   Log-likelihood  Overlap  Diff. of balance
KM       0.595751  3896.873  0.41   -17.35          No       190
XM       0.533037  3486.653  0.67   -17.22          No       185
EM       0.507794  6819.714  1.23   -16.57          Yes      1216
DBScan   0.461531  8230.647  15.67  -17.54          Yes      2517
HC       0.677124  5981.504  14.78  -20.13          Yes      103
LP       0.711025  5440.447  7.76   N/A             No       0

Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32

Such clusters serve purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods that identify such optimal spatial groups, of certain sizes and positions, using clustering algorithms or their equivalent, so as to obtain maximum total coverage. Examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributing sensors that monitor the traffic volumes over a city; and planning security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient for achieving the optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory); the resulting performance may vary with the chosen factors and weights, as these can be set arbitrarily by users.

The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and may even produce false groupings. To the best of the authors' knowledge, no study reported in the literature has used a linear programming (LP) method to discover spatial groups and overcome this limit of overlapping. Thus, in this research, we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.

For future extended study, we want to further enhance the algorithm, for example, by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It would be good if the advantages of one algorithm could carry over to the others in the new fusion algorithms to be developed.

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51-58, 2000.
[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.
[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.
[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.
[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.
[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303-312, 2003.
[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016-1032, 2006.
[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127-143, 2002.
[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1-17, 2008.
[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188-1205, 2006.
[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691-699, 2004.
[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332-338, 1998.
[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.
[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214-1219, August 2004.
[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405-408, Hong Kong, China, December 2005.
[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1-7, July 2010.
[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011-1015, 1998.
[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58-65, 1998.
[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.
[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112-120, 1999.
[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507-516, 2001.
[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996-3001, San Antonio, Tex, USA, October 2009.
[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433-439, 1999.
[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727-734, 2000.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011-1015, 1998.
[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236-244, 1963.
[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281-286, Pisa, Italy, September 2006.



14 International Journal of Distributed Sensor Networks

traffic volumes that are covered by all the clusters minus theoverlap if anyThe corresponding definitions are shown in theequations below

Density (cluster 119894) =sumTraffic Volumes (cluster 119894)

Grid Cell Number (cluster 119894)

Coverage (cluster 119894) =sumTraffic Volumes (cluster 119894)

sumGrid Cell Number

Total Coverage = sumTraffic Volumes minusOverlaps

Proportion of Cluster (119894) Size (Balance)

=Grid Cell Number (cluster 119894)

sumGrid Cell Number

(4)

62 Comparison Experimental Result After conducting anumber of experiment runs we select four different formatsof datasets to perform the clustering algorithm for the firstdataset Vector (119899 V) represents sequence 119899 and traffic volumeV Raster (119909 119910 V) represents coordinates (119909 119910) and trafficvolume V RasterP (16 grids) means every four neighborhoodcells over a grid merged into a single unit and RasterP(25 grids) means every five neighborhood cells over a gridmerged as one In the other two types of formats the datainformation is straightforwardly laid on a grid and somenoises such as outlier values are eliminated from the gridWe selected grids of sizes 16 and 25 for the two formatsThe original datasets are then encoded by the four differentdata formatting types The four formatted data are subject tothe five clustering methods and LP method We measure thecorresponding running time and log-likelihood The resultsof the two measurements are shown in Tables 3 and 4respectively

According to Table 3 we can see that KM spent the leastrunning time for the four different kinds of data but the run-time of RasterP (25 grids) dataset is the fastest Contrariwiseclustering of vector dataset using DBScan method spent thelongest running time Among the clustering methods KMspent the least time for different datasets and DBScan tookthe longest

In Table 4 we evaluate the log-likelihood of the clustersfound by each cluster which is a main evaluation metric forensuring quantitatively the quality of the clusters From thistable we can see that the value of log-likelihood of the fivemethods is quite similar Among them clustering of Rasterdataset using HC method is the best one but clustering ofRasterP (25 grids) using DBScan is the worst one

In the same experimental environment the running timeand log-likelihood are shown in Tables 5 and 6 for the seconddataset And in order to stressfully test the performance weelongate the dataset to larger sizes by expanding the datamap via duplication Running time trends are thereforeproduced the result is shown in Table 7 and correspondingtrend line is shown in Figure 14

According to Table 5 we can see that KM spent theshortest running time for the four different formats of databut the time of RasterP (25 grids) dataset is the fastest whichis expected because it abstracts every 25 cells into one On

0

100

200

300

400

500

600

700

800

900

0 20000 40000 60000 80000 100000

K-meansHierarchicalDBScanXMeanEM

Exp (Exp (Hierarchical)Exp (DBScan)Exp (XMean)Exp (EM)Exp (LP)LP

K-means)

Figure 14 Comparison of running time (in seconds) of differentsizes of dataset

the other hand clustering of Raster dataset using DBScanmethod spent the most running time For the different sixmethods KM spent the shortest time for different datasetsand DBScan spent the longest time generally

In Table 6 we can see that the values of log-likelihoodof different six methods are quite similar Among themclustering of Raster dataset using HCmethod is the best onebut clustering of RasterP (25 grids) usingKM is theworst one

In Table 7 we can see that the slowest is DBScan andthe quickest is KM method In terms of time trend DBScanincreases in larger magnitude of time consumption thanother methods but time trends of LP KM and XM are oflower gradients In particular there is an intersection betweenthe trend lines of HC and EM It means that when the size ofdataset exceeds that amount at the intersection EM methodbecomes a better choice than HC

The following charts and tables present the other techni-cal indicators such as coverage density and balance of eachcluster for the two datasets

From Figure 15 we can see that one cluster of DBScandominates the biggest coverage in all clusters as results fromthe sixmethods in the first dataset But for the second datasetLP method yields the biggest coverage cluster Generally theindividual coverage of each cluster in the second dataset isapparently larger than those resulted from the first dataset(Tables 8 and 9)Thismeans that the second dataset is suitablefor achieving spatial groups with the six methods due to itseven data distribution In terms of total coverage LP achievesthe highest values in both cases of datasets In summary LPis by far an effective method to determine spatial groups withthe best coverage

International Journal of Distributed Sensor Networks 15

Table 9 Numeric results of coverage of each cluster by using the six methods for dataset 2

Cov-db2 KM EM DBScan XM HC LPCluster 0 0042721 0001777 0450720 0022150 0013153 0165305Cluster 1 0094175 0086211 0008018 0010064 0026016 0127705Cluster 2 0328026 0032893 0010517 0126953 0124360 0095597Cluster 3 0022797 0351221 0000501 0311761 0001172 0089008Cluster 4 0062281 0101199 0000244 0112973 0304300 0122085Total coverage 0550000 0573301 0470000 0583900 0469000 0599700

0

01

02

03

04

05

06

07

08

KM EM DBScan XM HC LP

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4Total coverage

CoverageCoverage

(a)

0

01

02

03

04

05

06

KM EM DBScan XM HC LP

Coverage

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4Total coverage

Coverage

(b)

Figure 15 (a) Coverage of each cluster by using the sixmethods for dataset 1 (b) Coverage of each cluster by using the sixmethods for dataset2

Total density

0

200

400

600

800

1000

1200

1400

1600

KM EM DBScan XM HC LP

Density

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4

Density

(a)

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

KM EM DBScan XM HC LP

Density

Total density

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4

Density

(b)

Figure 16 (a) Density of each cluster by using the six methods for dataset 1 (b) Density of each cluster by using the six methods for dataset2

16 International Journal of Distributed Sensor Networks

4

5136

8

1

Balance test on dataset 1

(a) KM

1

50

1

18

30

Balance test on dataset 1

(b) XM

6

22

24

30

18

Balance test on dataset 1

(c) EM

24

24

17

20

15

Balance test on dataset 1

(d) DBScan

18

17

22

19

25

Balance test on dataset 1

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(e) HC

20

20

20

20

20

Balance test on dataset 1

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(f) LP

Figure 17 Proportions of cluster sizes (balance) of dataset 1 in by using (a) KM (b) XM (c) EM (d) DBScan (e) HC and (f) LP

From Figure 16(a) we can see that one cluster of EMoccupies the biggest density in all clusters of the six methodsin the first dataset But the LPmethod obtains the largest totaldensity evenly from all the clusters Generally the individualdensity of each cluster in the second dataset is much biggerthan that of the first dataset (Tables 10 and 11) Again it means

that the second dataset has an even data distribution that issuitable for achieving spatial groups with high density Andin terms of total density EM is the best performer in the firstdataset but DBScan achieves the best results in the seconddataset DBScan has an advantage of merging scattered datainto density groups as long as the data are well scattered

International Journal of Distributed Sensor Networks 17

17

18

17

24

24

Balance test on dataset 2

(a) KM

24

18

24

18

17

Balance test on dataset 2

(b) XM

47

032

47

Balance test on dataset 2

(c) EM

1010

98

Balance test on dataset 2

(d) DBScan

23

15

24

18

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(e) HC

20

20

20

20

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(f) LP

Figure 18 Proportions of Cluster Sizes (Balance) of dataset 2 in by using (a) KM (b) XM (c) EM (d) DBScan (e) HC (f) LP

The last evaluation factor is balance the results areshown in Figures 17 and 18 For both datasets only LPmethod can achieve absolute balance for spatial groups com-pletely

63 Discussion of G119899119890119905 For all the six evaluation factors each

of them can be an individual measure to decide whethera method is good or not in certain aspect In general thefollowing indicators (from (5) to (11)) have been defined in

18 International Journal of Distributed Sensor Networks

Table 10 Numeric results of density of each cluster by using the six methods for dataset 1

Density KM EM DBScan XM HC LPCluster 0 5258648 0080823 4426289 3431892 2713810 1677869Cluster 1 1161390 2329182 0994949 1375497 3501739 1296230Cluster 2 7186556 2545750 0807500 1218667 2728017 9703279Cluster 3 2572683 1232386 1062069 5171040 4265905 9034426Cluster 4 5969350 142054 0170455 1510576 4088438 1239180Total density 1204343 1400359 4729787 1146972 1030703 6087049

Table 11 Numeric results of density of each cluster by using the six methods for dataset 2

Density KM XM EM DBScan HC LPCluster 0 1925445 2476642081 396813638 1972394643 5323785326 331318Cluster 1 1972395 1763496208 1502698729 1972394643 2140482869 166788Cluster 2 1408149 106489095 1629795665 1437189548 1823821619 8097989Cluster 3 3060449 6293956697 2015105986 1636350955 79912225 2474492Cluster 4 1773937 1058346213 1275299493 1212317249 6856982634 156958Total density 3896873 3486653421 6819713511 8230647036 5981503534 5440447

order to evaluate which method is an appropriate choicewhen it comes to different datasets and different usersrsquorequirements Among them the difference in balance iscontributed by the difference of grid cell number in eachcluster Meanwhile we assign each of them a proportionalweight 120596 to adjust the evaluation result 119866net The 120596 value isto be tuned by the users depending on their interests Forexample if a verywide coverage is of priority and others are ofless concern 119866

119888can take a relatively very large value or even

1 If users consider that some attributes are more importantthe corresponding weights 120596 for some factors can be largerthan the others Overall 119866net which is the sum of all factorsmultiplied by the corresponding performance indicators is anet indicator signifying how good a clustering process is byconsidering all the performance attributes

119866119897=

10038161003816100381610038161003816100381610038161003816

LikelihoodTime

10038161003816100381610038161003816100381610038161003816

(5)

119866119887=Difference of Balance

Time (6)

119866119889=DensityTime

(7)

119866119888=CoverageTime

(8)

119866119900=OverlapTime

(9)

119866net = 120596119897119866119897+ 120596119889119866119887+ 120596119889lowast 119866119889+ 120596119888119866119888+ 120596119900119866119900 (10)

Constraint 120596119897+ 120596119889+ 120596119887+ 120596119888+ 120596119900= 1 (11)

From the results of spatial grouping as experimented inthe previous sections we obtain some statistic informationon each group based on the second dataset as a range ofindicators depicted from (5) to (11) They are shown in

Table 12 which allows us to easily compare various methodsand performance aspects

In Table 12 KM method has the best run time and nooverlap For XMmethod DBScan and HC demonstrate theiradvantage in density and log-likelihood Nevertheless LPmethod is superior in three aspects coverage no overlapand zero difference of balance with other clusters In orderto further verify the correctness of the above analysis theperformance indicators 119866

119897 119866119887 119866119889 119866119888 and 119866

119900are computed

for obtaining the net performance values119866net assuming equalweights for each method For the sake of easy comparison119866net is normalized by first setting the lowest 119866net amongthe six methods as base value 1 then the 119866net for the othermethods is scaled up accordingly The comparison result isshown in Table 13

According to the experiment results conducted so farLP seems to be the best candidate in almost all the aspectssuch as coverage and balance This is tested across differentdatasets different formats and different sizes of datasetHowever for density and log-likelihood the result is not soconsistent as LP would be outperformed byDBScan at timesFinally by the net result of 119866net LP is a better choice underthe overall consideration of the six performance factorsThe choice of weights which imply priorities or preferenceson the performance aspects should be chosen by the userrsquosdiscretion

7 Conclusion and Future Works

Ubiquitous sensor network generated data that inherentlyhave spatial information When they are viewed afar thelocalizations of the data form some densities spatially dis-tributed over a terrain and the collected data from thesensors indicate how important the values are in their localproximity Given this information the users of the sensornetwork may subsequently want to form spatial clusters for

International Journal of Distributed Sensor Networks 19

Table 12 Performance indicators of the six methods based on dataset 2

Method Coverage Density Time Log-likelihood Overlap Diff of balanceKM 0595751 3896873 041 minus1735 No 190XM 0533037 3486653 067 minus1722 No 185EM 0507794 6819714 123 minus1657 Yes 1216DBScan 0461531 8230647 1567 minus1754 Yes 2517HC 0677124 5981504 1478 minus2013 Yes 103LP 0711025 5440447 776 NA No 0

Table 13 Comparison of different clustering and LP methods by119866net indicator

Methods KM XM EM DBScan HC LP119866net 108 115 111 123 100 132

purposes such as resource allocation distribution evalua-tions or summing up the geographical data into groups Thefocus of this study was to design efficient methods to identifysuch optimal spatial groups that have certain sizes andpositions using clustering algorithms or the equivalent forobtaining maximum total coverage in total Some examplesinclude but are not limited to setting up mobile phonebase stations among an even distribution of mobile phoneusers each may have different demand in usage distributedsensors that monitor the traffic volumes over a city andsecurity patrols in an exhibition where the asset values tobe protected vary and are distributed over a large area Thestudy also investigated whether spatial groups identified byusing different methods are sufficiently efficient for achievingoptimal maximum coverage Five classic spatial groupingalgorithms are discussed and compared in this study by usingdata mining software programsThe identified spatial groupswith different values of data resources were then assessedvia six performance factors Weights were also formulated asfactor coefficients The factors adopted were shown to playa significant role in MAUT (multiattribute utilities theory)The performance under proper factors and weights may varyas the factors could be arbitrarily chosen by users

The spatial groups obtained by classic clustering algo-rithms have some limits such as overlaps It may causeresource being wasted and even false grouping Howeverthere has been no study reported in the literature that theauthors are aware of using linear programming (LP) methodto discover spatial groups and to overcome this limit ofoverlappingThus in this research we implemented this newmethod (LP) to obtain spatial groups for yielding maximumcoverage and completely avoiding overlap A rigorous evalu-ation was used to assess the grouping results by consideringmultiple attributes

For future extended study we want to further enhancethe algorithm such as combining LP method with existingspatial group algorithms to achieve new hybrid algorithmSome clustering algorithms (eg KM) are known to convergequickly and LP though not the quickest it is efficient infinding the optimal groupings without any overlap It will be

good if the advantages from one algorithm to ride over theothers in the new fusion algorithms are to be developed

References

[1] G J Pottie and W J Kaiser ldquoWireless integrated network sen-sorsrdquo Communications of the ACM vol 43 no 5 pp 51ndash582000

[2] K H Eom M C Kim S J Lee and C W Lee ldquoThe vegetablefreshness monitoring system using RFID with oxygen andcarbon dioxide sensorrdquo International Journal of DistributedSensor Networks vol 2012 Article ID 472986 6 pages 2012

[3] G Manes G Collodi R Fusco L Gelpi and A Manes ldquoAwireless sensor network for precise volatile organic compoundmonitoringrdquo International Journal of Distributed Sensor Net-works vol 2012 Article ID 820716 13 pages 2012

[4] Y-G Ha H Kim and Y-C Byun ldquoEnergy-efficient fire mon-itoring over cluster-based wireless sensor networksrdquo Interna-tional Journal of Distributed Sensor Networks vol 2012 ArticleID 460754 11 pages 2012

[5] A Wahid and D Kim ldquoAn energy efficient localization-freerouting protocol for underwater wireless sensor networksrdquoInternational Journal of Distributed Sensor Networks vol 2012Article ID 307246 11 pages 2012

[6] T N Tran R Wehrens and L M C Buydens ldquoSpaRef a clus-tering algorithm for multispectral imagesrdquo Analytica Chimi-ca Acta vol 490 no 1-2 pp 303ndash312 2003

[7] G Ayala I Epifanio A Simo and V Zapater ldquoClusteringof spatial point patternsrdquo Computational Statistics and DataAnalysis vol 50 no 4 pp 1016ndash1032 2006

[8] J Domingo G Ayala and M E Dıaz ldquoMorphometric analysisof human corneal endothelium by means of spatial point pat-ternsrdquo International Journal of Pattern Recognition and ArtificialIntelligence vol 16 no 2 pp 127ndash143 2002

[9] E Demir C Aykanat and B Barla Cambazoglu ldquoClusteringspatial networks for aggregate query processing a hypergraphapproachrdquo Information Systems vol 33 no 1 pp 1ndash17 2008

[10] T Hu and S Y Sung ldquoA hybrid EM approach to spatial clus-teringrdquo Computational Statistics and Data Analysis vol 50 no5 pp 1188ndash1205 2006

[11] G Lin ldquoComparing spatial clustering tests based on rare tocommon spatial eventsrdquo Computers Environment and UrbanSystems vol 28 no 6 pp 691ndash699 2004

[12] M Ester and H-P Kriegel ldquoClustering for mining in largespatial databases [Special Issue on Data Mining]rdquo KI-Journalvol 1 pp 332ndash338 1998

[13] J Han M Kamber and A K H Tung ldquoSpatial clusteringmethods in data mining a surveyrdquo Tech Rep ComputerScience Simon Fraster University 2000

20 International Journal of Distributed Sensor Networks

[14] H-D Yang and F-Q Deng ldquoThe study on immune spatialclustering model based on obstaclerdquo in Proceedings of theInternational Conference on Machine Learning and Cyberneticsvol 2 pp 1214ndash1219 August 2004

[15] T-S Chen T-H Tsai Y-T Chen et al ldquoA combined K-meansand hierarchical clusteringmethod for improving the clusteringefficiency of microarrayrdquo in Proceedings of the InternationalSymposium on Intelligent Signal Processing and CommunicationSystems (ISPACS rsquo05) pp 405ndash408 HongKong China Decem-ber 2005

[16] M Srinivas and C K Mohan ldquoEfficient clustering approachusing incremental and hierarchical clustering methodsrdquo inProceedings of the International Joint Conference on NeuralNetworks (IJCNN rsquo10) pp 1ndash7 July 2010

[17] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[18] A Hinneburg and D A Keim ldquoAn efficient approach to clus-tering in large multimedia databases with noiserdquo in Proceedingsof the International Conference Knowledge Discovery and DataMining pp 58ndash65 1998

[19] K Elangovan GIS Fundamentals Applications and Implemen-tations 2006

[20] S Chawla and S Shekhar ldquoModeling spatial dependencies formining geospatial data an introductionrdquo Geographic DataMining and Knowledge Discovery vol 75 no 6 pp 112ndash1201999

[21] M-Y Cheng and G-L Chang ldquoAutomating utility route designand planning throughGISrdquoAutomation in Construction vol 10no 4 pp 507ndash516 2001

[22] Q Cao B Bouqata P D Mackenzie D Messier and J J SalvoldquoA grid-based clusteringmethod formining frequent trips fromlarge-scale event-based telematics datasetsrdquo in Proceedingsof the IEEE International Conference on Systems Man andCybernetics (SMC rsquo09) pp 2996ndash3001 San Antonio Tex USAOctober 2009

[23] K Krishna and M N Murty ldquoGenetic K-means algorithmrdquoIEEE Transactions on Systems Man and Cybernetics B vol 29no 3 pp 433ndash439 1999

[24] D Pelleg and A W Moore ldquoX-means extending KM withefficient estimation of the number of clustersrdquo in Proceedingsof the 70th International Conference on Machine Learning pp727ndash734 2000

[25] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining pp 226ndash231 1996

[26] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963

[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of


Table 6: Comparison of log-likelihood for the second dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     −17.35412  −19.62367  −17.53576  −17.21513  −16.57263
Raster database     −18.15926  −20.12568  −19.70756  −18.15791  −18.48209
RasterP (16 grids)  −15.51437  −17.24736  −16.37147  −17.01283  −15.66231
RasterP (25 grids)  −14.84761  −16.63789  −15.09146  −16.67312  −16.47823

Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40; top half uses the KM clustering method and bottom half uses the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50; top half uses KM and bottom half uses HC. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0; top half uses KM and bottom half uses HC. (d) Clustering results for the first dataset with setting case (3), where all attributes share the same weight except g_i (v_i = 0); top half uses KM and bottom half uses HC.

Table 7: Comparison of running time (in seconds) of four different sizes of dataset.

Dataset size       KM     HC      DBScan  XM    EM      LP
100 grid cells     0.06   0.07    1.05    2.19  3.21    0.18
4600 grid cells    0.42   2.95    39.89   2.73  19.05   9.37
10000 grid cells   2.62   46.67   97.55   2.97  37.85   24.21
80000 grid cells   19.75  189.61  684     6.47  198.31  90.83

The result in Figure 10(c) is the best, being the only one clustered purely on the distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; therefore, allocating critical resources per cluster there may result in a waste of resources. The degree of overlap is the least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.

The clustering results of the second dataset by using the two methods, KM and HC, are shown in Figure 11.

From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than for the first dataset. There is also no overlap phenomenon in the KM results, which is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two methods for the sake of even cluster distribution and overlap avoidance.

5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets, using five clustering methods in Weka and the LP method. The common requirement is no overlap on each of the resulting maps. The number of clusters is arbitrarily chosen at five.


Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.12436   0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.3043    0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469     0.711025

Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.

The result of the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result, and the second part (ii) visualizes the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.
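As a concrete sketch of this center-then-group step, the fragment below uses scikit-learn's KMeans as a stand-in for the Weka implementations and draws a fixed-radius disk around each computed center; the synthetic cells array and the radius are illustrative assumptions, not values from the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Hypothetical Raster (x, y, v) data: one row per grid cell.
rng = np.random.default_rng(0)
cells = np.column_stack([rng.uniform(0, 100, 500),   # x
                         rng.uniform(0, 100, 500),   # y
                         rng.exponential(50, 500)])  # v (traffic volume)

k = 5  # number of clusters, as in the experiments
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(cells[:, :2])

fig, ax = plt.subplots()
ax.scatter(cells[:, 0], cells[:, 1], c=km.labels_, s=8)
for cx, cy in km.cluster_centers_:
    # Draw the spatial "group" as a fixed-radius disk around each center.
    ax.add_patch(plt.Circle((cx, cy), radius=12, fill=False, color="k"))
plt.show()
```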

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure; the corresponding groups show the overlap phenomenon too. For the result of (c), the sizes of the clusters are uneven as well. For the results of (b) and (d), the sizes of the clusters seem similar to each other, and there is no overlap in the clustering result; for the group result, however, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of one cluster gets in the way of another, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application, such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (b) KM is so far the best choice of clustering algorithm, as evidenced by the colored maps.
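Overlap in this sense can be quantified directly by counting the grid cells claimed by two or more groups. A minimal sketch, assuming groups are approximated by fixed-radius disks around their centers (the disk model and radius are illustrative assumptions):

```python
import numpy as np

def overlap_cells(points, centers, radius):
    """Count grid cells that fall inside more than one group disk.

    points:  (n, 2) array of cell coordinates.
    centers: (k, 2) array of group centers.
    """
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    membership = (d <= radius).sum(axis=1)  # number of groups claiming each cell
    return int((membership >= 2).sum())

# Toy 10 x 10 grid and two nearby group centers.
pts = np.stack(np.meshgrid(np.arange(10.0), np.arange(10.0)), -1).reshape(-1, 2)
centers = np.array([[2.0, 2.0], [4.0, 3.0]])
print(overlap_cells(pts, centers, radius=2.5))
```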

With the same experiment setup and operating environment, the spatial clustering experiments are performed over the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) represents the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.

In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using (a) EM and (e) DBScan. The results of (b) KM and (d) XM avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, after we remove the empty cells on the boundary to reduce the size of the dataset, the clustering result is perfect: there is no overlap, and the clusters are balanced against each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while requiring the groups to be of the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
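One standard way to obtain equal-size, overlap-free groups (shown here only as an illustrative sketch, not necessarily the paper's exact LP formulation) is a transportation-style LP that assigns every grid cell to exactly one of k groups of capacity n/k while minimizing the total squared distance to assumed group seeds. Because the constraint matrix is that of a transportation problem, an optimal vertex solution is integral, so the relaxed variables come out as 0/1 assignments:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, k = 20, 5                        # 20 cells, 5 equal groups of 4
pts = rng.uniform(0, 10, (n, 2))    # hypothetical cell coordinates
seeds = rng.uniform(0, 10, (k, 2))  # hypothetical group seed positions

# Cost c[i, j]: squared distance from cell i to seed j.
cost = ((pts[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)

# Equality constraints: each cell in exactly one group (n rows),
# each group holds exactly n/k cells (k rows).
A_eq = np.zeros((n + k, n * k))
b_eq = np.concatenate([np.ones(n), np.full(k, n / k)])
for i in range(n):
    A_eq[i, i * k:(i + 1) * k] = 1   # row for cell i
for j in range(k):
    A_eq[n + j, j::k] = 1            # row for group j

res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
assign = res.x.reshape(n, k).argmax(axis=1)  # integral at an optimal vertex
print(assign)
```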

By visually comparing the clustering results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those in the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected and shown in Table 2. The numeric results in Table 3 support the qualitative analysis by visual inspection in the previous section.


Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.

By comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the numbers of cells covered by the clusters; also, the amount of overlap in HC is the highest of all. By the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.

6. Technical Analysis of Clustering Results

6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion, using the same software on the same computer. Balance measures the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests for goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of the traffic volumes that is covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of the traffic volumes that are covered by all the clusters, minus the overlap, if any.


Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial groups by the LP method on dataset 2.


The corresponding definitions are shown in the equations below:

\[
\text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)},
\]
\[
\text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Traffic Volumes}},
\]
\[
\text{Total Coverage} = \sum_{i} \text{Coverage}(\text{cluster } i) - \text{Overlaps},
\]
\[
\text{Proportion of Cluster } i \text{ Size (Balance)} = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}}.
\tag{4}
\]
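Read directly off (4), with coverage taken as the share of total traffic volume following the prose definition, these factors are straightforward to compute once each grid cell carries a traffic volume and a cluster label. A minimal sketch under those assumptions:

```python
import numpy as np

def cluster_metrics(volumes, labels):
    """Per-cluster density, coverage, and balance for a gridded dataset.

    volumes: 1-D array of traffic volumes, one entry per grid cell.
    labels:  cluster id per grid cell (-1 for cells in no cluster).
    """
    total_volume = volumes.sum()
    total_cells = len(volumes)
    out = {}
    for c in np.unique(labels[labels >= 0]):
        in_c = labels == c
        out[int(c)] = {
            "density": volumes[in_c].sum() / in_c.sum(),
            "coverage": volumes[in_c].sum() / total_volume,
            "balance": in_c.sum() / total_cells,
        }
    return out

# Toy usage: six cells, two clusters, one unassigned cell.
vols = np.array([10.0, 40.0, 5.0, 25.0, 15.0, 5.0])
labs = np.array([0, 0, 0, 1, 1, -1])
print(cluster_metrics(vols, labs))
```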

6.2. Comparison Experimental Result. After conducting a number of experiment runs, we select four different formats of datasets on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every 4 × 4 block of neighboring cells over a grid is merged into a single unit; and RasterP (25 grids) means every 5 × 5 block is merged as one. In the latter two formats, the data information is laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated from the grid; we selected blocks of sizes 16 and 25 for the two formats. The original datasets are then encoded in the four different data formatting types. The four formatted datasets are subjected to the five clustering methods and the LP method, and we measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively.
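Under our reading of the RasterP formats (a 4 × 4 block for 16 grids, a 5 × 5 block for 25 grids), the block aggregation can be done in a few lines; the summing rule is an assumption for illustration:

```python
import numpy as np

def rasterp(grid, block):
    # Merge each block x block neighborhood of cells into one unit by
    # summing its traffic volumes (RasterP-16 -> block=4, RasterP-25 -> block=5).
    h, w = grid.shape
    assert h % block == 0 and w % block == 0
    return grid.reshape(h // block, block, w // block, block).sum(axis=(1, 3))

grid = np.arange(100.0).reshape(10, 10)  # toy 10 x 10 traffic-volume grid
print(rasterp(grid, 5).shape)            # -> (2, 2)
```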

According to Table 3, we can see that KM spent the least running time on all four kinds of data, and among the formats the RasterP (25 grids) dataset is the fastest to cluster. Contrariwise, clustering the Vector dataset using the DBScan method spent the longest running time. Among the clustering methods, KM spent the least time across the different datasets, and DBScan took the longest.

In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best case, while clustering RasterP (25 grids) using DBScan is the worst one.
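Weka reports the log-likelihood for its probabilistic clusterers; outside Weka, an analogous per-sample figure can be obtained from any mixture model, for example with scikit-learn's GaussianMixture (an analogue for illustration, not the Weka implementation used in the experiments):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))  # stand-in for (x, y, v) records

gm = GaussianMixture(n_components=5, random_state=0).fit(X)
# score() is the mean log-likelihood per sample: larger (less negative)
# means the mixture fits the data better, as in the tables above.
print(gm.score(X))
```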

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we also enlarge the dataset to larger sizes by expanding the data map via duplication. Running time trends are thereby produced; the result is shown in Table 7, and the corresponding trend lines are shown in Figure 14.
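The duplication-based enlargement can be sketched with a simple tiling of the data map; np.tile is our assumed mechanism, as the text does not spell one out:

```python
import numpy as np

grid = np.arange(25.0).reshape(5, 5)  # toy traffic-volume map
big = np.tile(grid, (4, 4))           # 4 x 4 copies -> 16x the cells
print(grid.size, "->", big.size)      # 25 -> 400
```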

According to Table 5, we can see that KM spent the shortest running time on the four different formats of data, and the time on the RasterP (25 grids) dataset is the shortest, which is expected because that format abstracts every 25 cells into one.

Figure 14: Comparison of running time (in seconds) of different sizes of dataset, with exponential trend lines for K-means, Hierarchical, DBScan, XMean, EM, and LP.

On the other hand, clustering the Raster dataset using the DBScan method spent the most running time. Across the six methods, KM spent the shortest time on the different datasets, and DBScan generally spent the longest.

In Table 6, we can see that the log-likelihood values of the five clustering methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best case, while clustering RasterP (25 grids) using KM is the worst one.

In Table 7, we can see that the slowest is DBScan and the quickest is the KM method. In terms of the time trend, DBScan increases its time consumption in larger magnitude than the other methods, whereas the trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
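The exponential trend lines of Figure 14 can be reproduced by fitting a straight line to the logarithm of the running times; the crossing point of two fitted lines then estimates the dataset size beyond which one method overtakes another. A sketch using the HC and EM rows of Table 7 (under our decimal reading of those values):

```python
import numpy as np

sizes = np.array([100, 4600, 10000, 80000], dtype=float)
t_hc = np.array([0.07, 2.95, 46.67, 189.61])   # HC times (s), Table 7
t_em = np.array([3.21, 19.05, 37.85, 198.31])  # EM times (s), Table 7

# Fit t = a * exp(b * n) via linear regression on log(t).
b1, a1 = np.polyfit(sizes, np.log(t_hc), 1)    # slope, intercept for HC
b2, a2 = np.polyfit(sizes, np.log(t_em), 1)    # slope, intercept for EM

# The log-space lines intersect where a1 + b1*n = a2 + b2*n.
n_cross = (a2 - a1) / (b1 - b2)
print(f"EM overtakes HC near n = {n_cross:.0f} cells")
```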

The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.

From Figure 15, we can see that one cluster of DBScan dominates the biggest coverage among all clusters produced by the six methods in the first dataset, while for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is apparently larger than in the first dataset (Tables 8 and 9). This means that the second dataset, with its even data distribution, is suitable for achieving spatial groups with the six methods. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method to determine spatial groups with the best coverage.


Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700

Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.

Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.


Figure 17: Proportions of cluster sizes (balance) of dataset 1, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a), we can see that one cluster of EM occupies the biggest density among all clusters of the six methods in the first dataset, but the LP method obtains the largest total density evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer in the first dataset, but DBScan achieves the best result in the second dataset; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.


Figure 18: Proportions of cluster sizes (balance) of dataset 2, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method can achieve completely balanced spatial groups.

6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure to decide whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in order to evaluate which method is an appropriate choice for different datasets and different users' requirements.


Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447

Among them, the difference in balance is contributed by the difference in grid cell numbers between the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if a very wide coverage is the priority and the others are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, which is the sum of all weights multiplied by the corresponding performance indicators, is a net indicator signifying how good a clustering process is when considering all the performance attributes.

\[
G_l = \left| \frac{\text{Likelihood}}{\text{Time}} \right| \quad (5)
\]
\[
G_b = \frac{\text{Difference of Balance}}{\text{Time}} \quad (6)
\]
\[
G_d = \frac{\text{Density}}{\text{Time}} \quad (7)
\]
\[
G_c = \frac{\text{Coverage}}{\text{Time}} \quad (8)
\]
\[
G_o = \frac{\text{Overlap}}{\text{Time}} \quad (9)
\]
\[
G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o \quad (10)
\]
\[
\text{Constraint: } \omega_l + \omega_b + \omega_d + \omega_c + \omega_o = 1 \quad (11)
\]
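Indicators (5) to (10) reduce to a simple weighted sum once the factor values are on comparable scales; the sketch below assumes pre-normalized, hypothetical inputs and encodes overlap as a 0/1 flag, which is one possible reading of (9):

```python
def g_net(likelihood, balance_diff, density, coverage, overlap, time,
          weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    # Indicators (5)-(9): each factor is divided by the running time;
    # overlap is encoded as 0 (no) or 1 (yes) -- an assumption.
    g = (abs(likelihood) / time, balance_diff / time, density / time,
         coverage / time, overlap / time)
    return sum(w * x for w, x in zip(weights, g))

# Hypothetical, pre-normalized factor rows for three methods.
rows = {"KM": (0.86, 0.08, 0.47, 0.84, 0, 0.41),
        "HC": (1.00, 0.04, 0.73, 0.95, 1, 14.78),
        "LP": (0.00, 0.00, 0.66, 1.00, 0, 7.76)}
scores = {m: g_net(*r) for m, r in rows.items()}

base = min(scores.values())  # the lowest G_net becomes base value 1
print({m: round(s / base, 2) for m, s in scores.items()})
```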

From the results of the spatial grouping experiments in the previous sections, we obtain some statistical information on each group, based on the second dataset, as the range of indicators depicted in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.

In Table 12, the KM method has the best run time and no overlap, while XM, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between the clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net of each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.

According to the experiment results conducted so far, LP seems to be the best candidate in almost all the aspects, such as coverage and balance. This was tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the results are not so consistent, as LP would be outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under the overall consideration of the six performance factors. The choice of weights, which imply priorities or preferences on the performance aspects, is at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups.


Table 12: Performance indicators of the six methods based on dataset 2.

Method  Coverage  Density  Time   Log-likelihood  Overlap  Diff. of balance
KM      0.595751  3896873  0.41   −17.35          No       190
XM      0.533037  3486653  0.67   −17.22          No       185
EM      0.507794  6819714  1.23   −16.57          Yes      1216
DBScan  0.461531  8230647  15.67  −17.54          Yes      2517
HC      0.677124  5981504  14.78  −20.13          Yes      103
LP      0.711025  5440447  7.76   N/A             No       0

Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32

The focus of this study was to design efficient methods to identify such optimal spatial groups, of certain sizes and positions, using clustering algorithms or their equivalent, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by using different methods are sufficiently efficient for achieving the optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients; such factors were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.

The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, to the best of the authors' knowledge, no study reported in the literature has used the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research, we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.

For future extended study, we want to further enhance the algorithm, such as by combining the LP method with existing spatial grouping algorithms to achieve a new hybrid algorithm. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It will be good if the advantages of one algorithm can be carried over to the others in the new fusion algorithms to be developed.

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.

[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.

[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.

[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.

[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.

[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.

[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.

[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.

[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.

[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.

[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.

[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.

[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.

[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.

[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.

[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.

[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.

[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.

[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.

[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.

[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale, event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.

[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.

[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.

[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.

[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.

[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of


Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.029436   0.003786   0.017902   0.075178   0.013153   0.028985
Cluster 1        0.301538   0.269602   0.208078   0.049761   0.026016   0.377034
Cluster 2        0.215277   0.001627   0.158439   0.084049   0.124360   0.080099
Cluster 3        0.046788   0.096221   0.079177   0.209390   0.001172   0.217204
Cluster 4        0.002712   0.161799   0.044197   0.043152   0.304300   0.007704
Total coverage   0.595751   0.533036   0.507793   0.461531   0.469000   0.711025

Figure 11: (a) Clustering results for the second dataset by using KM method. (b) Clustering results for the second dataset by using HC method.

The result of the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result; the second part (ii) visualizes the corresponding spatial groups by using (a) EM method, (b) KM method, (c) HC method, (d) XM method, and (e) DBScan method. The centers of the clusters are computed after clustering is done, and then the groups are visualized over the clustering results according to the computed centers.

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven; more than half of the dataset falls into one cluster. This reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups exhibit overlap as well. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters are similar to each other, and there is no overlap in the clustering result; in the group results, however, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause wasted resources and even false allocation. This situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (b) KM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.

With the same experiment setup and operating environment, the spatial clustering experiments are performed over the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) represents the corresponding spatial groups by using (a) EM method, (b) KM method, (c) HC method, (d) XM method, and (e) DBScan method.

In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups produced by (a) EM and (e) DBScan. The results of (b) KM and (d) XM avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, after we remove the empty cells on the boundary to reduce the size of the dataset, the clustering result is perfect: there is no overlap, and the clusters are balanced against each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f); a sketch of how such a grouping can be posed as an integer program is given below.
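To make the idea concrete, the following is a minimal sketch, not the paper's exact formulation, of how an overlap-free, equal-size grouping that maximizes covered traffic volume can be written as a 0-1 linear program. The PuLP library, the helper name lp_groups, and the omission of any spatial-compactness constraint are all assumptions made for illustration.

```python
import pulp

def lp_groups(volume, k):
    """volume: dict mapping grid cell id -> traffic volume; k: number of groups."""
    cells = list(volume)
    size = len(cells) // k                      # equal group size enforces balance
    prob = pulp.LpProblem("spatial_groups", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", (cells, range(k)), cat="Binary")
    # Objective: maximize the total traffic volume covered by all groups.
    prob += pulp.lpSum(volume[c] * x[c][g] for c in cells for g in range(k))
    # No overlap: each cell may belong to at most one group.
    for c in cells:
        prob += pulp.lpSum(x[c][g] for g in range(k)) <= 1
    # Perfect balance: every group holds exactly the same number of cells.
    for g in range(k):
        prob += pulp.lpSum(x[c][g] for c in cells) == size
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [[c for c in cells if x[c][g].value() > 0.5] for g in range(k)]

# Toy usage: nine cells split into three disjoint, equal-size groups.
volumes = {i: v for i, v in enumerate([5.0, 1.0, 7.0, 2.0, 9.0, 3.0, 8.0, 4.0, 6.0])}
print(lp_groups(volumes, 3))
```

A real formulation would also tie each group to a contiguous region around a chosen center; without such a constraint, the program simply selects the highest-volume cells.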

By visually comparing the clustering results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those in the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information of dataset 2 is collected and shown in Table 2.


Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.

The numeric results in Table 3 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the numbers of cells covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.

6. Technical Analysis of Clustering Results

6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion, using the same software on the same computer. Balance measures the sizes of the groups: if they are balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, which is called the log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster is the proportion of the traffic volumes covered by the grid cells within the cluster over the whole dataset; total coverage, in turn, is the sum of the traffic volumes covered by all the clusters, minus any overlap.


Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial group by the LP method on dataset 2.


The corresponding definitions are given in the equations below.

$$\text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)},$$

$$\text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Grid Cell Number}},$$

$$\text{Total Coverage} = \sum \text{Traffic Volumes} - \text{Overlaps},$$

$$\text{Proportion of Cluster } i \text{ Size (Balance)} = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}}. \tag{4}$$
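As a concrete reading of (4), the following minimal sketch, whose function names and data layout are assumed for illustration rather than taken from the paper, computes density, coverage, balance, and total coverage from clusters given as lists of grid-cell indices:

```python
def density(cluster, volume):
    # Average traffic volume per grid cell inside one cluster.
    return sum(volume[c] for c in cluster) / len(cluster)

def coverage(cluster, volume, n_cells):
    # Traffic volume captured by the cluster per grid cell of the whole map.
    return sum(volume[c] for c in cluster) / n_cells

def balance(cluster, n_cells):
    # Proportion of all grid cells that fall into this cluster.
    return len(cluster) / n_cells

def total_coverage(clusters, volume):
    # Sum of volumes covered by all clusters minus the overlaps, so a cell
    # assigned to several clusters is only counted once.
    covered = [c for cl in clusters for c in cl]
    gross = sum(volume[c] for c in covered)
    overlap = gross - sum(volume[c] for c in set(covered))
    return gross - overlap

volume = {0: 5.0, 1: 2.0, 2: 7.0, 3: 1.0}        # toy traffic volumes
print(total_coverage([[0, 1], [1, 2]], volume))   # 14.0: cell 1 counted once
```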

6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of the first dataset on which to perform the clustering algorithms. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighborhood cells over a grid are merged into a single unit, and RasterP (25 grids) means every five neighborhood cells over a grid are merged into one. In the latter two formats, the data are laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated from the grid. We selected grids of sizes 16 and 25 for the two formats. The original datasets are then encoded in the four different data formats, the four formatted datasets are subjected to the five clustering methods and the LP method, and we measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively. A sketch of the RasterP merging step follows.
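The RasterP encoding can be pictured as block-merging a raster of traffic volumes. The sketch below is an assumption of how such merging might be done (block side b, summed volumes), not the paper's actual preprocessing code:

```python
import numpy as np

def raster_to_rasterp(grid, b):
    # Merge each b-by-b block of raster cells into one unit whose volume is
    # the sum of its member cells; ragged border cells are dropped.
    h, w = grid.shape
    h, w = h - h % b, w - w % b
    blocks = grid[:h, :w].reshape(h // b, b, w // b, b)
    return blocks.sum(axis=(1, 3))

rng = np.random.default_rng(0)
raster = rng.poisson(20, size=(100, 100))      # toy traffic-volume raster
print(raster_to_rasterp(raster, 5).shape)      # (20, 20): the "25 grids" variant
```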

According to Table 3, KM spent the least running time on all four kinds of data, and the runs on the RasterP (25 grids) dataset were the fastest. Conversely, clustering the Vector dataset using the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the different datasets and DBScan took the longest.

In Table 4 we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method gives the best value, while clustering RasterP (25 grids) using DBScan gives the worst.

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we also enlarge the dataset by duplicating the data map. Running-time trends are thereby produced; the results are shown in Table 7, and the corresponding trend lines are shown in Figure 14. A sketch of this timing harness is given below.
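The stress test can be reproduced along these lines; the sketch below is a hedged illustration rather than the paper's setup, using scikit-learn's KMeans as a stand-in for the clustering tools and enlarging a toy traffic map by duplication:

```python
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
grid = rng.poisson(20, size=(50, 50)).astype(float)   # toy traffic-volume map

for factor in (1, 2, 4, 8):
    big = np.tile(grid, (factor, 1))                  # elongate the map by duplication
    # Flatten to (x, y, volume) rows, i.e., the Raster format described above.
    xyv = np.column_stack([*np.indices(big.shape).reshape(2, -1), big.ravel()])
    t0 = time.perf_counter()
    KMeans(n_clusters=5, n_init=10).fit(xyv)          # five spatial groups
    print(factor, round(time.perf_counter() - t0, 3), "s")
```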

According to Table 5, KM spent the shortest running time on the four different data formats, and the runs on the RasterP (25 grids) dataset were again the fastest, which is expected because that format abstracts every 25 cells into one.

Figure 14: Comparison of running time (in seconds) of different sizes of dataset, with exponential trend lines fitted for K-means, Hierarchical, DBScan, XMean, EM, and LP.

On the other hand, clustering the Raster dataset using the DBScan method took the most running time. Across the six methods, KM generally spent the shortest time on the different datasets and DBScan the longest.

In Table 6, we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering the Raster dataset using the HC method gives the best value, while clustering RasterP (25 grids) using KM gives the worst.

In Table 7, we can see that the slowest method is DBScan and the quickest is KM. In terms of time trends, DBScan's time consumption grows in larger magnitude than the other methods', whereas the trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: once the dataset grows beyond the size at the intersection, the EM method becomes a better choice than HC. The crossing point can be estimated from the fitted trends, as sketched below.
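Since the trend lines in Figure 14 are exponential fits, the intersection of two of them can be located by fitting log(time) linearly against dataset size; the timings below are illustrative placeholders, not the measured values of Table 7:

```python
import numpy as np

sizes = np.array([20000, 40000, 60000, 80000, 100000], dtype=float)
t_hc = np.array([30.0, 70.0, 160.0, 360.0, 800.0])   # illustrative HC timings (s)
t_em = np.array([60.0, 100.0, 170.0, 280.0, 470.0])  # illustrative EM timings (s)

def exp_fit(x, t):
    # Fit t ~ a * exp(b * x) by a linear fit on log(t).
    b, log_a = np.polyfit(x, np.log(t), 1)
    return np.exp(log_a), b

(a1, b1), (a2, b2) = exp_fit(sizes, t_hc), exp_fit(sizes, t_em)
crossing = np.log(a2 / a1) / (b1 - b2)               # where the two trends meet
print(f"EM overtakes HC beyond roughly {crossing:.0f} data points")
```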

The following charts and tables present the other technical indicators, namely, the coverage, density, and balance of each cluster, for the two datasets.

From Figure 15, we can see that one cluster of DBScan dominates the coverage among all clusters produced by the six methods on the first dataset, whereas for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is noticeably larger than in the first dataset (Tables 8 and 9). This means that the second dataset, owing to its even data distribution, is well suited for forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.


Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.042721   0.001777   0.450720   0.022150   0.013153   0.165305
Cluster 1        0.094175   0.086211   0.008018   0.010064   0.026016   0.127705
Cluster 2        0.328026   0.032893   0.010517   0.126953   0.124360   0.095597
Cluster 3        0.022797   0.351221   0.000501   0.311761   0.001172   0.089008
Cluster 4        0.062281   0.101199   0.000244   0.112973   0.304300   0.122085
Total coverage   0.550000   0.573301   0.470000   0.583900   0.469000   0.599700

Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.

Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.


Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a), we can see that one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, but the LP method obtains the largest total density, contributed evenly by all of its clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.


Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.

6.3. Discussion of $G_{\text{net}}$. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators, (5) to (11), have been defined in order to evaluate which method is an appropriate choice for different datasets and different users' requirements.


Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density         KM        EM        DBScan    XM        HC        LP
Cluster 0       5258648   0080823   4426289   3431892   2713810   1677869
Cluster 1       1161390   2329182   0994949   1375497   3501739   1296230
Cluster 2       7186556   2545750   0807500   1218667   2728017   9703279
Cluster 3       2572683   1232386   1062069   5171040   4265905   9034426
Cluster 4       5969350   142054    0170455   1510576   4088438   1239180
Total density   1204343   1400359   4729787   1146972   1030703   6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density         KM        XM           EM           DBScan       HC           LP
Cluster 0       1925445   2476642081   396813638    1972394643   5323785326   331318
Cluster 1       1972395   1763496208   1502698729   1972394643   2140482869   166788
Cluster 2       1408149   106489095    1629795665   1437189548   1823821619   8097989
Cluster 3       3060449   6293956697   2015105986   1636350955   79912225     2474492
Cluster 4       1773937   1058346213   1275299493   1212317249   6856982634   156958
Total density   3896873   3486653421   6819713511   8230647036   5981503534   5440447

Among these indicators, the difference in balance is contributed by the difference in grid cell number between the clusters. Meanwhile, we assign each indicator a proportional weight $\omega$ to adjust the evaluation result $G_{\text{net}}$. The $\omega$ values are to be tuned by the users depending on their interests. For example, if a very wide coverage is the priority and the other aspects are of less concern, the weight $\omega_c$ can take a relatively very large value, or even 1. If users consider some attributes more important, the corresponding weights $\omega$ for those factors can be larger than the others. Overall, $G_{\text{net}}$, the sum of the performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered.

$$G_l = \left|\frac{\text{Likelihood}}{\text{Time}}\right|, \tag{5}$$

$$G_b = \frac{\text{Difference of Balance}}{\text{Time}}, \tag{6}$$

$$G_d = \frac{\text{Density}}{\text{Time}}, \tag{7}$$

$$G_c = \frac{\text{Coverage}}{\text{Time}}, \tag{8}$$

$$G_o = \frac{\text{Overlap}}{\text{Time}}, \tag{9}$$

$$G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o, \tag{10}$$

$$\text{subject to } \omega_l + \omega_d + \omega_b + \omega_c + \omega_o = 1. \tag{11}$$

From the results of the spatial grouping experiments in the previous sections, we obtain statistical information on each group, based on the second dataset, as the range of indicators defined in (5) to (11). They are shown in Table 12, which allows us to compare the various methods and performance aspects easily.

In Table 12, the KM method has the best run time and no overlap, and the same holds for the XM method; DBScan and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between the clusters. In order to further verify the correctness of the above analysis, the performance indicators $G_l$, $G_b$, $G_d$, $G_c$, and $G_o$ are computed to obtain the net performance value $G_{\text{net}}$ of each method, assuming equal weights. For the sake of easy comparison, $G_{\text{net}}$ is normalized by first setting the lowest $G_{\text{net}}$ among the six methods as the base value 1; the $G_{\text{net}}$ of the other methods is then scaled accordingly. The comparison result is shown in Table 13.
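The computation behind Table 13 can be sketched as follows; the indicator formulas follow (5) to (11), while the statistics dictionary and its numbers are illustrative placeholders rather than the measured values:

```python
def g_net(stats, w):
    # stats: likelihood, balance_diff, density, coverage, overlap, and time of
    # one method; w: the weights of (11), which must sum to one.
    t = stats["time"]
    g = {
        "l": abs(stats["likelihood"]) / t,   # (5)
        "b": stats["balance_diff"] / t,      # (6)
        "d": stats["density"] / t,           # (7)
        "c": stats["coverage"] / t,          # (8)
        "o": stats["overlap"] / t,           # (9)
    }
    assert abs(sum(w.values()) - 1.0) < 1e-9
    return sum(w[k] * g[k] for k in g)       # (10)

weights = dict(l=0.2, b=0.2, d=0.2, c=0.2, o=0.2)   # equal weights
methods = {                                          # toy, illustrative stats
    "KM": dict(likelihood=-17.4, balance_diff=1.9, density=3.9, coverage=0.60, overlap=0.0, time=0.41),
    "HC": dict(likelihood=-20.1, balance_diff=1.0, density=6.0, coverage=0.68, overlap=0.3, time=14.8),
}
scores = {m: g_net(s, weights) for m, s in methods.items()}
base = min(scores.values())
print({m: round(v / base, 2) for m, v in scores.items()})  # lowest scaled to 1
```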

According to the experimental results conducted so far, LP appears to be the best candidate in almost all aspects, such as coverage and balance; this was tested across different datasets, data formats, and dataset sizes. For density and log-likelihood, however, the results are not so consistent, as LP is outperformed by DBScan at times. Finally, by the net result of $G_{\text{net}}$, LP is the better choice under an overall consideration of the six performance factors. The weights, which express priorities or preferences over the performance aspects, should be set at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the values collected by the sensors indicate how important each local proximity is. Given this information, the users of the sensor network may subsequently want to form spatial clusters for various purposes.


Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density   Time (s)   Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3896873   0.41       -1735            No        190
XM       0.533037   3486653   0.67       -1722            No        185
EM       0.507794   6819714   1.23       -1657            Yes       1216
DBScan   0.461531   8230647   15.67      -1754            Yes       2517
HC       0.677124   5981504   14.78      -2013            Yes       103
LP       0.711025   5440447   7.76       N/A              No        0

Table 13: Comparison of the different clustering and LP methods by the $G_{\text{net}}$ indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32

Such purposes include resource allocation, distribution evaluations, and summarizing the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, with certain sizes and positions, by using clustering algorithms or their equivalent, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under given factors and weights may vary, as the factors can be arbitrarily chosen by users.

The spatial groups obtained by classic clustering algorithms have some limitations, such as overlaps, which may cause resources to be wasted and even false grouping. To the best of the authors' knowledge, however, no study reported in the literature has used the linear programming (LP) method to discover spatial groups and so overcome this limitation of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage and completely avoid overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.

For a future extended study, we want to enhance the algorithm further, for example, by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It would be good if the advantages of one algorithm could carry over to the others in the new fusion algorithms to be developed.

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.

[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.

[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.

[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.

[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.

[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.

[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.

[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.

[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.

[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.

[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.

[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.

[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.

[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.

[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.

[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.

[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.

[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.

[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.

[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.

[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale, event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.

[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.

[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.

[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.

[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.

[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.



[25] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining pp 226ndash231 1996

[26] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963

[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial group on dataset 2 from the results of using DBScan. (f) Spatial group in the LP method on dataset 2.


Total coverage counts the traffic volumes that are covered by all the clusters, minus the overlap, if any. The corresponding definitions are shown in the equations below:

$$\text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)},$$
$$\text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Grid Cell Number}},$$
$$\text{Total Coverage} = \sum \text{Traffic Volumes} - \text{Overlaps},$$
$$\text{Balance}(\text{cluster } i) = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}}. \quad (4)$$
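For illustration, the following minimal Python sketch computes these per-cluster indicators from a flat array of cell volumes and a hard cluster assignment. The array layout and function name are our own assumptions; with non-overlapping labels, the overlap term of Total Coverage is zero.

```python
import numpy as np

def cluster_metrics(volumes, labels):
    """Per-cluster indicators of (4).

    volumes: 1-D array of traffic volume per grid cell (layout assumed).
    labels:  1-D array of hard cluster ids per cell; -1 marks unclustered cells.
    """
    total_cells = volumes.size
    metrics = {}
    for c in np.unique(labels[labels >= 0]):
        in_c = labels == c
        vol_c = volumes[in_c].sum()
        metrics[int(c)] = {
            "density": vol_c / in_c.sum(),        # volume per cell inside cluster c
            "coverage": vol_c / total_cells,      # cluster volume over all grid cells
            "balance": in_c.sum() / total_cells,  # proportion of cluster size
        }
    return metrics

# Toy usage: 10 cells, two clusters of sizes 4 and 3, three cells unclustered
vols = np.array([5., 2., 0., 1., 7., 3., 0., 4., 1., 2.])
labs = np.array([0, 0, 0, 0, 1, 1, 1, -1, -1, -1])
print(cluster_metrics(vols, labs))
```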

6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we selected four different formats of datasets on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means that every four neighboring cells over a grid are merged into a single unit, and RasterP (25 grids) means that every five neighboring cells are merged as one. In the latter two formats, the data information is laid directly on a grid, and some noise, such as outlier values, is eliminated from the grid; we selected grids of sizes 16 and 25 for these two formats. The original datasets are then encoded in the four different data formatting types. The four formatted datasets are subjected to the five clustering methods and the LP method, and we measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively.
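As an illustration of the RasterP encodings, the sketch below merges fixed-size neighborhoods of a raster grid by summation. It assumes that "16 grids" means 4 × 4 blocks and "25 grids" means 5 × 5 blocks, which is our reading of the format, not a detail stated in the text.

```python
import numpy as np

def to_rasterp(grid, block):
    """Merge each `block` x `block` neighborhood of a raster grid into one
    unit by summing its cells; ragged border cells are dropped."""
    h = (grid.shape[0] // block) * block
    w = (grid.shape[1] // block) * block
    g = grid[:h, :w]
    return g.reshape(h // block, block, w // block, block).sum(axis=(1, 3))

rng = np.random.default_rng(1)
grid = rng.integers(0, 100, size=(20, 20))      # stand-in traffic-volume grid
print(to_rasterp(grid, 4).shape, to_rasterp(grid, 5).shape)  # (5, 5) (4, 4)
```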

According to Table 3, KM spent the least running time on all four kinds of data, and the runtime on the RasterP (25 grids) dataset is the fastest. Conversely, clustering the vector dataset using the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the different datasets and DBScan took the longest.

In Table 4 we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best case, while clustering RasterP (25 grids) using DBScan is the worst.
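The clustering tool used for the experiments is not reproduced here; purely as an illustration of the metric, scikit-learn's GaussianMixture reports an average per-sample log-likelihood that plays the same role.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.random((500, 2))                        # stand-in for (x, y) cell data
gm = GaussianMixture(n_components=5, random_state=0).fit(X)
print(gm.score(X))  # average per-sample log-likelihood; less negative = better
```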

In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset by duplicating the data map to larger sizes. Running time trends are thereby produced; the results are shown in Table 7 and the corresponding trend lines in Figure 14.
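A sketch of this stress test, with K-means standing in for the clustering method under measurement and synthetic data in place of the real data map, might look as follows.

```python
import time
import numpy as np
from sklearn.cluster import KMeans

def running_time(X, k=5):
    """Wall-clock time of one clustering run (K-means as the stand-in)."""
    t0 = time.perf_counter()
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return time.perf_counter() - t0

base = np.random.default_rng(0).random((2000, 2))       # stand-in data map
for m in (1, 2, 4, 8):                                  # enlarge by duplication
    X = np.vstack([base + [0.0, i] for i in range(m)])  # tile shifted copies
    print(len(X), round(running_time(X), 3))
```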

According to Table 5, KM spent the shortest running time on all four formats of data, and the time on the RasterP (25 grids) dataset is the fastest, which is expected because it abstracts every 25 cells into one. On the other hand, clustering the Raster dataset using the DBScan method spent the most running time. Across the six methods, KM spent the shortest time on the different datasets and DBScan generally spent the longest.

Figure 14: Comparison of running time (in seconds) for different sizes of dataset (K-means, Hierarchical, DBScan, XMean, EM, and LP, each fitted with an exponential trend line).

In Table 6 we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best case, while clustering RasterP (25 grids) using KM is the worst.

In Table 7 we can see that the slowest method is DBScan and the quickest is KM. In terms of time trend, DBScan's time consumption grows in larger magnitude than that of the other methods, while the trend lines of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: once the dataset size exceeds the amount at the intersection, the EM method becomes a better choice than HC.

The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.

From Figure 15 we can see that, among the results of the six methods on the first dataset, one cluster of DBScan dominates the biggest coverage of all clusters, whereas for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is noticeably larger than in the first dataset (Tables 8 and 9). This means that the second dataset, with its even data distribution, is well suited for forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.


Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2          KM        EM        DBScan    XM        HC        LP
Cluster 0        0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1        0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2        0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3        0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4        0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage   0.550000  0.573301  0.470000  0.583900  0.469000  0.599700

Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.

Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.


Figure 17: Proportions of cluster sizes (balance) of dataset 1, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

From Figure 16(a) we can see that one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, but the LP method obtains the largest total density, drawn evenly from all its clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset: DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.


Figure 18: Proportions of cluster sizes (balance) of dataset 2, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely absolute balance across the spatial groups.

6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference of grid cell number in each cluster. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if a very wide coverage is the priority and the others are of less concern, ω_c can take a relatively very large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of the performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered.

Table 10: Numeric results of density of each cluster by using the six methods for dataset 1 (values as extracted; decimal points were lost in the source).

Density         KM       EM       DBScan   XM       HC       LP
Cluster 0       5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1       1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2       7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3       2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4       5969350  142054   0170455  1510576  4088438  1239180
Total density   1204343  1400359  4729787  1146972  1030703  6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2 (values as extracted; decimal points were lost in the source).

Density         KM       XM          EM          DBScan      HC          LP
Cluster 0       1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1       1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2       1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3       3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4       1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density   3896873  3486653421  6819713511  8230647036  5981503534  5440447

$$G_l = \left|\frac{\text{Likelihood}}{\text{Time}}\right|, \quad (5)$$
$$G_b = \frac{\text{Difference of Balance}}{\text{Time}}, \quad (6)$$
$$G_d = \frac{\text{Density}}{\text{Time}}, \quad (7)$$
$$G_c = \frac{\text{Coverage}}{\text{Time}}, \quad (8)$$
$$G_o = \frac{\text{Overlap}}{\text{Time}}, \quad (9)$$
$$G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o, \quad (10)$$
$$\text{subject to the constraint } \omega_l + \omega_d + \omega_b + \omega_c + \omega_o = 1. \quad (11)$$
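A direct transcription of (5)–(10) into code is given below as a minimal sketch; the argument and weight names are ours, and equal weights are used as the default, as in the experiments.

```python
def g_net(likelihood, balance_diff, density, coverage, overlap, time, weights=None):
    """Net indicator G_net of (10); `weights` must satisfy constraint (11)."""
    w = weights or {"l": 0.2, "b": 0.2, "d": 0.2, "c": 0.2, "o": 0.2}
    assert abs(sum(w.values()) - 1.0) < 1e-9    # constraint (11)
    g = {
        "l": abs(likelihood / time),            # (5)
        "b": balance_diff / time,               # (6)
        "d": density / time,                    # (7)
        "c": coverage / time,                   # (8)
        "o": overlap / time,                    # (9)
    }
    return sum(w[k] * g[k] for k in g)          # (10)
```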

From the results of the spatial grouping experiments in the previous sections, we obtain statistical information on each group based on the second dataset, expressed as the range of indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods across the performance aspects.

In Table 12, the KM method has the best running time and no overlap, while XM, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance among the clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison results are shown in Table 13.
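The normalization step can be expressed in one line; the raw scores below are hypothetical values chosen only so that the scaled results reproduce Table 13.

```python
# Hypothetical raw G_net scores; the lowest score becomes the base value 1
raw = {"KM": 0.540, "XM": 0.575, "EM": 0.555, "DBScan": 0.615, "HC": 0.500, "LP": 0.660}
base = min(raw.values())
print({m: round(v / base, 2) for m, v in raw.items()})
# {'KM': 1.08, 'XM': 1.15, 'EM': 1.11, 'DBScan': 1.23, 'HC': 1.0, 'LP': 1.32}
```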

According to the experimental results conducted so far, LP appears to be the best candidate in almost all aspects, such as coverage and balance; this was tested across different datasets, different formats, and different dataset sizes. However, for density and log-likelihood the result is less consistent, as LP is outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which express priorities or preferences over the performance aspects, should be chosen at the user's discretion.

Table 12: Performance indicators of the six methods based on dataset 2 (density values as extracted; other columns reconstructed with two decimal places).

Method   Coverage   Density   Time (s)   Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3896873   0.41       −17.35           No        1.90
XM       0.533037   3486653   0.67       −17.22           No        1.85
EM       0.507794   6819714   1.23       −16.57           Yes       12.16
DBScan   0.461531   8230647   15.67      −17.54           Yes       25.17
HC       0.677124   5981504   14.78      −20.13           Yes       1.03
LP       0.711025   5440447   7.76       N/A              No        0

Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the values collected by the sensors indicate how important the data are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for purposes such as resource allocation, distribution evaluations, or summarizing the geographical data into groups. The focus of this study was to design efficient methods that identify such optimal spatial groups, of certain sizes and positions, using clustering algorithms or their equivalent, for obtaining maximum total coverage. Examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributing sensors that monitor the traffic volumes over a city; and planning security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient for achieving the optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups, with different values of data resources, were then assessed via six performance factors, with weights formulated as the factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utilities theory); the resulting performance may vary with the chosen factors and weights, as these can be arbitrarily chosen by users.

The spatial groups obtained by classic clustering algorithms have some limitations, such as overlaps, which may cause resources to be wasted and even false grouping. To the best of the authors' knowledge, however, no study reported in the literature has used a linear programming (LP) method to discover spatial groups and thereby overcome this limitation of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation considering multiple attributes was used to assess the grouping results.
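The full LP formulation is given earlier in the paper; the toy model below only illustrates the no-overlap idea. It relaxes binary cell-to-group assignment variables to [0, 1] (the transportation-like constraint structure still yields integral optima), maximizes the covered volume, forces each cell into at most one group, and fixes equal group sizes; the spatial contiguity of groups, which the real formulation must respect, is ignored here.

```python
import numpy as np
from scipy.optimize import linprog

v = np.array([9., 7., 5., 4., 3., 1.])     # toy cell volumes (6 cells)
n_cells, n_groups, quota = len(v), 2, 2    # two groups of two cells each

c = -np.tile(v, n_groups)                  # maximize covered volume
# each cell assigned to at most one group -> no overlap
A_ub = np.kron(np.ones((1, n_groups)), np.eye(n_cells))
b_ub = np.ones(n_cells)
# each group takes exactly `quota` cells -> perfect balance
A_eq = np.kron(np.eye(n_groups), np.ones((1, n_cells)))
b_eq = np.full(n_groups, quota)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x.reshape(n_groups, n_cells).round())  # picks the 4 richest cells
```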

For future extended study, we want to further enhance the algorithm, for example, by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. Ideally, the advantages of each algorithm would carry over to the others in the new fusion algorithms to be developed.

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.

[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.

[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.

[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.

[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.

[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.

[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.

[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.

[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.

[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.

[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.

[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.

[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.

[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.

[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.

[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.

[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.

[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.

[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.

[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.

[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.

[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.

[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.

[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.

[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.

[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

14 International Journal of Distributed Sensor Networks

traffic volumes that are covered by all the clusters minus theoverlap if anyThe corresponding definitions are shown in theequations below

Density (cluster 119894) =sumTraffic Volumes (cluster 119894)

Grid Cell Number (cluster 119894)

Coverage (cluster 119894) =sumTraffic Volumes (cluster 119894)

sumGrid Cell Number

Total Coverage = sumTraffic Volumes minusOverlaps

Proportion of Cluster (119894) Size (Balance)

=Grid Cell Number (cluster 119894)

sumGrid Cell Number

(4)

62 Comparison Experimental Result After conducting anumber of experiment runs we select four different formatsof datasets to perform the clustering algorithm for the firstdataset Vector (119899 V) represents sequence 119899 and traffic volumeV Raster (119909 119910 V) represents coordinates (119909 119910) and trafficvolume V RasterP (16 grids) means every four neighborhoodcells over a grid merged into a single unit and RasterP(25 grids) means every five neighborhood cells over a gridmerged as one In the other two types of formats the datainformation is straightforwardly laid on a grid and somenoises such as outlier values are eliminated from the gridWe selected grids of sizes 16 and 25 for the two formatsThe original datasets are then encoded by the four differentdata formatting types The four formatted data are subject tothe five clustering methods and LP method We measure thecorresponding running time and log-likelihood The resultsof the two measurements are shown in Tables 3 and 4respectively

According to Table 3 we can see that KM spent the leastrunning time for the four different kinds of data but the run-time of RasterP (25 grids) dataset is the fastest Contrariwiseclustering of vector dataset using DBScan method spent thelongest running time Among the clustering methods KMspent the least time for different datasets and DBScan tookthe longest

In Table 4 we evaluate the log-likelihood of the clustersfound by each cluster which is a main evaluation metric forensuring quantitatively the quality of the clusters From thistable we can see that the value of log-likelihood of the fivemethods is quite similar Among them clustering of Rasterdataset using HC method is the best one but clustering ofRasterP (25 grids) using DBScan is the worst one

In the same experimental environment the running timeand log-likelihood are shown in Tables 5 and 6 for the seconddataset And in order to stressfully test the performance weelongate the dataset to larger sizes by expanding the datamap via duplication Running time trends are thereforeproduced the result is shown in Table 7 and correspondingtrend line is shown in Figure 14

According to Table 5 we can see that KM spent theshortest running time for the four different formats of databut the time of RasterP (25 grids) dataset is the fastest whichis expected because it abstracts every 25 cells into one On

0

100

200

300

400

500

600

700

800

900

0 20000 40000 60000 80000 100000

K-meansHierarchicalDBScanXMeanEM

Exp (Exp (Hierarchical)Exp (DBScan)Exp (XMean)Exp (EM)Exp (LP)LP

K-means)

Figure 14 Comparison of running time (in seconds) of differentsizes of dataset

the other hand clustering of Raster dataset using DBScanmethod spent the most running time For the different sixmethods KM spent the shortest time for different datasetsand DBScan spent the longest time generally

In Table 6 we can see that the values of log-likelihoodof different six methods are quite similar Among themclustering of Raster dataset using HCmethod is the best onebut clustering of RasterP (25 grids) usingKM is theworst one

In Table 7 we can see that the slowest is DBScan andthe quickest is KM method In terms of time trend DBScanincreases in larger magnitude of time consumption thanother methods but time trends of LP KM and XM are oflower gradients In particular there is an intersection betweenthe trend lines of HC and EM It means that when the size ofdataset exceeds that amount at the intersection EM methodbecomes a better choice than HC

The following charts and tables present the other techni-cal indicators such as coverage density and balance of eachcluster for the two datasets

From Figure 15 we can see that one cluster of DBScandominates the biggest coverage in all clusters as results fromthe sixmethods in the first dataset But for the second datasetLP method yields the biggest coverage cluster Generally theindividual coverage of each cluster in the second dataset isapparently larger than those resulted from the first dataset(Tables 8 and 9)Thismeans that the second dataset is suitablefor achieving spatial groups with the six methods due to itseven data distribution In terms of total coverage LP achievesthe highest values in both cases of datasets In summary LPis by far an effective method to determine spatial groups withthe best coverage

International Journal of Distributed Sensor Networks 15

Table 9 Numeric results of coverage of each cluster by using the six methods for dataset 2

Cov-db2 KM EM DBScan XM HC LPCluster 0 0042721 0001777 0450720 0022150 0013153 0165305Cluster 1 0094175 0086211 0008018 0010064 0026016 0127705Cluster 2 0328026 0032893 0010517 0126953 0124360 0095597Cluster 3 0022797 0351221 0000501 0311761 0001172 0089008Cluster 4 0062281 0101199 0000244 0112973 0304300 0122085Total coverage 0550000 0573301 0470000 0583900 0469000 0599700

0

01

02

03

04

05

06

07

08

KM EM DBScan XM HC LP

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4Total coverage

CoverageCoverage

(a)

0

01

02

03

04

05

06

KM EM DBScan XM HC LP

Coverage

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4Total coverage

Coverage

(b)

Figure 15 (a) Coverage of each cluster by using the sixmethods for dataset 1 (b) Coverage of each cluster by using the sixmethods for dataset2

Total density

0

200

400

600

800

1000

1200

1400

1600

KM EM DBScan XM HC LP

Density

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4

Density

(a)

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

KM EM DBScan XM HC LP

Density

Total density

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4

Density

(b)

Figure 16 (a) Density of each cluster by using the six methods for dataset 1 (b) Density of each cluster by using the six methods for dataset2

16 International Journal of Distributed Sensor Networks

4

5136

8

1

Balance test on dataset 1

(a) KM

1

50

1

18

30

Balance test on dataset 1

(b) XM

6

22

24

30

18

Balance test on dataset 1

(c) EM

24

24

17

20

15

Balance test on dataset 1

(d) DBScan

18

17

22

19

25

Balance test on dataset 1

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(e) HC

20

20

20

20

20

Balance test on dataset 1

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(f) LP

Figure 17 Proportions of cluster sizes (balance) of dataset 1 in by using (a) KM (b) XM (c) EM (d) DBScan (e) HC and (f) LP

From Figure 16(a) we can see that one cluster of EMoccupies the biggest density in all clusters of the six methodsin the first dataset But the LPmethod obtains the largest totaldensity evenly from all the clusters Generally the individualdensity of each cluster in the second dataset is much biggerthan that of the first dataset (Tables 10 and 11) Again it means

that the second dataset has an even data distribution that issuitable for achieving spatial groups with high density Andin terms of total density EM is the best performer in the firstdataset but DBScan achieves the best results in the seconddataset DBScan has an advantage of merging scattered datainto density groups as long as the data are well scattered

International Journal of Distributed Sensor Networks 17

17

18

17

24

24

Balance test on dataset 2

(a) KM

24

18

24

18

17

Balance test on dataset 2

(b) XM

47

032

47

Balance test on dataset 2

(c) EM

1010

98

Balance test on dataset 2

(d) DBScan

23

15

24

18

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(e) HC

20

20

20

20

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(f) LP

Figure 18 Proportions of Cluster Sizes (Balance) of dataset 2 in by using (a) KM (b) XM (c) EM (d) DBScan (e) HC (f) LP

The last evaluation factor is balance the results areshown in Figures 17 and 18 For both datasets only LPmethod can achieve absolute balance for spatial groups com-pletely

63 Discussion of G119899119890119905 For all the six evaluation factors each

of them can be an individual measure to decide whethera method is good or not in certain aspect In general thefollowing indicators (from (5) to (11)) have been defined in

18 International Journal of Distributed Sensor Networks

Table 10 Numeric results of density of each cluster by using the six methods for dataset 1

Density KM EM DBScan XM HC LPCluster 0 5258648 0080823 4426289 3431892 2713810 1677869Cluster 1 1161390 2329182 0994949 1375497 3501739 1296230Cluster 2 7186556 2545750 0807500 1218667 2728017 9703279Cluster 3 2572683 1232386 1062069 5171040 4265905 9034426Cluster 4 5969350 142054 0170455 1510576 4088438 1239180Total density 1204343 1400359 4729787 1146972 1030703 6087049

Table 11 Numeric results of density of each cluster by using the six methods for dataset 2

Density KM XM EM DBScan HC LPCluster 0 1925445 2476642081 396813638 1972394643 5323785326 331318Cluster 1 1972395 1763496208 1502698729 1972394643 2140482869 166788Cluster 2 1408149 106489095 1629795665 1437189548 1823821619 8097989Cluster 3 3060449 6293956697 2015105986 1636350955 79912225 2474492Cluster 4 1773937 1058346213 1275299493 1212317249 6856982634 156958Total density 3896873 3486653421 6819713511 8230647036 5981503534 5440447

order to evaluate which method is an appropriate choicewhen it comes to different datasets and different usersrsquorequirements Among them the difference in balance iscontributed by the difference of grid cell number in eachcluster Meanwhile we assign each of them a proportionalweight 120596 to adjust the evaluation result 119866net The 120596 value isto be tuned by the users depending on their interests Forexample if a verywide coverage is of priority and others are ofless concern 119866

119888can take a relatively very large value or even

1 If users consider that some attributes are more importantthe corresponding weights 120596 for some factors can be largerthan the others Overall 119866net which is the sum of all factorsmultiplied by the corresponding performance indicators is anet indicator signifying how good a clustering process is byconsidering all the performance attributes

119866119897=

10038161003816100381610038161003816100381610038161003816

LikelihoodTime

10038161003816100381610038161003816100381610038161003816

(5)

119866119887=Difference of Balance

Time (6)

119866119889=DensityTime

(7)

119866119888=CoverageTime

(8)

119866119900=OverlapTime

(9)

119866net = 120596119897119866119897+ 120596119889119866119887+ 120596119889lowast 119866119889+ 120596119888119866119888+ 120596119900119866119900 (10)

Constraint 120596119897+ 120596119889+ 120596119887+ 120596119888+ 120596119900= 1 (11)

From the results of spatial grouping as experimented inthe previous sections we obtain some statistic informationon each group based on the second dataset as a range ofindicators depicted from (5) to (11) They are shown in

Table 12 which allows us to easily compare various methodsand performance aspects

In Table 12 KM method has the best run time and nooverlap For XMmethod DBScan and HC demonstrate theiradvantage in density and log-likelihood Nevertheless LPmethod is superior in three aspects coverage no overlapand zero difference of balance with other clusters In orderto further verify the correctness of the above analysis theperformance indicators 119866

119897 119866119887 119866119889 119866119888 and 119866

119900are computed

for obtaining the net performance values119866net assuming equalweights for each method For the sake of easy comparison119866net is normalized by first setting the lowest 119866net amongthe six methods as base value 1 then the 119866net for the othermethods is scaled up accordingly The comparison result isshown in Table 13

According to the experiment results conducted so farLP seems to be the best candidate in almost all the aspectssuch as coverage and balance This is tested across differentdatasets different formats and different sizes of datasetHowever for density and log-likelihood the result is not soconsistent as LP would be outperformed byDBScan at timesFinally by the net result of 119866net LP is a better choice underthe overall consideration of the six performance factorsThe choice of weights which imply priorities or preferenceson the performance aspects should be chosen by the userrsquosdiscretion

7 Conclusion and Future Works

Ubiquitous sensor network generated data that inherentlyhave spatial information When they are viewed afar thelocalizations of the data form some densities spatially dis-tributed over a terrain and the collected data from thesensors indicate how important the values are in their localproximity Given this information the users of the sensornetwork may subsequently want to form spatial clusters for

International Journal of Distributed Sensor Networks 19

Table 12 Performance indicators of the six methods based on dataset 2

Method Coverage Density Time Log-likelihood Overlap Diff of balanceKM 0595751 3896873 041 minus1735 No 190XM 0533037 3486653 067 minus1722 No 185EM 0507794 6819714 123 minus1657 Yes 1216DBScan 0461531 8230647 1567 minus1754 Yes 2517HC 0677124 5981504 1478 minus2013 Yes 103LP 0711025 5440447 776 NA No 0

Table 13 Comparison of different clustering and LP methods by119866net indicator

Methods KM XM EM DBScan HC LP119866net 108 115 111 123 100 132

purposes such as resource allocation distribution evalua-tions or summing up the geographical data into groups Thefocus of this study was to design efficient methods to identifysuch optimal spatial groups that have certain sizes andpositions using clustering algorithms or the equivalent forobtaining maximum total coverage in total Some examplesinclude but are not limited to setting up mobile phonebase stations among an even distribution of mobile phoneusers each may have different demand in usage distributedsensors that monitor the traffic volumes over a city andsecurity patrols in an exhibition where the asset values tobe protected vary and are distributed over a large area Thestudy also investigated whether spatial groups identified byusing different methods are sufficiently efficient for achievingoptimal maximum coverage Five classic spatial groupingalgorithms are discussed and compared in this study by usingdata mining software programsThe identified spatial groupswith different values of data resources were then assessedvia six performance factors Weights were also formulated asfactor coefficients The factors adopted were shown to playa significant role in MAUT (multiattribute utilities theory)The performance under proper factors and weights may varyas the factors could be arbitrarily chosen by users

The spatial groups obtained by classic clustering algo-rithms have some limits such as overlaps It may causeresource being wasted and even false grouping Howeverthere has been no study reported in the literature that theauthors are aware of using linear programming (LP) methodto discover spatial groups and to overcome this limit ofoverlappingThus in this research we implemented this newmethod (LP) to obtain spatial groups for yielding maximumcoverage and completely avoiding overlap A rigorous evalu-ation was used to assess the grouping results by consideringmultiple attributes

For future extended study we want to further enhancethe algorithm such as combining LP method with existingspatial group algorithms to achieve new hybrid algorithmSome clustering algorithms (eg KM) are known to convergequickly and LP though not the quickest it is efficient infinding the optimal groupings without any overlap It will be

good if the advantages from one algorithm to ride over theothers in the new fusion algorithms are to be developed

References

[1] G J Pottie and W J Kaiser ldquoWireless integrated network sen-sorsrdquo Communications of the ACM vol 43 no 5 pp 51ndash582000

[2] K H Eom M C Kim S J Lee and C W Lee ldquoThe vegetablefreshness monitoring system using RFID with oxygen andcarbon dioxide sensorrdquo International Journal of DistributedSensor Networks vol 2012 Article ID 472986 6 pages 2012

[3] G Manes G Collodi R Fusco L Gelpi and A Manes ldquoAwireless sensor network for precise volatile organic compoundmonitoringrdquo International Journal of Distributed Sensor Net-works vol 2012 Article ID 820716 13 pages 2012

[4] Y-G Ha H Kim and Y-C Byun ldquoEnergy-efficient fire mon-itoring over cluster-based wireless sensor networksrdquo Interna-tional Journal of Distributed Sensor Networks vol 2012 ArticleID 460754 11 pages 2012

[5] A Wahid and D Kim ldquoAn energy efficient localization-freerouting protocol for underwater wireless sensor networksrdquoInternational Journal of Distributed Sensor Networks vol 2012Article ID 307246 11 pages 2012

[6] T N Tran R Wehrens and L M C Buydens ldquoSpaRef a clus-tering algorithm for multispectral imagesrdquo Analytica Chimi-ca Acta vol 490 no 1-2 pp 303ndash312 2003

[7] G Ayala I Epifanio A Simo and V Zapater ldquoClusteringof spatial point patternsrdquo Computational Statistics and DataAnalysis vol 50 no 4 pp 1016ndash1032 2006

[8] J Domingo G Ayala and M E Dıaz ldquoMorphometric analysisof human corneal endothelium by means of spatial point pat-ternsrdquo International Journal of Pattern Recognition and ArtificialIntelligence vol 16 no 2 pp 127ndash143 2002

[9] E Demir C Aykanat and B Barla Cambazoglu ldquoClusteringspatial networks for aggregate query processing a hypergraphapproachrdquo Information Systems vol 33 no 1 pp 1ndash17 2008

[10] T Hu and S Y Sung ldquoA hybrid EM approach to spatial clus-teringrdquo Computational Statistics and Data Analysis vol 50 no5 pp 1188ndash1205 2006

[11] G Lin ldquoComparing spatial clustering tests based on rare tocommon spatial eventsrdquo Computers Environment and UrbanSystems vol 28 no 6 pp 691ndash699 2004

[12] M Ester and H-P Kriegel ldquoClustering for mining in largespatial databases [Special Issue on Data Mining]rdquo KI-Journalvol 1 pp 332ndash338 1998

[13] J Han M Kamber and A K H Tung ldquoSpatial clusteringmethods in data mining a surveyrdquo Tech Rep ComputerScience Simon Fraster University 2000

20 International Journal of Distributed Sensor Networks

[14] H-D Yang and F-Q Deng ldquoThe study on immune spatialclustering model based on obstaclerdquo in Proceedings of theInternational Conference on Machine Learning and Cyberneticsvol 2 pp 1214ndash1219 August 2004

[15] T-S Chen T-H Tsai Y-T Chen et al ldquoA combined K-meansand hierarchical clusteringmethod for improving the clusteringefficiency of microarrayrdquo in Proceedings of the InternationalSymposium on Intelligent Signal Processing and CommunicationSystems (ISPACS rsquo05) pp 405ndash408 HongKong China Decem-ber 2005

[16] M Srinivas and C K Mohan ldquoEfficient clustering approachusing incremental and hierarchical clustering methodsrdquo inProceedings of the International Joint Conference on NeuralNetworks (IJCNN rsquo10) pp 1ndash7 July 2010

[17] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[18] A Hinneburg and D A Keim ldquoAn efficient approach to clus-tering in large multimedia databases with noiserdquo in Proceedingsof the International Conference Knowledge Discovery and DataMining pp 58ndash65 1998

[19] K Elangovan GIS Fundamentals Applications and Implemen-tations 2006

[20] S Chawla and S Shekhar ldquoModeling spatial dependencies formining geospatial data an introductionrdquo Geographic DataMining and Knowledge Discovery vol 75 no 6 pp 112ndash1201999

[21] M-Y Cheng and G-L Chang ldquoAutomating utility route designand planning throughGISrdquoAutomation in Construction vol 10no 4 pp 507ndash516 2001

[22] Q Cao B Bouqata P D Mackenzie D Messier and J J SalvoldquoA grid-based clusteringmethod formining frequent trips fromlarge-scale event-based telematics datasetsrdquo in Proceedingsof the IEEE International Conference on Systems Man andCybernetics (SMC rsquo09) pp 2996ndash3001 San Antonio Tex USAOctober 2009

[23] K Krishna and M N Murty ldquoGenetic K-means algorithmrdquoIEEE Transactions on Systems Man and Cybernetics B vol 29no 3 pp 433ndash439 1999

[24] D Pelleg and A W Moore ldquoX-means extending KM withefficient estimation of the number of clustersrdquo in Proceedingsof the 70th International Conference on Machine Learning pp727ndash734 2000

[25] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining pp 226ndash231 1996

[26] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963

[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

International Journal of Distributed Sensor Networks 15

Table 9 Numeric results of coverage of each cluster by using the six methods for dataset 2

Cov-db2 KM EM DBScan XM HC LPCluster 0 0042721 0001777 0450720 0022150 0013153 0165305Cluster 1 0094175 0086211 0008018 0010064 0026016 0127705Cluster 2 0328026 0032893 0010517 0126953 0124360 0095597Cluster 3 0022797 0351221 0000501 0311761 0001172 0089008Cluster 4 0062281 0101199 0000244 0112973 0304300 0122085Total coverage 0550000 0573301 0470000 0583900 0469000 0599700

0

01

02

03

04

05

06

07

08

KM EM DBScan XM HC LP

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4Total coverage

CoverageCoverage

(a)

0

01

02

03

04

05

06

KM EM DBScan XM HC LP

Coverage

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4Total coverage

Coverage

(b)

Figure 15 (a) Coverage of each cluster by using the sixmethods for dataset 1 (b) Coverage of each cluster by using the sixmethods for dataset2

Total density

0

200

400

600

800

1000

1200

1400

1600

KM EM DBScan XM HC LP

Density

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4

Density

(a)

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

KM EM DBScan XM HC LP

Density

Total density

Cluster 0Cluster 1Cluster 2

Cluster 3Cluster 4

Density

(b)

Figure 16 (a) Density of each cluster by using the six methods for dataset 1 (b) Density of each cluster by using the six methods for dataset2

16 International Journal of Distributed Sensor Networks

4

5136

8

1

Balance test on dataset 1

(a) KM

1

50

1

18

30

Balance test on dataset 1

(b) XM

6

22

24

30

18

Balance test on dataset 1

(c) EM

24

24

17

20

15

Balance test on dataset 1

(d) DBScan

18

17

22

19

25

Balance test on dataset 1

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(e) HC

20

20

20

20

20

Balance test on dataset 1

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(f) LP

Figure 17 Proportions of cluster sizes (balance) of dataset 1 in by using (a) KM (b) XM (c) EM (d) DBScan (e) HC and (f) LP

From Figure 16(a), we can see that one cluster of EM has the biggest density among all clusters of the six methods in the first dataset, but the LP method obtains the largest total density, drawn evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than that in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer in the first dataset, but DBScan achieves the best result in the second dataset; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.
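The per-cluster density bookkeeping behind Tables 10 and 11 is straightforward to reproduce. Below is a minimal Python sketch; the density definition used here (total sensor value in a cluster divided by its bounding-box area), the parameter choices, and the random data are illustrative assumptions on our part, not the paper's exact formulation.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def cluster_densities(points, values, labels):
    """Per-cluster density under an assumed definition: total sensor
    value in the cluster divided by its bounding-box area."""
    out = {}
    for k in np.unique(labels):
        if k == -1:                       # DBSCAN labels noise as -1
            continue
        mask = labels == k
        pts = points[mask]
        width = np.ptp(pts[:, 0])         # bounding-box extent in x
        height = np.ptp(pts[:, 1])        # bounding-box extent in y
        area = max(width * height, 1e-9)  # guard single-point clusters
        out[int(k)] = values[mask].sum() / area
    return out

# Hypothetical sensor field: 500 random positions with random readings.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 100.0, size=(500, 2))
values = rng.uniform(0.0, 10.0, size=500)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(points)
db = DBSCAN(eps=5.0, min_samples=5).fit_predict(points)

print("KM densities:", cluster_densities(points, values, km))
print("DBScan densities:", cluster_densities(points, values, db))
```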

[Figure 18 shows six pie charts, one per method, of the cluster-size percentages for dataset 2; again only the LP chart consists of five equal 20% slices.]
Figure 18: Proportions of cluster sizes (balance) of dataset 2, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.

The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
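The balance profiles plotted in Figures 17 and 18 follow directly from the cluster labels, as the short sketch below illustrates. The balance_difference function is an assumed stand-in (largest minus smallest cluster size), since the paper's exact "difference of balance" formula is defined elsewhere, and both label vectors are hypothetical.

```python
import numpy as np

def balance_profile(labels):
    """Percentage share of each cluster, as plotted in Figures 17 and 18."""
    counts = np.bincount(labels[labels >= 0])
    return 100.0 * counts / counts.sum()

def balance_difference(labels):
    """Assumed stand-in for the 'difference of balance': the gap between
    the largest and smallest cluster sizes; zero means perfect balance."""
    counts = np.bincount(labels[labels >= 0])
    return int(counts.max() - counts.min())

# Hypothetical partitions of 100 grid cells into five groups.
lp_like = np.repeat(np.arange(5), 20)                  # 20/20/20/20/20
uneven  = np.repeat(np.arange(5), [4, 51, 36, 8, 1])   # skewed sizes

print(balance_profile(lp_like))      # [20. 20. 20. 20. 20.]
print(balance_difference(lp_like))   # 0 -> absolute balance, like LP
print(balance_difference(uneven))    # 50
```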

6.3. Discussion of G_net

Each of the six evaluation factors can serve on its own as a measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in the number of grid cells in each cluster. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the other factors are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered together.

Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density          KM         EM         DBScan     XM         HC         LP
Cluster 0        5258648    0080823    4426289    3431892    2713810    1677869
Cluster 1        1161390    2329182    0994949    1375497    3501739    1296230
Cluster 2        7186556    2545750    0807500    1218667    2728017    9703279
Cluster 3        2572683    1232386    1062069    5171040    4265905    9034426
Cluster 4        5969350    142054     0170455    1510576    4088438    1239180
Total density    1204343    1400359    4729787    1146972    1030703    6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density          KM         XM            EM            DBScan        HC            LP
Cluster 0        1925445    2476642081    396813638     1972394643    5323785326    331318
Cluster 1        1972395    1763496208    1502698729    1972394643    2140482869    166788
Cluster 2        1408149    106489095     1629795665    1437189548    1823821619    8097989
Cluster 3        3060449    6293956697    2015105986    1636350955    79912225      2474492
Cluster 4        1773937    1058346213    1275299493    1212317249    6856982634    156958
Total density    3896873    3486653421    6819713511    8230647036    5981503534    5440447

\[ G_l = \left| \frac{\text{Likelihood}}{\text{Time}} \right| \tag{5} \]
\[ G_b = \frac{\text{Difference of Balance}}{\text{Time}} \tag{6} \]
\[ G_d = \frac{\text{Density}}{\text{Time}} \tag{7} \]
\[ G_c = \frac{\text{Coverage}}{\text{Time}} \tag{8} \]
\[ G_o = \frac{\text{Overlap}}{\text{Time}} \tag{9} \]
\[ G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o \tag{10} \]
\[ \text{Constraint: } \omega_l + \omega_d + \omega_b + \omega_c + \omega_o = 1 \tag{11} \]

From the results of the spatial grouping experiments in the previous sections, we obtain statistical information on each group, based on the second dataset, expressed as the indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.

Table 12: Performance indicators of the six methods based on dataset 2.

Method    Coverage    Density    Time (s)   Log-likelihood   Overlap   Diff. of balance
KM        0.595751    3896873    0.41       −1735            No        190
XM        0.533037    3486653    0.67       −1722            No        185
EM        0.507794    6819714    1.23       −1657            Yes       1216
DBScan    0.461531    8230647    15.67      −1754            Yes       2517
HC        0.677124    5981504    14.78      −2013            Yes       103
LP        0.711025    5440447    7.76       N/A              No        0

In Table 12, the KM method has the best run time and no overlap, while the XM method, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance across clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as the base value 1; the G_net of the other methods is then scaled accordingly. The comparison result is shown in Table 13.

Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32
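The computation behind Table 13 can be sketched in a few lines of Python. Everything here is illustrative: the indicator values are placeholders rather than the paper's measured data, and the numeric encodings of overlap (0/1) and of LP's missing log-likelihood (0.0) are our assumptions.

```python
from dataclasses import dataclass

@dataclass
class Indicators:
    """Raw evaluation factors for one method (illustrative values only)."""
    likelihood: float    # log-likelihood; LP reports none, so use 0.0
    balance_diff: float  # difference of balance (0 = perfectly balanced)
    density: float
    coverage: float
    overlap: float       # assumed numeric encoding: 0 = no, 1 = yes
    time: float          # run time in seconds

def g_net(ind, w=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Equations (5)-(11): weighted sum of the time-normalized factors,
    with the weights (w_l, w_b, w_d, w_c, w_o) summing to 1."""
    w_l, w_b, w_d, w_c, w_o = w
    assert abs(sum(w) - 1.0) < 1e-9            # constraint (11)
    g_l = abs(ind.likelihood / ind.time)       # (5)
    g_b = ind.balance_diff / ind.time          # (6)
    g_d = ind.density / ind.time               # (7)
    g_c = ind.coverage / ind.time              # (8)
    g_o = ind.overlap / ind.time               # (9)
    return w_l*g_l + w_b*g_b + w_d*g_d + w_c*g_c + w_o*g_o   # (10)

# Hypothetical indicator values for two methods (not the paper's data).
methods = {
    "A": Indicators(-17.0, 19.0, 39.0, 0.60, 0.0, 0.4),
    "B": Indicators(-20.0, 10.0, 60.0, 0.68, 1.0, 1.5),
}
scores = {name: g_net(ind) for name, ind in methods.items()}
base = min(scores.values())                    # lowest score becomes 1
print({name: round(s / base, 2) for name, s in scores.items()})
```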

According to the experimental results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This was tested across datasets of different formats and sizes. However, for density and log-likelihood the results are not so consistent, as LP is outperformed by DBScan at times. Finally, by the net result G_net, LP is the better choice under the overall consideration of the six performance factors. The weights, which imply priorities or preferences among the performance aspects, should be chosen at the user's discretion.

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the values collected by the sensors indicate how important each local proximity is. Given this information, the users of the sensor network may subsequently want to form spatial clusters for purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify optimal spatial groups of certain sizes and positions, using clustering algorithms or their equivalent, for obtaining maximum total coverage. Examples include, but are not limited to, setting up mobile-phone base stations among an even distribution of users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups, with different values of data resources, were then assessed via six performance factors. Weights were also formulated as factor coefficients; such factors are known to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.

The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may waste resources and even cause false grouping. However, to the best of the authors' knowledge, no study reported in the literature has used a linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation considering multiple attributes was used to assess the grouping results.
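To make the non-overlap property concrete, here is a minimal sketch of the kind of formulation an LP approach implies, under simplifying assumptions that are ours, not the paper's: candidate groups are fixed sets of grid cells, the objective maximizes the total sensor value covered, and overlap is forbidden by letting each cell be used at most once. The 0-1 selection is solved via its LP relaxation with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 1-D grid of 8 cells with sensor values.
cell_value = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

# Hypothetical candidate groups, each a set of cell indices.
groups = [
    [0, 1, 2],
    [2, 3, 4],
    [4, 5],
    [5, 6, 7],
    [6, 7],
]
group_value = np.array([cell_value[g].sum() for g in groups])

# Constraint matrix A[i, j] = 1 if group j contains cell i, so
# A @ x <= 1 means every cell is covered at most once (no overlap).
A = np.zeros((len(cell_value), len(groups)))
for j, g in enumerate(groups):
    A[g, j] = 1.0

# linprog minimizes, so negate the objective to maximize covered value.
res = linprog(c=-group_value,
              A_ub=A, b_ub=np.ones(len(cell_value)),
              bounds=[(0, 1)] * len(groups),
              method="highs")

chosen = [j for j, x in enumerate(res.x) if x > 0.5]
print("selected groups:", chosen, "total covered value:", -res.fun)
```

Because each candidate group here covers a contiguous run of cells, the constraint matrix is an interval matrix (totally unimodular), so the LP relaxation happens to return an integral, overlap-free selection; a general grid would need an integer programming solver.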

For a future extended study, we want to further enhance the algorithm, for example by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. It would be desirable for the advantages of one algorithm to compensate for the shortcomings of the others in the new fusion algorithms to be developed.


International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

16 International Journal of Distributed Sensor Networks

4

5136

8

1

Balance test on dataset 1

(a) KM

1

50

1

18

30

Balance test on dataset 1

(b) XM

6

22

24

30

18

Balance test on dataset 1

(c) EM

24

24

17

20

15

Balance test on dataset 1

(d) DBScan

18

17

22

19

25

Balance test on dataset 1

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(e) HC

20

20

20

20

20

Balance test on dataset 1

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(f) LP

Figure 17 Proportions of cluster sizes (balance) of dataset 1 in by using (a) KM (b) XM (c) EM (d) DBScan (e) HC and (f) LP

From Figure 16(a) we can see that one cluster of EMoccupies the biggest density in all clusters of the six methodsin the first dataset But the LPmethod obtains the largest totaldensity evenly from all the clusters Generally the individualdensity of each cluster in the second dataset is much biggerthan that of the first dataset (Tables 10 and 11) Again it means

that the second dataset has an even data distribution that issuitable for achieving spatial groups with high density Andin terms of total density EM is the best performer in the firstdataset but DBScan achieves the best results in the seconddataset DBScan has an advantage of merging scattered datainto density groups as long as the data are well scattered

International Journal of Distributed Sensor Networks 17

17

18

17

24

24

Balance test on dataset 2

(a) KM

24

18

24

18

17

Balance test on dataset 2

(b) XM

47

032

47

Balance test on dataset 2

(c) EM

1010

98

Balance test on dataset 2

(d) DBScan

23

15

24

18

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(e) HC

20

20

20

20

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(f) LP

Figure 18 Proportions of Cluster Sizes (Balance) of dataset 2 in by using (a) KM (b) XM (c) EM (d) DBScan (e) HC (f) LP

The last evaluation factor is balance the results areshown in Figures 17 and 18 For both datasets only LPmethod can achieve absolute balance for spatial groups com-pletely

63 Discussion of G119899119890119905 For all the six evaluation factors each

of them can be an individual measure to decide whethera method is good or not in certain aspect In general thefollowing indicators (from (5) to (11)) have been defined in

18 International Journal of Distributed Sensor Networks

Table 10 Numeric results of density of each cluster by using the six methods for dataset 1

Density KM EM DBScan XM HC LPCluster 0 5258648 0080823 4426289 3431892 2713810 1677869Cluster 1 1161390 2329182 0994949 1375497 3501739 1296230Cluster 2 7186556 2545750 0807500 1218667 2728017 9703279Cluster 3 2572683 1232386 1062069 5171040 4265905 9034426Cluster 4 5969350 142054 0170455 1510576 4088438 1239180Total density 1204343 1400359 4729787 1146972 1030703 6087049

Table 11 Numeric results of density of each cluster by using the six methods for dataset 2

Density KM XM EM DBScan HC LPCluster 0 1925445 2476642081 396813638 1972394643 5323785326 331318Cluster 1 1972395 1763496208 1502698729 1972394643 2140482869 166788Cluster 2 1408149 106489095 1629795665 1437189548 1823821619 8097989Cluster 3 3060449 6293956697 2015105986 1636350955 79912225 2474492Cluster 4 1773937 1058346213 1275299493 1212317249 6856982634 156958Total density 3896873 3486653421 6819713511 8230647036 5981503534 5440447

order to evaluate which method is an appropriate choicewhen it comes to different datasets and different usersrsquorequirements Among them the difference in balance iscontributed by the difference of grid cell number in eachcluster Meanwhile we assign each of them a proportionalweight 120596 to adjust the evaluation result 119866net The 120596 value isto be tuned by the users depending on their interests Forexample if a verywide coverage is of priority and others are ofless concern 119866

119888can take a relatively very large value or even

1 If users consider that some attributes are more importantthe corresponding weights 120596 for some factors can be largerthan the others Overall 119866net which is the sum of all factorsmultiplied by the corresponding performance indicators is anet indicator signifying how good a clustering process is byconsidering all the performance attributes

119866119897=

10038161003816100381610038161003816100381610038161003816

LikelihoodTime

10038161003816100381610038161003816100381610038161003816

(5)

119866119887=Difference of Balance

Time (6)

119866119889=DensityTime

(7)

119866119888=CoverageTime

(8)

119866119900=OverlapTime

(9)

119866net = 120596119897119866119897+ 120596119889119866119887+ 120596119889lowast 119866119889+ 120596119888119866119888+ 120596119900119866119900 (10)

Constraint 120596119897+ 120596119889+ 120596119887+ 120596119888+ 120596119900= 1 (11)

From the results of spatial grouping as experimented inthe previous sections we obtain some statistic informationon each group based on the second dataset as a range ofindicators depicted from (5) to (11) They are shown in

Table 12 which allows us to easily compare various methodsand performance aspects

In Table 12 KM method has the best run time and nooverlap For XMmethod DBScan and HC demonstrate theiradvantage in density and log-likelihood Nevertheless LPmethod is superior in three aspects coverage no overlapand zero difference of balance with other clusters In orderto further verify the correctness of the above analysis theperformance indicators 119866

119897 119866119887 119866119889 119866119888 and 119866

119900are computed

for obtaining the net performance values119866net assuming equalweights for each method For the sake of easy comparison119866net is normalized by first setting the lowest 119866net amongthe six methods as base value 1 then the 119866net for the othermethods is scaled up accordingly The comparison result isshown in Table 13

According to the experiment results conducted so farLP seems to be the best candidate in almost all the aspectssuch as coverage and balance This is tested across differentdatasets different formats and different sizes of datasetHowever for density and log-likelihood the result is not soconsistent as LP would be outperformed byDBScan at timesFinally by the net result of 119866net LP is a better choice underthe overall consideration of the six performance factorsThe choice of weights which imply priorities or preferenceson the performance aspects should be chosen by the userrsquosdiscretion

7 Conclusion and Future Works

Ubiquitous sensor network generated data that inherentlyhave spatial information When they are viewed afar thelocalizations of the data form some densities spatially dis-tributed over a terrain and the collected data from thesensors indicate how important the values are in their localproximity Given this information the users of the sensornetwork may subsequently want to form spatial clusters for

International Journal of Distributed Sensor Networks 19

Table 12 Performance indicators of the six methods based on dataset 2

Method Coverage Density Time Log-likelihood Overlap Diff of balanceKM 0595751 3896873 041 minus1735 No 190XM 0533037 3486653 067 minus1722 No 185EM 0507794 6819714 123 minus1657 Yes 1216DBScan 0461531 8230647 1567 minus1754 Yes 2517HC 0677124 5981504 1478 minus2013 Yes 103LP 0711025 5440447 776 NA No 0

Table 13 Comparison of different clustering and LP methods by119866net indicator

Methods KM XM EM DBScan HC LP119866net 108 115 111 123 100 132

purposes such as resource allocation distribution evalua-tions or summing up the geographical data into groups Thefocus of this study was to design efficient methods to identifysuch optimal spatial groups that have certain sizes andpositions using clustering algorithms or the equivalent forobtaining maximum total coverage in total Some examplesinclude but are not limited to setting up mobile phonebase stations among an even distribution of mobile phoneusers each may have different demand in usage distributedsensors that monitor the traffic volumes over a city andsecurity patrols in an exhibition where the asset values tobe protected vary and are distributed over a large area Thestudy also investigated whether spatial groups identified byusing different methods are sufficiently efficient for achievingoptimal maximum coverage Five classic spatial groupingalgorithms are discussed and compared in this study by usingdata mining software programsThe identified spatial groupswith different values of data resources were then assessedvia six performance factors Weights were also formulated asfactor coefficients The factors adopted were shown to playa significant role in MAUT (multiattribute utilities theory)The performance under proper factors and weights may varyas the factors could be arbitrarily chosen by users

The spatial groups obtained by classic clustering algo-rithms have some limits such as overlaps It may causeresource being wasted and even false grouping Howeverthere has been no study reported in the literature that theauthors are aware of using linear programming (LP) methodto discover spatial groups and to overcome this limit ofoverlappingThus in this research we implemented this newmethod (LP) to obtain spatial groups for yielding maximumcoverage and completely avoiding overlap A rigorous evalu-ation was used to assess the grouping results by consideringmultiple attributes

For future extended study we want to further enhancethe algorithm such as combining LP method with existingspatial group algorithms to achieve new hybrid algorithmSome clustering algorithms (eg KM) are known to convergequickly and LP though not the quickest it is efficient infinding the optimal groupings without any overlap It will be

good if the advantages from one algorithm to ride over theothers in the new fusion algorithms are to be developed

References

[1] G J Pottie and W J Kaiser ldquoWireless integrated network sen-sorsrdquo Communications of the ACM vol 43 no 5 pp 51ndash582000

[2] K H Eom M C Kim S J Lee and C W Lee ldquoThe vegetablefreshness monitoring system using RFID with oxygen andcarbon dioxide sensorrdquo International Journal of DistributedSensor Networks vol 2012 Article ID 472986 6 pages 2012

[3] G Manes G Collodi R Fusco L Gelpi and A Manes ldquoAwireless sensor network for precise volatile organic compoundmonitoringrdquo International Journal of Distributed Sensor Net-works vol 2012 Article ID 820716 13 pages 2012

[4] Y-G Ha H Kim and Y-C Byun ldquoEnergy-efficient fire mon-itoring over cluster-based wireless sensor networksrdquo Interna-tional Journal of Distributed Sensor Networks vol 2012 ArticleID 460754 11 pages 2012

[5] A Wahid and D Kim ldquoAn energy efficient localization-freerouting protocol for underwater wireless sensor networksrdquoInternational Journal of Distributed Sensor Networks vol 2012Article ID 307246 11 pages 2012

[6] T N Tran R Wehrens and L M C Buydens ldquoSpaRef a clus-tering algorithm for multispectral imagesrdquo Analytica Chimi-ca Acta vol 490 no 1-2 pp 303ndash312 2003

[7] G Ayala I Epifanio A Simo and V Zapater ldquoClusteringof spatial point patternsrdquo Computational Statistics and DataAnalysis vol 50 no 4 pp 1016ndash1032 2006

[8] J Domingo G Ayala and M E Dıaz ldquoMorphometric analysisof human corneal endothelium by means of spatial point pat-ternsrdquo International Journal of Pattern Recognition and ArtificialIntelligence vol 16 no 2 pp 127ndash143 2002

[9] E Demir C Aykanat and B Barla Cambazoglu ldquoClusteringspatial networks for aggregate query processing a hypergraphapproachrdquo Information Systems vol 33 no 1 pp 1ndash17 2008

[10] T Hu and S Y Sung ldquoA hybrid EM approach to spatial clus-teringrdquo Computational Statistics and Data Analysis vol 50 no5 pp 1188ndash1205 2006

[11] G Lin ldquoComparing spatial clustering tests based on rare tocommon spatial eventsrdquo Computers Environment and UrbanSystems vol 28 no 6 pp 691ndash699 2004

[12] M Ester and H-P Kriegel ldquoClustering for mining in largespatial databases [Special Issue on Data Mining]rdquo KI-Journalvol 1 pp 332ndash338 1998

[13] J Han M Kamber and A K H Tung ldquoSpatial clusteringmethods in data mining a surveyrdquo Tech Rep ComputerScience Simon Fraster University 2000

20 International Journal of Distributed Sensor Networks

[14] H-D Yang and F-Q Deng ldquoThe study on immune spatialclustering model based on obstaclerdquo in Proceedings of theInternational Conference on Machine Learning and Cyberneticsvol 2 pp 1214ndash1219 August 2004

[15] T-S Chen T-H Tsai Y-T Chen et al ldquoA combined K-meansand hierarchical clusteringmethod for improving the clusteringefficiency of microarrayrdquo in Proceedings of the InternationalSymposium on Intelligent Signal Processing and CommunicationSystems (ISPACS rsquo05) pp 405ndash408 HongKong China Decem-ber 2005

[16] M Srinivas and C K Mohan ldquoEfficient clustering approachusing incremental and hierarchical clustering methodsrdquo inProceedings of the International Joint Conference on NeuralNetworks (IJCNN rsquo10) pp 1ndash7 July 2010

[17] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[18] A Hinneburg and D A Keim ldquoAn efficient approach to clus-tering in large multimedia databases with noiserdquo in Proceedingsof the International Conference Knowledge Discovery and DataMining pp 58ndash65 1998

[19] K Elangovan GIS Fundamentals Applications and Implemen-tations 2006

[20] S Chawla and S Shekhar ldquoModeling spatial dependencies formining geospatial data an introductionrdquo Geographic DataMining and Knowledge Discovery vol 75 no 6 pp 112ndash1201999

[21] M-Y Cheng and G-L Chang ldquoAutomating utility route designand planning throughGISrdquoAutomation in Construction vol 10no 4 pp 507ndash516 2001

[22] Q Cao B Bouqata P D Mackenzie D Messier and J J SalvoldquoA grid-based clusteringmethod formining frequent trips fromlarge-scale event-based telematics datasetsrdquo in Proceedingsof the IEEE International Conference on Systems Man andCybernetics (SMC rsquo09) pp 2996ndash3001 San Antonio Tex USAOctober 2009

[23] K Krishna and M N Murty ldquoGenetic K-means algorithmrdquoIEEE Transactions on Systems Man and Cybernetics B vol 29no 3 pp 433ndash439 1999

[24] D Pelleg and A W Moore ldquoX-means extending KM withefficient estimation of the number of clustersrdquo in Proceedingsof the 70th International Conference on Machine Learning pp727ndash734 2000

[25] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining pp 226ndash231 1996

[26] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963

[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

International Journal of Distributed Sensor Networks 17

17

18

17

24

24

Balance test on dataset 2

(a) KM

24

18

24

18

17

Balance test on dataset 2

(b) XM

47

032

47

Balance test on dataset 2

(c) EM

1010

98

Balance test on dataset 2

(d) DBScan

23

15

24

18

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(e) HC

20

20

20

20

20

Balance test on dataset 2

Cluster 1Cluster 2Cluster 3

Cluster 4Cluster 5

(f) LP

Figure 18 Proportions of Cluster Sizes (Balance) of dataset 2 in by using (a) KM (b) XM (c) EM (d) DBScan (e) HC (f) LP

The last evaluation factor is balance the results areshown in Figures 17 and 18 For both datasets only LPmethod can achieve absolute balance for spatial groups com-pletely

63 Discussion of G119899119890119905 For all the six evaluation factors each

of them can be an individual measure to decide whethera method is good or not in certain aspect In general thefollowing indicators (from (5) to (11)) have been defined in

18 International Journal of Distributed Sensor Networks

Table 10 Numeric results of density of each cluster by using the six methods for dataset 1

Density KM EM DBScan XM HC LPCluster 0 5258648 0080823 4426289 3431892 2713810 1677869Cluster 1 1161390 2329182 0994949 1375497 3501739 1296230Cluster 2 7186556 2545750 0807500 1218667 2728017 9703279Cluster 3 2572683 1232386 1062069 5171040 4265905 9034426Cluster 4 5969350 142054 0170455 1510576 4088438 1239180Total density 1204343 1400359 4729787 1146972 1030703 6087049

Table 11 Numeric results of density of each cluster by using the six methods for dataset 2

Density KM XM EM DBScan HC LPCluster 0 1925445 2476642081 396813638 1972394643 5323785326 331318Cluster 1 1972395 1763496208 1502698729 1972394643 2140482869 166788Cluster 2 1408149 106489095 1629795665 1437189548 1823821619 8097989Cluster 3 3060449 6293956697 2015105986 1636350955 79912225 2474492Cluster 4 1773937 1058346213 1275299493 1212317249 6856982634 156958Total density 3896873 3486653421 6819713511 8230647036 5981503534 5440447

order to evaluate which method is an appropriate choicewhen it comes to different datasets and different usersrsquorequirements Among them the difference in balance iscontributed by the difference of grid cell number in eachcluster Meanwhile we assign each of them a proportionalweight 120596 to adjust the evaluation result 119866net The 120596 value isto be tuned by the users depending on their interests Forexample if a verywide coverage is of priority and others are ofless concern 119866

119888can take a relatively very large value or even

1 If users consider that some attributes are more importantthe corresponding weights 120596 for some factors can be largerthan the others Overall 119866net which is the sum of all factorsmultiplied by the corresponding performance indicators is anet indicator signifying how good a clustering process is byconsidering all the performance attributes

119866119897=

10038161003816100381610038161003816100381610038161003816

LikelihoodTime

10038161003816100381610038161003816100381610038161003816

(5)

119866119887=Difference of Balance

Time (6)

119866119889=DensityTime

(7)

119866119888=CoverageTime

(8)

119866119900=OverlapTime

(9)

119866net = 120596119897119866119897+ 120596119889119866119887+ 120596119889lowast 119866119889+ 120596119888119866119888+ 120596119900119866119900 (10)

Constraint 120596119897+ 120596119889+ 120596119887+ 120596119888+ 120596119900= 1 (11)

From the results of spatial grouping as experimented inthe previous sections we obtain some statistic informationon each group based on the second dataset as a range ofindicators depicted from (5) to (11) They are shown in

Table 12 which allows us to easily compare various methodsand performance aspects

In Table 12 KM method has the best run time and nooverlap For XMmethod DBScan and HC demonstrate theiradvantage in density and log-likelihood Nevertheless LPmethod is superior in three aspects coverage no overlapand zero difference of balance with other clusters In orderto further verify the correctness of the above analysis theperformance indicators 119866

119897 119866119887 119866119889 119866119888 and 119866

119900are computed

for obtaining the net performance values119866net assuming equalweights for each method For the sake of easy comparison119866net is normalized by first setting the lowest 119866net amongthe six methods as base value 1 then the 119866net for the othermethods is scaled up accordingly The comparison result isshown in Table 13

According to the experiment results conducted so farLP seems to be the best candidate in almost all the aspectssuch as coverage and balance This is tested across differentdatasets different formats and different sizes of datasetHowever for density and log-likelihood the result is not soconsistent as LP would be outperformed byDBScan at timesFinally by the net result of 119866net LP is a better choice underthe overall consideration of the six performance factorsThe choice of weights which imply priorities or preferenceson the performance aspects should be chosen by the userrsquosdiscretion

7 Conclusion and Future Works

Ubiquitous sensor network generated data that inherentlyhave spatial information When they are viewed afar thelocalizations of the data form some densities spatially dis-tributed over a terrain and the collected data from thesensors indicate how important the values are in their localproximity Given this information the users of the sensornetwork may subsequently want to form spatial clusters for

International Journal of Distributed Sensor Networks 19

Table 12 Performance indicators of the six methods based on dataset 2

Method Coverage Density Time Log-likelihood Overlap Diff of balanceKM 0595751 3896873 041 minus1735 No 190XM 0533037 3486653 067 minus1722 No 185EM 0507794 6819714 123 minus1657 Yes 1216DBScan 0461531 8230647 1567 minus1754 Yes 2517HC 0677124 5981504 1478 minus2013 Yes 103LP 0711025 5440447 776 NA No 0

Table 13 Comparison of different clustering and LP methods by119866net indicator

Methods KM XM EM DBScan HC LP119866net 108 115 111 123 100 132

purposes such as resource allocation distribution evalua-tions or summing up the geographical data into groups Thefocus of this study was to design efficient methods to identifysuch optimal spatial groups that have certain sizes andpositions using clustering algorithms or the equivalent forobtaining maximum total coverage in total Some examplesinclude but are not limited to setting up mobile phonebase stations among an even distribution of mobile phoneusers each may have different demand in usage distributedsensors that monitor the traffic volumes over a city andsecurity patrols in an exhibition where the asset values tobe protected vary and are distributed over a large area Thestudy also investigated whether spatial groups identified byusing different methods are sufficiently efficient for achievingoptimal maximum coverage Five classic spatial groupingalgorithms are discussed and compared in this study by usingdata mining software programsThe identified spatial groupswith different values of data resources were then assessedvia six performance factors Weights were also formulated asfactor coefficients The factors adopted were shown to playa significant role in MAUT (multiattribute utilities theory)The performance under proper factors and weights may varyas the factors could be arbitrarily chosen by users

The spatial groups obtained by classic clustering algo-rithms have some limits such as overlaps It may causeresource being wasted and even false grouping Howeverthere has been no study reported in the literature that theauthors are aware of using linear programming (LP) methodto discover spatial groups and to overcome this limit ofoverlappingThus in this research we implemented this newmethod (LP) to obtain spatial groups for yielding maximumcoverage and completely avoiding overlap A rigorous evalu-ation was used to assess the grouping results by consideringmultiple attributes

For future extended study we want to further enhancethe algorithm such as combining LP method with existingspatial group algorithms to achieve new hybrid algorithmSome clustering algorithms (eg KM) are known to convergequickly and LP though not the quickest it is efficient infinding the optimal groupings without any overlap It will be

good if the advantages from one algorithm to ride over theothers in the new fusion algorithms are to be developed

References

[1] G J Pottie and W J Kaiser ldquoWireless integrated network sen-sorsrdquo Communications of the ACM vol 43 no 5 pp 51ndash582000

[2] K H Eom M C Kim S J Lee and C W Lee ldquoThe vegetablefreshness monitoring system using RFID with oxygen andcarbon dioxide sensorrdquo International Journal of DistributedSensor Networks vol 2012 Article ID 472986 6 pages 2012

[3] G Manes G Collodi R Fusco L Gelpi and A Manes ldquoAwireless sensor network for precise volatile organic compoundmonitoringrdquo International Journal of Distributed Sensor Net-works vol 2012 Article ID 820716 13 pages 2012

[4] Y-G Ha H Kim and Y-C Byun ldquoEnergy-efficient fire mon-itoring over cluster-based wireless sensor networksrdquo Interna-tional Journal of Distributed Sensor Networks vol 2012 ArticleID 460754 11 pages 2012

[5] A Wahid and D Kim ldquoAn energy efficient localization-freerouting protocol for underwater wireless sensor networksrdquoInternational Journal of Distributed Sensor Networks vol 2012Article ID 307246 11 pages 2012

[6] T N Tran R Wehrens and L M C Buydens ldquoSpaRef a clus-tering algorithm for multispectral imagesrdquo Analytica Chimi-ca Acta vol 490 no 1-2 pp 303ndash312 2003

[7] G Ayala I Epifanio A Simo and V Zapater ldquoClusteringof spatial point patternsrdquo Computational Statistics and DataAnalysis vol 50 no 4 pp 1016ndash1032 2006

[8] J Domingo G Ayala and M E Dıaz ldquoMorphometric analysisof human corneal endothelium by means of spatial point pat-ternsrdquo International Journal of Pattern Recognition and ArtificialIntelligence vol 16 no 2 pp 127ndash143 2002

[9] E Demir C Aykanat and B Barla Cambazoglu ldquoClusteringspatial networks for aggregate query processing a hypergraphapproachrdquo Information Systems vol 33 no 1 pp 1ndash17 2008

[10] T Hu and S Y Sung ldquoA hybrid EM approach to spatial clus-teringrdquo Computational Statistics and Data Analysis vol 50 no5 pp 1188ndash1205 2006

[11] G Lin ldquoComparing spatial clustering tests based on rare tocommon spatial eventsrdquo Computers Environment and UrbanSystems vol 28 no 6 pp 691ndash699 2004

[12] M Ester and H-P Kriegel ldquoClustering for mining in largespatial databases [Special Issue on Data Mining]rdquo KI-Journalvol 1 pp 332ndash338 1998

[13] J Han M Kamber and A K H Tung ldquoSpatial clusteringmethods in data mining a surveyrdquo Tech Rep ComputerScience Simon Fraster University 2000

20 International Journal of Distributed Sensor Networks

[14] H-D Yang and F-Q Deng ldquoThe study on immune spatialclustering model based on obstaclerdquo in Proceedings of theInternational Conference on Machine Learning and Cyberneticsvol 2 pp 1214ndash1219 August 2004

[15] T-S Chen T-H Tsai Y-T Chen et al ldquoA combined K-meansand hierarchical clusteringmethod for improving the clusteringefficiency of microarrayrdquo in Proceedings of the InternationalSymposium on Intelligent Signal Processing and CommunicationSystems (ISPACS rsquo05) pp 405ndash408 HongKong China Decem-ber 2005

[16] M Srinivas and C K Mohan ldquoEfficient clustering approachusing incremental and hierarchical clustering methodsrdquo inProceedings of the International Joint Conference on NeuralNetworks (IJCNN rsquo10) pp 1ndash7 July 2010

[17] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[18] A Hinneburg and D A Keim ldquoAn efficient approach to clus-tering in large multimedia databases with noiserdquo in Proceedingsof the International Conference Knowledge Discovery and DataMining pp 58ndash65 1998

[19] K Elangovan GIS Fundamentals Applications and Implemen-tations 2006

[20] S Chawla and S Shekhar ldquoModeling spatial dependencies formining geospatial data an introductionrdquo Geographic DataMining and Knowledge Discovery vol 75 no 6 pp 112ndash1201999

[21] M-Y Cheng and G-L Chang ldquoAutomating utility route designand planning throughGISrdquoAutomation in Construction vol 10no 4 pp 507ndash516 2001

[22] Q Cao B Bouqata P D Mackenzie D Messier and J J SalvoldquoA grid-based clusteringmethod formining frequent trips fromlarge-scale event-based telematics datasetsrdquo in Proceedingsof the IEEE International Conference on Systems Man andCybernetics (SMC rsquo09) pp 2996ndash3001 San Antonio Tex USAOctober 2009

[23] K Krishna and M N Murty ldquoGenetic K-means algorithmrdquoIEEE Transactions on Systems Man and Cybernetics B vol 29no 3 pp 433ndash439 1999

[24] D Pelleg and A W Moore ldquoX-means extending KM withefficient estimation of the number of clustersrdquo in Proceedingsof the 70th International Conference on Machine Learning pp727ndash734 2000

[25] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining pp 226ndash231 1996

[26] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998

[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963

[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

18 International Journal of Distributed Sensor Networks

Table 10 Numeric results of density of each cluster by using the six methods for dataset 1

Density KM EM DBScan XM HC LPCluster 0 5258648 0080823 4426289 3431892 2713810 1677869Cluster 1 1161390 2329182 0994949 1375497 3501739 1296230Cluster 2 7186556 2545750 0807500 1218667 2728017 9703279Cluster 3 2572683 1232386 1062069 5171040 4265905 9034426Cluster 4 5969350 142054 0170455 1510576 4088438 1239180Total density 1204343 1400359 4729787 1146972 1030703 6087049

Table 11 Numeric results of density of each cluster by using the six methods for dataset 2

Density KM XM EM DBScan HC LPCluster 0 1925445 2476642081 396813638 1972394643 5323785326 331318Cluster 1 1972395 1763496208 1502698729 1972394643 2140482869 166788Cluster 2 1408149 106489095 1629795665 1437189548 1823821619 8097989Cluster 3 3060449 6293956697 2015105986 1636350955 79912225 2474492Cluster 4 1773937 1058346213 1275299493 1212317249 6856982634 156958Total density 3896873 3486653421 6819713511 8230647036 5981503534 5440447

order to evaluate which method is an appropriate choicewhen it comes to different datasets and different usersrsquorequirements Among them the difference in balance iscontributed by the difference of grid cell number in eachcluster Meanwhile we assign each of them a proportionalweight 120596 to adjust the evaluation result 119866net The 120596 value isto be tuned by the users depending on their interests Forexample if a verywide coverage is of priority and others are ofless concern 119866

119888can take a relatively very large value or even

1 If users consider that some attributes are more importantthe corresponding weights 120596 for some factors can be largerthan the others Overall 119866net which is the sum of all factorsmultiplied by the corresponding performance indicators is anet indicator signifying how good a clustering process is byconsidering all the performance attributes

119866119897=

10038161003816100381610038161003816100381610038161003816

LikelihoodTime

10038161003816100381610038161003816100381610038161003816

(5)

119866119887=Difference of Balance

Time (6)

119866119889=DensityTime

(7)

119866119888=CoverageTime

(8)

119866119900=OverlapTime

(9)

119866net = 120596119897119866119897+ 120596119889119866119887+ 120596119889lowast 119866119889+ 120596119888119866119888+ 120596119900119866119900 (10)

Constraint 120596119897+ 120596119889+ 120596119887+ 120596119888+ 120596119900= 1 (11)

From the results of spatial grouping as experimented inthe previous sections we obtain some statistic informationon each group based on the second dataset as a range ofindicators depicted from (5) to (11) They are shown in

Table 12 which allows us to easily compare various methodsand performance aspects

In Table 12 KM method has the best run time and nooverlap For XMmethod DBScan and HC demonstrate theiradvantage in density and log-likelihood Nevertheless LPmethod is superior in three aspects coverage no overlapand zero difference of balance with other clusters In orderto further verify the correctness of the above analysis theperformance indicators 119866

119897 119866119887 119866119889 119866119888 and 119866

119900are computed

for obtaining the net performance values119866net assuming equalweights for each method For the sake of easy comparison119866net is normalized by first setting the lowest 119866net amongthe six methods as base value 1 then the 119866net for the othermethods is scaled up accordingly The comparison result isshown in Table 13

According to the experiment results conducted so farLP seems to be the best candidate in almost all the aspectssuch as coverage and balance This is tested across differentdatasets different formats and different sizes of datasetHowever for density and log-likelihood the result is not soconsistent as LP would be outperformed byDBScan at timesFinally by the net result of 119866net LP is a better choice underthe overall consideration of the six performance factorsThe choice of weights which imply priorities or preferenceson the performance aspects should be chosen by the userrsquosdiscretion

7 Conclusion and Future Works

Ubiquitous sensor network generated data that inherentlyhave spatial information When they are viewed afar thelocalizations of the data form some densities spatially dis-tributed over a terrain and the collected data from thesensors indicate how important the values are in their localproximity Given this information the users of the sensornetwork may subsequently want to form spatial clusters for

International Journal of Distributed Sensor Networks 19

Table 12 Performance indicators of the six methods based on dataset 2

Method Coverage Density Time Log-likelihood Overlap Diff of balanceKM 0595751 3896873 041 minus1735 No 190XM 0533037 3486653 067 minus1722 No 185EM 0507794 6819714 123 minus1657 Yes 1216DBScan 0461531 8230647 1567 minus1754 Yes 2517HC 0677124 5981504 1478 minus2013 Yes 103LP 0711025 5440447 776 NA No 0

Table 13 Comparison of different clustering and LP methods by119866net indicator

Methods KM XM EM DBScan HC LP119866net 108 115 111 123 100 132

purposes such as resource allocation distribution evalua-tions or summing up the geographical data into groups Thefocus of this study was to design efficient methods to identifysuch optimal spatial groups that have certain sizes andpositions using clustering algorithms or the equivalent forobtaining maximum total coverage in total Some examplesinclude but are not limited to setting up mobile phonebase stations among an even distribution of mobile phoneusers each may have different demand in usage distributedsensors that monitor the traffic volumes over a city andsecurity patrols in an exhibition where the asset values tobe protected vary and are distributed over a large area Thestudy also investigated whether spatial groups identified byusing different methods are sufficiently efficient for achievingoptimal maximum coverage Five classic spatial groupingalgorithms are discussed and compared in this study by usingdata mining software programsThe identified spatial groupswith different values of data resources were then assessedvia six performance factors Weights were also formulated asfactor coefficients The factors adopted were shown to playa significant role in MAUT (multiattribute utilities theory)The performance under proper factors and weights may varyas the factors could be arbitrarily chosen by users

The spatial groups obtained by classic clustering algo-rithms have some limits such as overlaps It may causeresource being wasted and even false grouping Howeverthere has been no study reported in the literature that theauthors are aware of using linear programming (LP) methodto discover spatial groups and to overcome this limit ofoverlappingThus in this research we implemented this newmethod (LP) to obtain spatial groups for yielding maximumcoverage and completely avoiding overlap A rigorous evalu-ation was used to assess the grouping results by consideringmultiple attributes

For a future extended study, we want to further enhance the algorithm, for example, by combining the LP method with existing spatial grouping algorithms to form new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It would be desirable for the advantages of one algorithm to carry over to the others in the new fusion algorithms to be developed.
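A possible shape for such a hybrid, sketched under our own assumptions rather than taken from any published design: K-means quickly proposes candidate group centres, and a small integer program then keeps the most valuable mutually non-overlapping subset. scikit-learn and PuLP are assumed tool choices, and the point data are randomly generated for illustration.

```python
import itertools
import math

import numpy as np
import pulp
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.uniform(0, 10, size=(200, 2))  # hypothetical sensor positions
radius = 1.5                                # assumed group radius

# Step 1: K-means converges quickly and proposes well-placed candidates.
centres = KMeans(n_clusters=8, n_init=10, random_state=0).fit(points).cluster_centers_

# Step 2: a small integer program keeps the most valuable mutually
# non-overlapping candidates (value = number of points covered here).
C = range(len(centres))
value = [int(np.sum(np.linalg.norm(points - centres[c], axis=1) <= radius))
         for c in C]

prob = pulp.LpProblem("hybrid_km_lp", pulp.LpMaximize)
use = pulp.LpVariable.dicts("use", C, cat="Binary")
prob += pulp.lpSum(value[c] * use[c] for c in C)
for a, b in itertools.combinations(C, 2):
    if math.dist(centres[a], centres[b]) < 2 * radius:  # circles would overlap
        prob += use[a] + use[b] <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("kept centres:",
      [tuple(np.round(centres[c], 2)) for c in C if use[c].value() == 1])
```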

References

[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.

[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.

[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.

[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.

[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.

[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.

[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.

[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.

[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.

[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.

[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.

[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.

[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.


[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.

[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.

[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.

[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.

[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.

[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.

[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.

[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale, event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.

[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.

[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.

[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.

[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.

[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.

[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.
