Hindawi Publishing Corporation, International Journal of Distributed Sensor Networks, Volume 2013, Article ID 763027, 20 pages. http://dx.doi.org/10.1155/2013/763027
Research Article: Identifying Optimal Spatial Groups for Maximum Coverage in Ubiquitous Sensor Network by Using Clustering Algorithms
Simon Fong,1 Weng Fai Ip,1 Elaine Liu,1 and Kyungeun Cho2
1 Department of Computer and Information Science, University of Macau, Macau
2 Department of Multimedia Engineering, Dongguk University-Seoul, Seoul 100-715, Republic of Korea
Correspondence should be addressed to Simon Fong: ccfong@umac.mo
Received 23 March 2013 Accepted 2 June 2013
Academic Editor Sabah Mohammed
Copyright © 2013 Simon Fong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Ubiquitous sensor networks have a history of applications varying from monitoring troop movement during battles in WWII to measuring traffic flows on modern highways. In particular, there lies a computational challenge in how these data can be efficiently processed for real-time intelligence. Given the data collected from ubiquitous sensor networks that have different densities distributed over a large geographical area, one can see how separate groups could be formed over them in order to maximize the total coverage by these groups. The applications could be either destructive or constructive in nature. For example, a jet-fighter pilot needs to make a critical real-time decision in a split second to locate several separate targets to hit (assuming limited weapon payloads) in order to cause maximum damage when flying over enemy terrain; a town planner considers where to station certain resources (sites for schools, hospitals, security patrol route planning, airborne food ration drops for humanitarian aid, etc.) for maximum effect over a vast area of different densities, for benevolent purposes. This paper explores this problem via optimal "spatial group" clustering. Simulation experiments using clustering algorithms and linear programming are conducted to evaluate their effectiveness comparatively.
1 Introduction
Ubiquitous sensor network is a kind of wireless sensor technology [1] that has sensors distributed far and wide, usually covering a large geographical area such as a forest, a battlefield, or the road networks of an urban city. A few successful case scenarios have been reported in the literature, such as monitoring vegetable freshness by using oxygen and carbon dioxide sensors in farms [2], chemical leak detection in hazardous sites [3], general-purpose sensor networks that monitor fire [4], and operation underwater [5]. What these applications have in common is the need for a postprocessing step that crunches over the data, possibly in real time, to make a quick and accurate prediction out of the analysis.
In this paper we consider a special case of postprocessing of such ubiquitous sensor network data. Given a vast distribution of sensors, each of which collects some information about its local proximity, some groups or clusters are to be formed over them. The groups should be formed in such a way that the total overall "value" across all the groups is maximized. The value(s), which should be part of the attribute information collected by the sensors, may be something of the user's concern. The values usually represent the density of the proximity where a sensor stands, for example, the concentration of some chemical gas, traffic volume, importance of a military target, or even head counts of cattle or humans.
Intuitively, one would prefer the groups to be centered on the most valuable values over the area; the groups should not overlap much with each other, lest the overlapped effect be cancelled out or wasted in vain. Here some reasonable assumptions have to hold: each group has a limited diameter of effect; each group is in the shape of a concentric circle; the areas that the circles (groups) cover sum up to a total coverage, aka the maximum net effect; and we can form only a limited number of such circles. This is an interesting mathematical problem, and it has a significant impact on ubiquitous sensor network applications. It not only
determines how we should distribute the sensors but also, after the deployment, how these logical groups are formed, possibly for further applications.
For the experiments we applied several clustering algorithms; the chosen algorithms are classical and popular in the data mining research community. The effectiveness of the different clustering algorithms is measured for comparison. However, none of the clustering algorithms achieves the best results. In the end, we develop a simple and novel method based on linear programming for optimization, which we call LP. LP is shown to achieve optimal grouping over different configurations and cases in the experiments.
The contribution of the paper is an in-depth investigation into the grouping problem that arises right after the deployment of a ubiquitous sensor network. We propose a novel solution to achieve optimal groups by using linear programming, after putting several clustering algorithms to the test.
The remainder of the paper is structured as follows. Section 2 introduces the background techniques of spatial clustering. Section 3 surveys spatial data representation, that is, how spatial data are encoded for postprocessing. Section 4 describes our methodology for obtaining optimal groups over spatial data. Section 5 reports on the experiments. Section 6 analyses and compares the experimental results. Section 7 concludes the paper.
2 Overview of Spatial Data Clustering Techniques
Clustering is the organization of a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure [6]. Spatial data clustering has numerous applications in pattern recognition [7, 8], spatial data analysis [9-11], market research, and so forth [12, 13], where data are gathered to find irregular models and special patterns among geographical datasets. Spatial data clustering is an important instrument of spatial data mining, which has become a powerful tool for the efficient and complex analysis of huge spatial databases [12] with geometric features. Conventional clustering algorithms are classified into four types: partitioning methods, hierarchical methods, density-based methods, and grid-based methods.
However, with the extension of research objects and scope, shortcomings have been discovered: many existing spatial clustering algorithms cannot cluster reliably in the presence of irregular obstacles. A grid-density-based hierarchical clustering (HC) algorithm has been proposed to tackle this problem; the advantage of grid-based clustering is a reduced amount of calculation. An alternative approach [14] can effectively form clusters in the presence of obstacles, with arbitrarily shaped clusters. Moreover, the hierarchical strategy is used to reduce the complexity in the presence of obstacles and constraints and to improve operational efficiency [13]; the result is that it can handle spatial clustering in the face of obstacles and constraints with better performance. When some data points do not fit in any cluster under density clustering, this situation was managed by using grid-based HC instead in this study. Each clustering algorithm has its individual advantages and disadvantages.
The partitioning approach separates N objects into m groups, where m satisfies the following constraints: firstly, each group contains at least one object; secondly, each object must belong to exactly one group. To achieve a global optimum in grouping, it would be necessary to enumerate all possible partitions; instead, most applications adopt K-means (KM), K-medoids, or fuzzy analysis. However, this partitioning method has problems when applied to spatial mining of clustered objects, especially objects that are obstructed by environmental conditions. With an obstacle such as a river, it is hard to recognize comparability.
Although the clustering step itself can be carried out regardless of the number of objects, cluster analysis algorithms in general cannot deal with large datasets; it is recommended that the number of objects handled by this method be no more than 1000. It is a stochastic search based on partitioning, with low efficiency, and the capability of this method is much affected by the random selection of the initial values [14].
HC is another popular clustering method, which is more flexible than partitioning-based clustering but has a higher time complexity. HC algorithms create a hierarchical decomposition (aka dendrogram) of a dataset based on some criterion [15, 16]. According to the rule of generating the hierarchical decomposition, there are two types of HC methods: agglomerative and divisive. An agglomerative algorithm starts from the leaves and combines clusters in a bottom-up way. A divisive algorithm starts at the root and recursively separates the clusters in a top-down way. The process continues until a stopping criterion is met, usually when the required k clusters are obtained. However, this hierarchical method has some problems: the termination criteria are vague, and once a step is complete it cannot be revoked.
Density-based methods take a different view: as long as the density (the number of objects or data points) in the neighborhood exceeds a certain threshold, the cluster keeps growing [17]. In other words, for a given point in a cluster, its neighborhood of a given radius must contain at least a minimum number of points. As a result, noisy data can be filtered out, and better clusters with arbitrary shapes can be found. DBScan and its extension, called OPTICS, are two classical density-based methods; they perform clustering based on density-based connectivity.
The grid-based method quantizes the object space into a restricted number of cells, forming a grid structure. All clustering operations take place on the grid structure (i.e., the quantized space). The main benefit of this method is its high speed; the run time is usually not restricted by the data size and depends only on the number of cells in each dimension. The algorithm STING (statistical information grid-based method) [18] works with numerical attributes (spatial data) and is designed to facilitate "region-oriented" queries.
Nevertheless, the spatial groups obtained by classic algorithms have certain limitations: overlaps cannot be controlled, and the maximum coverage by the resultant groups is not guaranteed. Overlaps lead to resource waste and potentially resource mismatch. Besides spatial clustering, this situation occurs in other fields of application, such as information retrieval (several themes for a single document), biological data (several metabolic functions for one gene), and military purposes (discovering object-dense regions independently). However, there has been no study reported in the literature, that the authors are aware of, that applies an LP method to discover spatial groups free of the limitations inherited from clustering algorithms. Thus, this research provides an alternative method to achieve spatial groups for maximum coverage in a real environment. Maximum coverage in this context is defined as the greatest possible area of effect covered by the spatial groups, with no or minimal overlaps among the groups.
3 Spatial Data Representation
Two main categories of spatial data representation exist: spatial data and attribute data. Spatial data refers to georeferenced data, such as maps, photographs, and satellite imagery. Though these representation techniques originated from GIS, the underlying coding formats are equally applicable to wireless sensor networks, as long as the sensors are distributed over a wide spatial area. Generally, spatial data represent geographic features in complete and relative locations. Attribute data represent the characteristics of spatial features, which can be quantitative and/or qualitative in the real world. Attribute data are often referred to as tabular data. In our experiments we test both types of data models against different clustering algorithms for a thorough investigation.
3.1 Spatial Data Model. In the early days, spatial data were stored and represented in a map format. There are three fundamental types of spatial data models for recording geographic data digitally: vector, raster, and image.
Figure 1 illustrates the encoding techniques of the two important spatial data models [19], raster and vector, over a sample aerial image of the Adriatic Sea and coast in Italy. The image type of encoding is very similar to raster data in terms of usable techniques, but it is limited in its internal formats when it comes to modeling and analysis of the data. Images represent photographs or pictures of the landscape as a coarse matrix of pixel values.
3.2 Vector Data Model. The three kinds of aforementioned spatial data models are used to store geographic locations with spatial features in a dataset. The vector data model uses x, y coordinates to define the locations of features; thereafter, they mark points, lines, areas, or polygons. Therefore, vector data tend to define centers, edges, and outlines of features. It characterizes features by linear segments using sequential points or vertices, where a vertex consists of a pair of x and y coordinates. The beginning or ending of a node is defined in each vertex with an arc segment. A single coordinate pair of vertices defines a feature point, and a group of coordinate pairs defines polygonal features. In vector representation, the storage of the vertices for each feature is important, as well as the connectivity between features and the sharing of common vertices where features connect.
By using same-size polygons, we divide a complete map into small units based on the character of our database, represented as (x, y, v), where x and y form a coordinate pair that references the spatial position and v represents something of interest, or simply a "feature," which could be a military target, a critical resource, or an inhabitant clan, for example. The greater the v, the more valuable the feature. In spatial grouping for maximum coverage, we opt to include the features that amount to the highest total value. A sample of the vector format that represents a spatial location in 2D is shown in Figure 2 [19].
3.3 Raster Data Model. Raster data models make use of a grid of squares to define where features are located. These squares, also called pixels or cells, typically are of uniform size.
From our dataset, we separate the whole image by imposing a grid on it, hence producing many individual features, with one feature corresponding to each cell. We consider using the raster data model to represent the dataset, and we store the features in the following encoding formats.
(1) Raster data are stored as an ordered list of cell values in pairs (i, v), where i is the sequential index of the cell and v is the value of the ith feature, for example, (1, 80), (2, 80), (3, 74), (4, 62), (5, 45), and so on, as shown in Figure 3.
(2) Raster data are stored as points (x, y, v), with x and y as position coordinates locating the corresponding spatial feature with value v, for example, (1, 1, 513), (1, 2, 514), (1, 3, 517), (2, 1, 512), (2, 2, 515), and so on, as shown in Figure 4. In this case, the value v refers to the center point of the grid cell. This encoding is useful for representing values measured at the center point of a cell, for example, a raster of elevations.
(3) During the experiments, the grid size is transformed for efficient operation, so we put i² cells together as one unit representing one new grid cell, as shown in Figure 5.
In particular, the quadtree data structure is found to be a useful alternative encoding method for storing the data of the raster model. Raster embraces digital aerial photographs, imagery from satellites, digital pictures, and even scanned maps. Details on how different sorts of objects, such as points, lines, polygons, and terrain, are represented by the data models can be found in [19-21].
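As a minimal sketch of the encodings described above, the following Python fragment shows the ordered-list format (1), the point format (2), and the cell-merging step (3). The function names are our own, and the sample grid reuses values from Figure 3 for illustration only.

```python
# Sketch of the raster encodings described above (function names are ours).
# A 2-D grid of feature values is stored either as an ordered list of
# (cell_index, value) pairs or as (x, y, value) triples; coarsen() merges
# n*n cells into one unit, as in encoding (3).

def encode_ordered(grid):
    """Encoding (1): ordered list of (i, v), with i a 1-based cell index."""
    flat = [v for row in grid for v in row]
    return [(i + 1, v) for i, v in enumerate(flat)]

def encode_points(grid):
    """Encoding (2): (x, y, v) triples with 1-based row/column coordinates."""
    return [(x + 1, y + 1, v)
            for x, row in enumerate(grid)
            for y, v in enumerate(row)]

def coarsen(grid, n):
    """Encoding (3): merge n*n cells into one new cell by summing values."""
    rows, cols = len(grid), len(grid[0])
    return [[sum(grid[r][c]
                 for r in range(i, min(i + n, rows))
                 for c in range(j, min(j + n, cols)))
             for j in range(0, cols, n)]
            for i in range(0, rows, n)]

grid = [[80, 74], [80, 74]]
print(encode_ordered(grid))  # [(1, 80), (2, 74), (3, 80), (4, 74)]
print(encode_points(grid))   # [(1, 1, 80), (1, 2, 74), (2, 1, 80), (2, 2, 74)]
print(coarsen(grid, 2))      # [[308]]
```

Whether merged cells should hold the sum, mean, or maximum of their members depends on what the feature value represents; summing suits density-like values such as traffic counts.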
4 Proposed Methodology
The aim of the methodology is to determine a certain number of clusters and their corresponding locations from some
Figure 1: Representation of how a real-world spatial area is represented by vector and raster encoding formats (panels: real-world image, vector, raster).
Figure 2: Vector format (a grid with row/column axes, width and height from 0 to 13; the marked point has x, y coordinates (9, 3)).
80 74 62 45 45 34 39 56
80 74 74 62 45 34 39 56
74 74 62 62 45 34 39 39
62 62 45 45 34 34 34 39
45 45 45 34 34 30 34 39

Figure 3: Raster format as an ordered list of cell values.
collected spatial data In this process different methods aretested for choosing the one which covers themost area as wellas the highest feature values from the suggested clustersTheflow of this process including preprocessing of sensor data
Figure 4: Raster data with center-point values (cell values such as 515, 519, 521, ..., each referenced to the center point of its grid cell).
Figure 5: Raster format with 2² and 3² grids (the cell values of Figure 3 regrouped into coarser units).
data transformation, clustering, and finding cluster center points, is shown in Figure 6.
In the case of a satellite image, or an image captured by a fighter jet or other surveillance camera, image processing is needed to
Figure 6: Workflow of the proposed methodology. Preprocessing of image: load spatial image; RGB image to gray image; skeleton extraction, where the morphological operation in MATLAB (bwmorph) and Zhang's algorithm are used for comparison; two-tone image. Data transformation: gridding/indexing the image; 2D spatial data; numerical dataset (with normalization). Grouping: spatial grouping by hierarchical, K-means, DBScan, and LP methods. Display: color map output.
extract the density information from the pictures. In our case of a sensor network, however, we can safely assume that the data fed from a net of sensors would have sensor IDs attached. The sensor IDs are known, and so are their positions. From the locations of the sensors and their sensor IDs, we can relate the collected data to their corresponding locations in x-y coordinates (assuming the terrain is 2D). In order to reduce the huge amount of calculation and storage space, a grid is used to divide the whole map into smaller pieces. The grid indexing operation is repeated for a range of different coarseness layers, thereby providing different resolutions of data partitions. A similar technique, computed with Euclidean distance, is reported in [22]. Obviously, the method of grid indexing helps separate data into cells based on their geographic locations.
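The grid-indexing step described above can be sketched as follows: readings keyed by (x, y) position are bucketed into cells, and reapplying the same routine with different cell sizes yields the coarser and finer resolution layers. The function name and sample readings are illustrative, not from the paper.

```python
# Minimal sketch of grid indexing: (x, y, value) sensor readings are bucketed
# into grid cells, summing the values per cell. Reapplying with a different
# cell_size gives another resolution layer of the same data.
from collections import defaultdict

def grid_index(readings, cell_size):
    """Map (x, y, value) readings to grid cells, summing values per cell."""
    cells = defaultdict(float)
    for x, y, value in readings:
        cells[(int(x // cell_size), int(y // cell_size))] += value
    return dict(cells)

readings = [(1.0, 1.0, 5.0), (1.5, 1.2, 3.0), (9.0, 9.0, 7.0)]
print(grid_index(readings, 5.0))  # coarse layer: {(0, 0): 8.0, (1, 1): 7.0}
print(grid_index(readings, 2.0))  # finer layer:  {(0, 0): 8.0, (4, 4): 7.0}
```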
To obtain a better result of spatial groups for maximum coverage and the corresponding cluster center points under certain constraints, this research adopts several popular clustering methods and a linear programming method, using software programs such as XLMiner (http://www.solver.com/xlminer-data-mining), MATLAB (http://www.mathworks.com/products/matlab), and Weka (http://www.cs.waikato.ac.nz/ml/weka).
The core purpose of cluster analysis is to comprehend and to distinguish the extent of similarity or dissimilarity among the independently clustered objects. There are five major methods of clustering: KM, EM, XM, HC, and DBScan.
K-means (KM), by MacQueen, 1967, is one of the simplest algorithms that solve the well-known clustering problem [23]. It is an easy and simple method to divide a dataset into a certain number of clusters, the main idea being that the number of clusters k is fixed a priori. The random choice of the initial locations of the centroids leads to various results; a better choice is to place them as far away from each other as possible.
The KM algorithm aims at minimizing an objective function; in this case, a squared-error function:

J = Σ_{j=1}^{k} Σ_{i=1}^{n} ‖x_i^(j) − c_j‖²,   (1)

where j ranges from 1 to k, i ranges from 1 to n, and ‖x_i^(j) − c_j‖² is a chosen distance measure between a data point x_i^(j) and the cluster center c_j; J is thus an indicator of the distance of the n data points from their respective cluster centers. The sum of distances, or sum of squared Euclidean distances from the mean of each cluster, is a quite usual measure that causes scattering in all directions in the cluster, used to test the suitability of the KM algorithm. Clusters are often computed using a fast heuristic method, which generally produces good (but not necessarily optimal) solutions.
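A minimal sketch of objective (1) and one step of the usual heuristic (Lloyd-style assignment and center update) follows; the 2-D points, k = 2, and function names are illustrative assumptions, not from the paper.

```python
# Sketch of the squared-error objective J from (1) and one Lloyd-style
# update step for KM, assuming 2-D points. Data and names are illustrative.
def sq_dist(p, c):
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def objective(points, centers):
    """J = sum over points of squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def lloyd_step(points, centers):
    """Assign each point to its nearest center, then move centers to means."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        clusters[j].append(p)
    return [(sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[j] for j, cl in enumerate(clusters)]

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
centers = lloyd_step(points, [(0, 0), (10, 0)])
print(centers)                     # [(0.0, 0.5), (10.0, 0.5)]
print(objective(points, centers))  # 1.0
```

Iterating `lloyd_step` until the centers stop moving is the standard KM heuristic; each step can only decrease J, but the fixed point it reaches depends on the initial centers, as noted above.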
X-means (XM) [24] is a variant of KM that improves the structure part of the algorithm: division of the centers is attempted within each region, and a decision is made between the root and the children of each center by comparing the two structures. Another improved variant of KM, called EM (expectation maximization), assigns to each point a probability distribution representing its cluster membership. How many clusters to set up is decided by EM using cross-validation.
Density-based algorithms regard clusters as dense areas of objects that are separated by less dense areas [25]. Because they are not limited to looking for clusters with spherical shapes, they can produce clusters with arbitrary shapes. DBScan is a typical implementation of density-based algorithms, called density-based spatial clustering of applications with noise
Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.
[25]. The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density connected to a particular center. Any object that falls beyond a cluster is considered noise.
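The density-connectivity idea above can be sketched compactly: core points (with at least min_pts neighbours within radius eps) grow clusters through their neighbourhoods, and points reachable from no core point are labelled noise. This is a stdlib-only illustration of the DBScan idea, not the paper's implementation; parameters and data are invented.

```python
# Compact sketch of DBSCAN: core points (>= min_pts neighbours within eps)
# grow clusters through density-connectivity; unreachable points are noise,
# labelled -1. Illustrative only; eps/min_pts and the points are invented.
def dbscan(points, eps, min_pts):
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if (points[i][0]-q[0])**2 + (points[i][1]-q[1])**2 <= eps**2]
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1                  # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:   # j is also core: keep expanding
                queue.extend(j_nbrs)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The two tight groups become clusters 0 and 1 regardless of their shape, and the isolated point at (50, 50) stays noise, which is the filtering behaviour described above.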
Ward proposed a clustering method called hierarchical clustering (HC) in 1963 [27]. It seeks to form the partitions P_n, P_{n−1}, ..., P_1 in a way that minimizes the information loss within each group. At each analysis step, it considers every possible pair of clusters and combines the two clusters whose merger gives the smallest increase in "information loss," which Ward defined in terms of ESS (an error sum-of-squares criterion). The idea that supports Ward's proposal can be described most simply by thinking of a single small dataset. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0) as an example. The loss of information incurred by treating the ten scores as one unit, with a mean of 3.4, is calculated as ESS(one group) = (2 − 3.4)² + (7 − 3.4)² + ⋯ + (0 − 3.4)² = 47.28. However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. Evaluating the ESS as a sum of squares then gives four independent error sums of squares. Overall, the result that divides the 10 objects into 4 clusters has no loss of information:

ESS = ESS group1 + ESS group2 + ESS group3 + ESS group4 = 0.   (2)
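The ESS criterion in the example above can be sketched in a few lines: the ESS of a group is the sum of squared deviations from the group mean, so grouping identical scores together drives the total to zero, as in (2). The helper name is ours.

```python
# Sketch of Ward's ESS criterion: sum of squared deviations from the group
# mean. Splitting the ten example scores into groups of identical values
# makes every group's ESS zero, i.e., no information loss, as in (2).
def ess(scores):
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores)

groups = [[0, 0], [2, 2, 2, 2], [6, 6], [7, 7]]
print(sum(ess(g) for g in groups))  # 0.0
```

At each agglomerative step, Ward's method merges the pair of clusters whose union increases this quantity the least.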
The last method we adopt here is linear programming (LP), which consists of formulating and solving optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many options are possible in the answers. In spatial grouping over a large grid, many possible combinations of positioning the clusters exist. The problem here is to find a certain number of equal-size clusters over the area, while the chosen centers of the clusters must be sufficiently distant from each other so as to avoid overlapping. As an example, shown in Figure 7, three clusters have to be assigned over a spatial area in such a way that they cover certain resources. The assignment of the clusters, however, has to yield the maximum total value summed over the covered resources. In the example, the left diagram shows allocating three clusters over deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, clusters the high-altitude regions of the area. The last example tries to cover the maximum number of human inhabitants, who are concentrated at the coves. Given the many possible ways of setting up these clusters, LP is used to formulate this allocation problem with the objective of maximizing the values of the covered resources.
Assuming that the resources can be dynamic, for example, animal herds or moving targets whose positions may swarm and change over time, the optimization is a typical maximal flow problem (or max-flow problem). The optimization is a type of network flow problem in which the goal is to determine the maximum amount of flow that can occur over an arc, which is limited by some capacity restriction. This type of network might be used to model the flow of oil in a pipeline (in which the amount of oil that can flow through a pipe in a unit of time is limited by the diameter of the pipe). Traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by the number of lanes in the streets and the speed limits [28].
For our spatial clustering, we consider each cell of the grid as a node. Each node is defined as a tuple m that contains the coordinates and the value of the resource held in the node, such that m(x_i, y_i, z_i) represents the ith node, in which x_i, y_i represent the position and z_i represents the value of the resource in the node, respectively. For the clusters, each node
(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)   Repeat (through all coordinates of y)
(4)     If (boundary constraints and overlapping constraints are satisfied) Then
(5)       S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)     End-if
(7)   End-loop
(8) End-loop
(9) If size-of(S) ≥ K
(10)   Find top K clusters where max Σ z_i ⊕ C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i), ∀i ∈ C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i), ∀i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.
can potentially be a center of a cluster, and each cluster has a fixed radius of length r. The LP model for our problem is mathematically stated as follows:

Total value = ⋃_{selected clusters C_k | k=1,...,K} Σ_{m_i ∈ C_k} m_i(∗, ∗, z_i)
            = argmax_{X,Y} Σ_{0 ≤ x_i ≤ X, 0 ≤ y_j ≤ Y} Σ_{k=1}^{K} z_l ∋ m_l(x_i, y_j, z_l) ⊕ c_k,   (3)

subject to the boundary constraints 2r ≤ |x_i − x_j| and 2r ≤ |y_i − y_j| for all i and j with i ≠ j, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively; K is the maximum number of clusters; and c_k is the kth cluster under consideration in the optimization.
In order to implement the computation depicted in (3), for each node we sum the resources of each group in the shape of a diamond (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each current node in the combination is tested by considering it as the center of a cluster of radius r, hence storing the resource values of the nodes from the potential clusters into a temporary array buffer A(∗, ∗, z_i). The results from those potential clusters that
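The selection step of Pseudocode 1 and objective (3) can be sketched by brute force: every cell is a candidate centre of a diamond-shaped cluster of radius r, and among all K-combinations of centres satisfying a non-overlap constraint, the one with the greatest total covered resource wins. The grid, parameters, and function names are illustrative; we interpret non-overlap as Manhattan separation of centres greater than 2r (which makes the diamonds disjoint), and real instances would need the LP solver rather than enumeration.

```python
# Brute-force sketch of Pseudocode 1 / equation (3): pick K non-overlapping
# diamond-shaped clusters of radius r maximizing total covered resource.
# Illustrative only; enumeration is exponential in K, unlike the LP model.
from itertools import combinations

def diamond_value(grid, cx, cy, r):
    """Total resource inside the diamond |x - cx| + |y - cy| <= r."""
    rows, cols = len(grid), len(grid[0])
    return sum(grid[x][y]
               for x in range(max(0, cx - r), min(rows, cx + r + 1))
               for y in range(max(0, cy - r), min(cols, cy + r + 1))
               if abs(x - cx) + abs(y - cy) <= r)

def best_k_clusters(grid, k, r):
    rows, cols = len(grid), len(grid[0])
    centres = [(x, y) for x in range(rows) for y in range(cols)]
    best, best_val = None, -1
    for combo in combinations(centres, k):
        # non-overlap: centres more than 2r apart in Manhattan distance
        if any(abs(a[0] - b[0]) + abs(a[1] - b[1]) <= 2 * r
               for a, b in combinations(combo, 2)):
            continue
        val = sum(diamond_value(grid, x, y, r) for x, y in combo)
        if val > best_val:
            best, best_val = combo, val
    return best, best_val

grid = [[5, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 9]]
centres, total = best_k_clusters(grid, k=2, r=1)
print(total)  # 18: one diamond over the 5-corner, one over the 9-corner
```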
5 Experimental Results and Analysis
In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place. The resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point
Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                     Bwmorph function       Thinning algorithm
                     Dataset 1  Dataset 2   Dataset 1  Dataset 2
Degree of thinning   Incomplete             Complete
Elapsed time (secs)  20         38          100        198
Complexity           O(n)                   O(n²)
of the roads, whereby a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.
5.1 Data Preprocessing. Two factual datasets are used for the experiments. The first dataset, published by the Maricopa Association of Governments in 2008, is a traffic volume map. Traffic volumes were derived from the national traffic recording devices, with seasonal variation factored into the volumes. The second dataset is the annual average daily traffic of the Baltimore County Traffic Volume Map in 2011, USA, prepared by the Maryland Department of Transportation and published on March 19, 2012. The traffic count estimates are derived by taking 48-hour machine-count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.
After applying skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites where they can see the traffic volume data associated with our two datasets: (a) representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (http://phoenix.gov/streets/trafficvolume-map); (b) representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (http://www.marylandroads.com/Traffic_Volume_Maps/Traffic_Volume_Maps.pdf). The corresponding results of skeleton extraction
Table 2: Important statistics from the clustering and LP experiments.

Method  Cluster    Number of cells covered  Minimum   Maximum    Overlap
KM      Cluster 1  428                      0         3499327    0
        Cluster 2  468                      0         546896     0
        Cluster 3  448                      0         20503007   0
        Cluster 4  614                      0         6894667    0
        Cluster 5  618                      0         900908     0
XM      Cluster 1  615                      0         591265     0
        Cluster 2  457                      0         546896     0
        Cluster 3  609                      0         900908     0
        Cluster 4  465                      0         3499327    0
        Cluster 5  430                      0         20503007   0
EM      Cluster 1  1223                     0         2292       61817229
        Cluster 2  7                        141048    243705     313018
        Cluster 3  81                       0         3033733    131146577
        Cluster 4  64                       26752     546896     330881249
        Cluster 5  1201                     0         1300026    217950471
DB      Cluster 1  13                       23614     33146      327222911
        Cluster 2  11                       1686825   21001      363965818
        Cluster 3  13                       178888    2945283    196118393
        Cluster 4  11                       847733    211008     58940877
        Cluster 5  2528                     0         546896     20554176
HC      Cluster 1  291                      0         3499327    0
        Cluster 2  191                      0         20503007   96762283
        Cluster 3  294                      0         1590971    0
        Cluster 4  224                      0         189812     12673555
        Cluster 5  243                      0         546896     0
LP      Cluster 1  221                      0         3499327    0
        Cluster 2  221                      0         20503007   0
        Cluster 3  221                      0         1590971    0
        Cluster 4  221                      0         189812     0
        Cluster 5  221                      0         546896     0
Table 3: Comparison of running time (in seconds) for the first dataset.

Formats             KM    HC     DBScan  XM    EM    LP
Vector database     3.27  12.52  23.24   2.78  9.30  1.83
Raster database     3.42  15.36  28.20   2.84  9.84  2.01
RasterP (16 grids)  1.98  1.34   5.08    0.46  0.57  0.78
RasterP (25 grids)  0.09  0.14   1.15    0.21  0.12  0.53
The corresponding result of skeleton extraction for dataset 1 is shown in Figure 8, where (a) adopted a morphological operation method and (b) adopted the thinning algorithm, respectively. Likewise, the corresponding result of skeleton extraction for the second dataset is shown in Figure 9, where (a) adopted a morphological operation method and (b) adopted the thinning algorithm, respectively. The comparison result of the two datasets is shown in Table 1.
For each raw dataset, we first perform image preprocessing over it to obtain a numerical database.
The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing. The clustering by grid can then be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.
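The thinning step can be sketched with a minimal Zhang-Suen-style implementation (a common two-pass thinning scheme whose nested loops mirror the two-layer iteration mentioned above; this is an illustrative stand-in, not the authors' actual program):

```python
import numpy as np

def zhang_suen_thinning(img):
    """Iteratively peel boundary pixels until a roughly one-pixel-wide skeleton remains."""
    img = img.astype(np.uint8).copy()
    changed = True
    while changed:
        changed = False
        for step in (0, 1):                      # the two sub-iterations of Zhang-Suen
            to_delete = []
            for r in range(1, img.shape[0] - 1):
                for c in range(1, img.shape[1] - 1):
                    if img[r, c] == 0:
                        continue
                    # 8-neighbourhood, clockwise from north
                    p = [img[r-1, c], img[r-1, c+1], img[r, c+1], img[r+1, c+1],
                         img[r+1, c], img[r+1, c-1], img[r, c-1], img[r-1, c-1]]
                    b = sum(p)                                           # object neighbours
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))  # 0->1 transitions
                    if step == 0:
                        cond = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                    else:
                        cond = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((r, c))
            for r, c in to_delete:
                img[r, c] = 0
                changed = True
    return img

# toy "road": a 3-pixel-thick horizontal bar thins towards a 1-pixel line
bar = np.zeros((7, 12), np.uint8)
bar[2:5, 1:11] = 1
skel = zhang_suen_thinning(bar)
```

The per-sub-iteration collect-then-delete order is what makes the result independent of scan direction.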
The placement of a grid on the image follows one principle: grid boundaries should not fall on concentrated positions of traffic flow. Since there is no natural endpoint, the midpoint between two adjacent values was taken as a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. This digital data for the traffic map serves as the initial data for the subsequent clustering process.
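The grid digitization step can be sketched as follows (a hypothetical Python fragment; the cell size, toy points, and CSV output stand in for the demarcation intervals and the Excel file mentioned above):

```python
import numpy as np

# hypothetical thinned-map points: (x, y) skeleton pixels with a traffic volume v
points = np.array([[1.2, 3.1, 120.0],
                   [1.4, 3.0,  80.0],
                   [7.9, 6.2, 240.0]])

cell = 2.0                                       # demarcation interval between grid lines
ix = (points[:, 0] // cell).astype(int)          # column index of each point
iy = (points[:, 1] // cell).astype(int)          # row index of each point

grid = np.zeros((int(iy.max()) + 1, int(ix.max()) + 1))
np.add.at(grid, (iy, ix), points[:, 2])          # accumulate traffic flow per grid cell

np.savetxt("traffic_grid.csv", grid, delimiter=",")  # digital data for later clustering
```

Each cell of `grid` then holds the summed traffic flow falling inside it, which is the (x, y, v) raster used below.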
5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum number of iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,
Table 4: Comparison of log-likelihood for the first dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     −12.41868  −14.07265  −13.28599  −11.9533   −12.49562
Raster database     −13.42238  −15.02863  −13.78889  −12.9632   −13.39769
RasterP (16 grids)  −12.62264  −14.02266  −12.48583  −12.39419  −12.44993
RasterP (25 grids)  −12.41868  −13.19417  −11.22207  −12.48201  −11.62048
Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.
Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.
Table 5: Comparison of running time (in seconds) for the second dataset.

Formats             KM    HC     DBScan  XM    EM     LP
Vector database     1.39  1.34   15.53   1.53  10.05  3.37
Raster database     2.41  14.78  18.34   2.17  8.23   1.96
RasterP (16 grids)  0.47  8.01   12.74   0.45  3.77   1.44
RasterP (25 grids)  0.35  6.20   10.98   0.36  2.96   1.18
the weights for the three attributes (x, y, v) of each grid cell g_i = (x_i, y_i, v_i) could be varied (fine-tuned), subject to the constraint that the weights must sum to 1. We tested several variations searching for the best clustering results: (1) the weight of v is 20%; (2) the weight of v is 40%; (3) the weight of v is 50%; (4) the weight of v is 60%; (5) the weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) the weight of v is 0; (8) the same weights except when g_i(v_i = 0); and (9) the weights of x and y are both 0 except when g_i(v_i = 0).
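The attribute weighting above can be emulated by rescaling each (x, y, v) column before clustering. A minimal NumPy k-means sketch under this assumption (hypothetical data and helper, not the XLMiner implementation):

```python
import numpy as np

def weighted_kmeans(data, weights, k=5, iters=100, seed=0):
    """K-means on (x, y, v) rows after scaling each attribute by its weight
    (weights sum to 1, as in the experiments above)."""
    scaled = data * np.asarray(weights)              # attribute weighting via rescaling
    rng = np.random.default_rng(seed)
    centers = scaled[rng.choice(len(scaled), k, replace=False)]  # random initial centroids
    for _ in range(iters):
        d = ((scaled[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)                         # assign each cell to nearest centroid
        new = np.array([scaled[labels == j].mean(0) if (labels == j).any() else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):                # converged
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(42)
grid = rng.random((200, 3))                          # toy normalized (x, y, v) grid rows
labels, _ = weighted_kmeans(grid, (0.3, 0.3, 0.4))   # e.g. case (2): weight of v is 40%
```

Setting the weight tuple to (0.5, 0.5, 0.0) reproduces case (7), where only position matters.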
In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted to measure raw numeric data, while the other two options, Jaccard's coefficients and the matching coefficient, are activated only when the data is binary.
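As a rough analogue of this workflow, agglomerative clustering with a Euclidean metric can be sketched with SciPy (illustrative only; the toy data below are not from the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
grid = rng.random((60, 3))                    # toy normalized (x, y, v) rows, one per grid cell

Z = linkage(grid, method="ward", metric="euclidean")   # Ward linkage on Euclidean distance
labels = fcluster(Z, t=5, criterion="maxclust")        # cut the dendrogram into 5 clusters
print(len(np.unique(labels)))                          # number of clusters obtained
```

The `maxclust` criterion mirrors fixing the cluster count at five, as done for both methods above.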
For the above nine cases, the results of cases (1) to (6) are similar within their respective methods, and the result of case (9) is the worst, failing to accomplish any clustering. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.
Regarding the distribution of clusters in the result of the KM clustering method, more than half of the data points are clamped into one oversized cluster; the result of this method is therefore not helpful for further operation. With the HC method, data are on average allocated into separate clusters.
Table 6: Comparison of log-likelihood for the second dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     −17.35412  −19.62367  −17.53576  −17.21513  −16.57263
Raster database     −18.15926  −20.12568  −19.70756  −18.15791  −18.48209
RasterP (16 grids)  −15.51437  −17.24736  −16.37147  −17.01283  −15.66231
RasterP (25 grids)  −14.84761  −16.63789  −15.09146  −16.67312  −16.47823
Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%; (b) clustering results with setting case (3), where the weight of v is 50%; (c) clustering results with setting case (7), where the weight of v is 0; (d) clustering results with setting case (8), where all weights are the same except when g_i(v_i = 0). In each panel, the top half uses the KM clustering method and the bottom half uses the HC method.
Table 7 Comparison of running time (in seconds) of four differentsizes of dataset
Dataset size KM HC DBScan XM EM LP100 grid cells 006 007 105 219 321 0184600 grid cells 042 295 3989 273 1905 93710000 grid cells 262 4667 9755 297 3785 242180000 grid cells 1975 18961 684 647 19831 9083
The result in Figure 10(c) is the best, as it is the only one with distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are marred by cluster overlaps; allocating critical resources to each such cluster may therefore result in a waste of resources. The degree of overlap is the least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.
The clustering results of the second dataset obtained by the two methods, KM and HC, are shown in Figure 11.
From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than for the first dataset. Moreover, there is no overlap phenomenon in the KM results; this is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two, considering even cluster distribution and overlap avoidance.
5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is no overlap in any of the resulting maps. The number of clusters is arbitrarily chosen at five.
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.124360  0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.304300  0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469000  0.711025
Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.
The results for the first dataset are shown in Figure 12. The first part (i) of each panel shows the spatial clustering result, and the second part (ii) visualizes the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.
In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This reveals that these techniques cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups exhibit the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters are similar to each other, and there is no overlap in the clustering result; for the group result, however, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, this may cause wasted resources and even false allocation. This situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (d) XM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.
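Superposition between groups can be checked mechanically once each group is summarized by a center and a radius; a small hypothetical helper (not part of the paper's toolchain):

```python
import numpy as np

def overlapping_pairs(centers, radii):
    """Two circular spatial groups overlap when the distance between their
    centers is smaller than the sum of their radii."""
    pairs = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if np.linalg.norm(centers[i] - centers[j]) < radii[i] + radii[j]:
                pairs.append((i, j))
    return pairs

centers = np.array([[0.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
radii = [2.0, 2.0, 1.0]
print(overlapping_pairs(centers, radii))   # [(0, 1)]
```

An empty list would certify the no-overlap requirement used for the group maps.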
With the same experimental setup and operating environment, the spatial clustering experiments were performed on the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) represents the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.
In Figures 13(a) and 13(e), it is noticeable that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we remove the empty cells at the boundary to reduce the size of the dataset; the clustering result is perfect, with no overlap and clusters balanced between each other, but there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while producing groups of the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
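The LP balanced-grouping idea can be sketched as a transportation-style assignment: each grid cell goes to exactly one group and every group receives the same number of cells. A small SciPy illustration (toy data; the candidate centers would come from a prior clustering pass, and this is our reading of the approach rather than the paper's exact formulation):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, k = 20, 4                       # 20 grid cells, 4 equal-sized groups
cells = rng.random((n, 2))         # toy (x, y) cell coordinates
centers = rng.random((k, 2))       # hypothetical group centers (e.g. from a first KM pass)

# cost of assigning cell i to group j = squared distance; variables x_{ij} flattened row-major
cost = ((cells[:, None, :] - centers[None, :, :]) ** 2).sum(-1).ravel()

# each cell belongs to exactly one group
A_cell = np.zeros((n, n * k))
for i in range(n):
    A_cell[i, i * k:(i + 1) * k] = 1
# each group receives exactly n // k cells -> perfectly balanced, no overlap
A_grp = np.zeros((k, n * k))
for j in range(k):
    A_grp[j, j::k] = 1

A_eq = np.vstack([A_cell, A_grp])
b_eq = np.concatenate([np.ones(n), np.full(k, n // k)])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
assign = res.x.reshape(n, k).argmax(1)
print(np.bincount(assign, minlength=k))   # group sizes (5 each at an integral vertex solution)
```

Because this constraint matrix is a transportation polytope, the LP relaxation has integral vertex solutions, so no explicit integer programming is needed.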
Visually comparing the results for the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in the spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between clusters. In the first dataset the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area; as seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart than those in the first dataset.
Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected and shown in Table 2.
Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.
The numeric results in Table 3 support the qualitative analysis by visual inspection in the previous section. Taking the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the numbers of cells covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.
6. Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. For the purpose of assessing the quality of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (or simply time), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion using the same software on the same computer. Balance measures the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of traffic volumes covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial groups from the LP method on dataset 2.
the traffic volumes covered by all the clusters, minus the overlaps, if any. The corresponding definitions are shown in the equations below:
\[
\begin{aligned}
\text{Density}\,(\text{cluster } i) &= \frac{\sum \text{Traffic Volumes}\,(\text{cluster } i)}{\text{Grid Cell Number}\,(\text{cluster } i)},\\[4pt]
\text{Coverage}\,(\text{cluster } i) &= \frac{\sum \text{Traffic Volumes}\,(\text{cluster } i)}{\sum \text{Grid Cell Number}},\\[4pt]
\text{Total Coverage} &= \sum \text{Traffic Volumes} - \text{Overlaps},\\[4pt]
\text{Proportion of Cluster } i \text{ Size (Balance)} &= \frac{\text{Grid Cell Number}\,(\text{cluster } i)}{\sum \text{Grid Cell Number}}.
\end{aligned}
\tag{4}
\]
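In code, these per-cluster statistics reduce to a few array reductions. A toy Python illustration of equation (4) (hypothetical volumes and labels; coverage here follows the prose definition, i.e., the share of the total traffic volume):

```python
import numpy as np

volumes = np.array([10.0, 0.0, 5.0, 20.0, 15.0, 0.0, 30.0, 10.0])  # traffic volume per grid cell
labels  = np.array([0,    0,   1,   1,    2,    2,   2,    0])     # cluster of each grid cell

total_volume, total_cells = volumes.sum(), len(volumes)
for i in np.unique(labels):
    mask = labels == i
    density  = volumes[mask].sum() / mask.sum()      # average volume per cell in cluster i
    coverage = volumes[mask].sum() / total_volume    # share of all traffic volume in cluster i
    balance  = mask.sum() / total_cells              # proportion of cluster size
    print(f"cluster {i}: density={density:.2f} coverage={coverage:.2f} balance={balance:.2f}")
```

Total coverage would then subtract any volume counted by more than one cluster, as in equation (4).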
6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we selected four different formats of the first dataset on which to perform the clustering algorithms. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighborhood cells of the grid are merged into a single unit; and RasterP (25 grids) means every five neighborhood cells are merged into one. In the latter two formats, the data are laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated; we selected grids of sizes 16 and 25 for these two formats. The original datasets are then encoded in the four data formats, and the formatted data are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.
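One plausible reading of the RasterP formats, merging each block of neighboring cells into a single aggregated unit, can be sketched as follows (hypothetical helper; block = 4 and block = 5 would correspond to the 16-grid and 25-grid variants if the merged unit is a 4 × 4 or 5 × 5 block):

```python
import numpy as np

def to_rasterp(raster, block):
    """Merge each block x block neighborhood of grid cells into one aggregated cell."""
    h, w = raster.shape
    h2, w2 = h - h % block, w - w % block            # trim edge cells that do not fill a block
    trimmed = raster[:h2, :w2]
    return trimmed.reshape(h2 // block, block, w2 // block, block).sum(axis=(1, 3))

volumes = np.arange(100.0).reshape(10, 10)           # toy Raster (x, y, v) grid of traffic volumes
print(to_rasterp(volumes, 5).shape)                  # (2, 2)
```

Aggregation of this kind shrinks the input quadratically, which is consistent with the much shorter running times reported for the RasterP formats.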
According to Table 3, KM spent the least running time across the four different kinds of data, and the RasterP (25 grids) dataset was processed fastest. Conversely, clustering the vector dataset with the DBScan method took the longest running time. Among the clustering methods, KM spent the least time on the different datasets and DBScan the longest.
In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using DBScan is the worst.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test performance, we also enlarged the dataset by duplicating the data map. The resulting running-time trends are shown in Table 7, and the corresponding trend lines in Figure 14.
According to Table 5, KM spent the shortest running time for the four different formats of data, and the RasterP (25 grids) dataset was processed fastest, which is expected because it abstracts every 25 cells into one.
Figure 14: Comparison of running time (in seconds) for different sizes of dataset. Legend: K-means, Hierarchical, DBScan, XMean, EM, and LP, each with a fitted exponential trend line.
On the other hand, clustering the Raster dataset using the DBScan method took the most running time. Across the six methods, KM generally spent the shortest time on the different datasets and DBScan the longest.
In Table 6, we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using KM is the worst.
In Table 7, the slowest method is DBScan and the quickest is KM. In terms of time trends, DBScan's time consumption grows in larger magnitude than the other methods', whereas the trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.
From Figure 15, we can see that for the first dataset one cluster of DBScan dominates the coverage among all clusters produced by the six methods, whereas for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is apparently larger than in the first dataset (Tables 8 and 9). This means that the second dataset, with its even data distribution, is well suited to forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a), we can see that one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, while the LP method obtains its large total density evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset: DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves absolutely balanced spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators (equations (5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the differences in grid-cell numbers across clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net; the ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the other factors are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is, considering all the performance attributes:
\[
G_l = \left|\frac{\text{Likelihood}}{\text{Time}}\right|, \tag{5}
\]
\[
G_b = \frac{\text{Difference of Balance}}{\text{Time}}, \tag{6}
\]
\[
G_d = \frac{\text{Density}}{\text{Time}}, \tag{7}
\]
\[
G_c = \frac{\text{Coverage}}{\text{Time}}, \tag{8}
\]
\[
G_o = \frac{\text{Overlap}}{\text{Time}}, \tag{9}
\]
\[
G_{\mathrm{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o, \tag{10}
\]
\[
\text{subject to } \omega_l + \omega_d + \omega_b + \omega_c + \omega_o = 1. \tag{11}
\]
From the results of the spatial grouping experiments in the previous sections, we obtain statistics on each group for the second dataset in terms of the indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best running time and no overlap, as does the XM method, while DBScan and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. To further verify the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods to a base value of 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
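The G_net computation and its normalization can be sketched as follows (the numbers below are illustrative placeholders, not the paper's measurements; LP's missing log-likelihood is filled with a hypothetical value):

```python
# equal factor weights, summing to 1 as required by constraint (11)
w = {"l": 0.2, "b": 0.2, "d": 0.2, "c": 0.2, "o": 0.2}

# hypothetical per-method measurements: coverage, density, time, log-likelihood, overlap, balance diff
methods = {
    "KM": {"cov": 0.60, "den": 39.0, "t": 0.41,  "ll": -17.35, "ov": 0.0, "bal": 19.0},
    "HC": {"cov": 0.68, "den": 59.8, "t": 14.78, "ll": -20.13, "ov": 5.0, "bal": 10.3},
    "LP": {"cov": 0.71, "den": 54.4, "t": 7.76,  "ll": -17.00, "ov": 0.0, "bal": 0.0},
}

def g_net(m):
    G_l = abs(m["ll"]) / m["t"]          # (5)
    G_b = m["bal"] / m["t"]              # (6)
    G_d = m["den"] / m["t"]              # (7)
    G_c = m["cov"] / m["t"]              # (8)
    G_o = m["ov"] / m["t"]               # (9)
    return w["l"]*G_l + w["b"]*G_b + w["d"]*G_d + w["c"]*G_c + w["o"]*G_o  # (10)

scores = {name: g_net(m) for name, m in methods.items()}
base = min(scores.values())                        # lowest G_net becomes the base value 1
normalized = {name: s / base for name, s in scores.items()}
```

Adjusting the entries of `w` (while keeping their sum at 1) expresses the user's priorities among the six factors.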
According to the experimental results conducted so far, LP appears to be the best candidate in almost all aspects, such as coverage and balance, as tested across different datasets, formats, and dataset sizes. However, for density and log-likelihood the result is less consistent, as LP is at times outperformed by DBScan. Finally, by the net result G_net, LP is the better choice under overall consideration of the six performance factors. The weights, which express priorities or preferences among the performance aspects, should be chosen at the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters.
Table 12: Performance indicators of the six methods based on dataset 2.

Method  Coverage  Density  Time   Log-likelihood  Overlap  Diff. of balance
KM      0.595751  3896873  0.41   −17.35          No       190
XM      0.533037  3486653  0.67   −17.22          No       185
EM      0.507794  6819714  1.23   −16.57          Yes      1216
DBScan  0.461531  8230647  15.67  −17.54          Yes      2517
HC      0.677124  5981504  14.78  −20.13          Yes      103
LP      0.711025  5440447  7.76   N/A             No       0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32
Such clusters can serve purposes like resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify optimal spatial groups of certain sizes and positions, using clustering algorithms or their equivalent, for obtaining maximum total coverage. Examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by different methods are sufficiently efficient for achieving the optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory); performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause wasted resources and even false grouping. However, to the best of the authors' knowledge, no study reported in the literature has used a linear programming (LP) method to discover spatial groups and overcome this limit of overlapping. Thus, in this research, we implemented this new method (LP) to obtain spatial groups that yield maximum coverage and completely avoid overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to further enhance the algorithm, for example by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. It would be good for the advantages of one algorithm to carry over to the others in the fusion algorithms to be developed.
References
[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.
[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.
[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.
[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.
[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.
[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.
[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.
[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.
[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.
[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.
[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.
[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.
[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.
20 International Journal of Distributed Sensor Networks
[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.
[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.
[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.
[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.
[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.
[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.
[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.
[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.
[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.
[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.
[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.
determines not only how we should distribute the sensors but also, after the deployment, how these logical groups are formed, possibly for further applications.
For the experiments, we applied several clustering algorithms that are classical and popular in the data mining research community. The effectiveness of the different clustering algorithms was measured for comparison; however, none of the clustering algorithms achieved the best results in all cases. In the end, we developed a simple and novel method based on linear programming for the optimization, which we call LP. LP is shown to be able to achieve optimal grouping over the different configurations and cases of the experiments.
The contribution of the paper is an in-depth investigation into the grouping problem that arises right after the deployment of a ubiquitous sensor network. We propose a novel solution that achieves optimal groups by using linear programming, after putting several clustering algorithms to the test.
The remainder of the paper is structured as follows. Section 2 introduces the background techniques of spatial clustering. Section 3 surveys spatial data representation, that is, how spatial data are encoded for postprocessing. Section 4 describes our methodology for obtaining optimal groups over spatial data. Section 5 reports on the experiments. Section 6 analyses and compares the experimental results. Section 7 concludes the paper.
2. Overview of Spatial Data Clustering Techniques
Clustering is the organization of a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure [6]. Spatial data clustering has numerous applications in pattern recognition [7, 8], spatial data analysis [9–11], market research, and so forth [12, 13], in which data are gathered to find implicit patterns and noteworthy items among geographical datasets. Spatial data clustering is an important instrument of spatial data mining, which has become a powerful and popular tool for the efficient and complex analysis of huge spatial databases with geometric features [12]. Conventional clustering algorithms are classified into four types: partitioning methods, hierarchical methods, density-based methods, and grid-based methods.
However, with the extension of research objects and scope, shortcomings have been discovered: many existing spatial clustering algorithms cannot cluster reliably in the presence of irregular obstacles. A grid-density-based hierarchical clustering (HC) algorithm has been proposed to tackle this problem; the advantage of a grid-based clustering algorithm is that it reduces the amount of calculation. An alternative approach [14] can effectively form clusters in the presence of obstacles, and the shapes of the clusters can be arbitrary. Moreover, the hierarchical strategy is used to reduce the complexity in the presence of obstacles and constraints and to improve the operational efficiency [13]; the result is that it can deal with spatial clustering in the face of obstacles and constraints and obtain better performance. When some data points cannot be placed in any cluster by density clustering, this situation is managed by using grid-based HC instead in this study. Each clustering algorithm has its individual advantages and disadvantages.
The partitioning approach separates N objects into m groups, where m satisfies the following constraints: firstly, each group contains at least one object; secondly, each object must belong to exactly one group. In order to achieve a global optimum in grouping, it would be necessary to enumerate all possible partitions; in practice, most applications adopt KM, K-medoids, or fuzzy analysis. However, this partitioning method suffers from some problems when applied to spatial mining of clustered objects, especially objects that are obstructed by environmental conditions such as rivers, where it is hard to recognize the comparability.
The partitioning method can be carried out efficiently regardless of the number of objects, whereas cluster analysis algorithms in general cannot deal with large datasets; it is recommended that the maximum number of objects handled by such methods be no more than 1000. Partition-based clustering is a stochastic search, and owing to its low efficiency, its capability is much affected by the random selection of the stochastic initial values [14].
HC is another popular clustering method; it is more flexible than partitioning-based clustering but has a higher time complexity. HC algorithms create a hierarchical decomposition (a.k.a. dendrogram) of a dataset based on some criterion [15, 16]. According to the direction in which the hierarchical decomposition is generated, there are two types of HC methods: agglomerative and divisive. An agglomerative algorithm starts from the leaves and combines clusters in a bottom-up way; a divisive algorithm starts at the root and recursively separates the clusters in a top-down way. The process continues until a stopping criterion is met, usually when the required k clusters are obtained. However, this hierarchical method has some problems: the termination criteria are vague, and once a step is complete, it cannot be revoked. Density-based methods work differently: as long as the neighborhood density (the number of objects or data points) does not grow over a certain threshold, clustering continues [17]. In other words, for a given point in a cluster, the neighborhood of a given radius must contain at least a minimum number of points. As a result, noisy data can be filtered out and better clusters with arbitrary shapes can be found. DBScan and its extension, called OPTICS, are two classical density-based methods; they perform clustering based on density-based connectivity.
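As an illustration of the agglomerative (bottom-up) strategy just described, the following toy sketch (our own illustration, not code from this study) repeatedly merges the two closest clusters of 1-D points under single-linkage distance until k clusters remain:

```python
# Toy agglomerative (bottom-up) hierarchical clustering on 1-D points.
# Single-linkage distance; merging continues until k clusters remain.
def single_linkage(a, b):
    # distance between two clusters = distance of their closest pair of members
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, k):
    clusters = [[p] for p in points]          # start: every point is a leaf
    while len(clusters) > k:
        # find the pair of clusters with minimum single-linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))   # merge cluster j into cluster i
    return clusters

print(agglomerate([0, 1, 9, 10, 25], 2))
```

Real implementations (e.g., those in the data mining packages used later in the paper) use optimized linkage updates rather than this quadratic rescan.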
The grid-based method quantizes the object space into a finite number of cells, forming a grid structure on which all clustering operations are performed. The main benefit of this method is its high speed: the run time is usually not restricted by the data size but depends only on the number of cells in each dimension. The algorithm STING (statistical information grid-based method) [18] works with numerical attributes (spatial data) and is designed to facilitate "region-oriented" queries.
Nevertheless, the spatial groups obtained by classic algorithms have certain limitations: overlaps cannot be controlled, and the maximum coverage by the resultant groups is not guaranteed. Overlaps lead to wasted resources and potentially to resource mismatch. Besides spatial clustering, this situation occurs in other fields of application, such as information retrieval (several themes for a single document), biological data (several metabolic functions for one gene), and military applications (discovering object-dense regions independently). However, there has been no study reported in the literature, as far as the authors are aware, that applies the LP method to discover spatial groups free of the limitations inherited from clustering algorithms. Thus, this research provides an alternative method for obtaining spatial groups with maximum coverage in a real environment. Maximum coverage in this context is defined as the greatest possible area of effect covered by the spatial groups, with no or minimal overlap among the groups.
3. Spatial Data Representation
Two main categories of spatial data representation exist: spatial data and attribute data. Spatial data refers to georeferenced data on the earth, such as maps, photographs, and satellite imagery. Though these representation techniques originated from GIS, the underlying coding formats are common to those of wireless sensor networks, as long as the networks are distributed over a wide spatial area. Generally, spatial data represents geographic features by complete and relative locations. Attribute data represents the characteristics of spatial features, which can be quantitative and/or qualitative in the real world; attribute data is often referred to as tabular data. In our experiments, we test both types of data models against different clustering algorithms for a thorough investigation.
3.1. Spatial Data Model. In the early days, spatial data was stored and represented in map format. There are three fundamental types of spatial data models for recording geographic data digitally: vector, raster, and image.
Figure 1 illustrates the encoding techniques of the two important types of spatial data [19], raster and vector, over a sample aerial image of the Adriatic Sea and coast in Italy. The image type of encoding is very similar to raster data in terms of usability of techniques, but it has limits in its internal formats when it comes to modeling and analysis of the data. Images represent photographs or pictures of the landscape as a coarse matrix of pixel values.
3.2. Vector Data Model. The three kinds of aforementioned spatial data models are used for storing geographic locations with spatial features in a dataset. The vector data model uses x, y coordinates to define the locations of features; these mark points, lines, areas, or polygons. Therefore, vector data tend to define the centers, edges, and outlines of features, characterizing each feature by linear segments made of sequential points or vertices. A vertex consists of a pair of x and y coordinates, and the beginning or ending of a node is defined at each vertex with an arc segment. A single coordinate pair of vertices defines a feature point, and a group of coordinate pairs defines polygonal features. In vector representation, the storage of the vertices of each feature is important, as is the connectivity between features, that is, the sharing of common vertices where features connect.
Using equally sized polygons, we divide a complete map into small units based on the character of our database, represented as (x, y, v), where x and y form a coordinate pair that references the spatial position, and v represents something of interest, simply called a "feature", which could be a military target, a critical resource, or just an inhabitant clan, for example. The greater the v, the more valuable the feature is. In spatial grouping for maximum coverage, we opt to include those features that amount to the highest total value. A sample of the vector format that represents a spatial location in 2D is shown in Figure 2 [19].
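A minimal sketch of this (x, y, v) feature representation follows; the coordinates and values are illustrative only, not taken from the datasets used later:

```python
# Each feature is a point with a referenced position (x, y) and a value v.
features = [
    (2, 3, 80),   # (x, y, v): e.g. a critical resource of value 80
    (5, 1, 45),
    (9, 3, 74),   # position (9, 3) as in Figure 2; the value 74 is made up
    (7, 8, 62),
]

# For maximum-coverage grouping we prefer the features with the highest values:
top = sorted(features, key=lambda f: f[2], reverse=True)[:2]
print(top)  # the two most valuable features
```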
3.3. Raster Data Model. Raster data models make use of a grid of squares to define where features are located. These squares, also called pixels or cells, typically are of uniform size.
From our dataset, we separate the whole image by imposing a grid on it, hence producing many individual features, with one feature corresponding to each cell. We consider using the raster data model to represent the dataset, and we store the features in the following two different encoding formats.
(1) Raster data are stored as an ordered list of cell values in pairs (i, v), where i is a sequential cell index and v is the value of the ith feature, for example, (1, 80), (2, 80), (3, 74), (4, 62), (5, 45), and so on, as shown in Figure 3.
(2) Raster data are stored as points (x, y, v), with x and y as position coordinates locating the corresponding spatial feature with value v, for example, (1, 1, 513), (1, 2, 514), (1, 3, 517), (2, 1, 512), (2, 2, 515), and so on, as shown in Figure 4. In this case, the value v refers to the center point of the grid cell. This encoding is useful for representing values measured at the center point of a cell, for example, a raster of elevation.
(3) During the experiment, the grid size is transformed for efficient operation, so we put i² cells together as one unit representing one new grid cell, as shown in Figure 5.
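The two encodings and the cell aggregation above can be sketched as follows; the 2 × 4 grid and its values are made up, and taking the mean of a 2 × 2 block is just one possible way to combine cells when regridding:

```python
# Sketch of the two raster encodings described above, on a toy 2 x 4 grid.
grid = [
    [80, 74, 62, 45],
    [80, 74, 74, 62],
]
rows, cols = len(grid), len(grid[0])

# Encoding 1: ordered list of (i, v), with i a sequential 1-based cell index.
ordered = [(r * cols + c + 1, grid[r][c]) for r in range(rows) for c in range(cols)]

# Encoding 2: centre-point tuples (x, y, v).
points = [(r + 1, c + 1, grid[r][c]) for r in range(rows) for c in range(cols)]

# Aggregation: merge each 2 x 2 block into one coarser cell (here: mean value),
# mirroring the "i^2 cells as one unit" regridding used to speed up computation.
coarse = [
    [sum(grid[r + dr][c + dc] for dr in (0, 1) for dc in (0, 1)) / 4.0
     for c in range(0, cols, 2)]
    for r in range(0, rows, 2)
]
print(ordered[:3], coarse)
```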
In particular, the quadtree data structure was found to be useful for storing the data, as an alternative encoding method to the raster data model. Raster data embraces digital aerial photographs, imagery from satellites, digital pictures, and even scanned maps. Details on how different sorts of objects, such as points, lines, polygons, and terrain, are represented by the data models can be found in [19–21].
4. Proposed Methodology
The aim of the methodology is to determine a certain number of clusters and their corresponding locations from some
Figure 1: Representation of how a real-world spatial area is represented by vector and raster encoding formats.
Figure 2: Vector format. The grid axes show column (width) and row (height) positions; the example point has x, y coordinates (9, 3).
80 74 62 45 45 34 39 56
80 74 74 62 45 34 39 56
74 74 62 62 45 34 39 39
62 62 45 45 34 34 34 39
45 45 45 34 34 30 34 39

Figure 3: Raster format stored as an ordered list of cell values.
collected spatial data. In this process, different methods are tested in order to choose the one that covers the most area as well as the highest feature values from the suggested clusters. The flow of this process, including the preprocessing of sensor data,
Figure 4: Raster data with values referenced to the cell center points.
Figure 5: Raster format regridded with 2² and 3² cell aggregation.
data transformation, clustering, and finding cluster center points, is shown in Figure 6.
In the case of a satellite image, or an image captured by a fighter jet or other surveillance camera, image processing is needed to
Figure 6: Workflow of the proposed methodology. Preprocessing of the image: a loaded spatial RGB image is converted to a gray image and skeleton-extracted into a two-tone image (the MATLAB morphological operation bwmorph and Zhang's algorithm are the two methods used for comparison). Data transformation: the two-tone image is gridded/indexed into an indexed grid image, giving 2D spatial data as a numerical dataset (with normalization). Grouping: spatial grouping by hierarchical clustering, K-means, DBScan, and LP. Display: output as a color map.
extract the density information from the pictures. In our case of a sensor network, however, we can safely assume that the data fed from a net of sensors would have the sensor ID attached. The sensor IDs are known, and so are their positions. From the locations of the sensors and their sensor IDs, we can relate the data that was collected to the corresponding locations in the x-y coordinate format (assuming the terrain is 2D). In order to reduce the huge amount of calculation and storage space, a grid is used to divide the whole map into smaller pieces. The grid indexing operation is repeated for a range of different coarse layers, thereby providing different resolutions of data partitions. A similar technique, computed with Euclidean distance, is reported in [22]. Obviously, the method of grid indexing helps separate data into cells based on their geographic locations.
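A minimal sketch of this grid-indexing step, assuming readings arrive as (x, y, value) triples tagged by known sensor positions (the function name and cell size below are our own choices):

```python
# Bin sensor readings (x, y, value) into square grid cells of a chosen size,
# accumulating the value (e.g. traffic volume) observed within each cell.
def grid_index(readings, cell):
    cells = {}
    for x, y, v in readings:
        key = (int(x // cell), int(y // cell))   # cell that contains (x, y)
        cells[key] = cells.get(key, 0) + v       # accumulate the cell's total
    return cells

readings = [(0.5, 0.5, 10), (0.9, 0.2, 5), (3.1, 0.4, 7)]
print(grid_index(readings, cell=1.0))  # {(0, 0): 15, (3, 0): 7}
```

Re-running with a larger `cell` gives the coarser layers mentioned above.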
To obtain a better result of spatial groups for maximum coverage, and the corresponding cluster center points under certain constraints, the research adopts several popular clustering methods and a linear programming method, using software programs such as XLMiner (http://www.solver.com/xlminer-data-mining), MATLAB (http://www.mathworks.com/products/matlab), and Weka (http://www.cs.waikato.ac.nz/ml/weka).
The core purpose of cluster analysis is to comprehend and to distinguish the extent of similarity or dissimilarity among the independently clustered objects. Five major clustering methods are considered: KM, EM, XM, HC, and DBScan.
K-means (KM), by MacQueen (1967), is one of the simplest algorithms that solve the well-known clustering problem [23]. It is an easy and simple method to divide a dataset into a certain number of clusters, the main idea being to assume initially that the number of clusters is k, fixed a priori for each cluster. The random choice of the initial locations of the centroids leads to various results; a better choice is to place them as far away from each other as possible.
The KM algorithm aims at minimizing an objective function; in this case, a squared error function is used, as follows:

J = Σ_{j=1}^{k} Σ_{i=1}^{n} ‖x_i^{(j)} − c_j‖², (1)

where ‖x_i^{(j)} − c_j‖² is a chosen distance measure between a data point x_i^{(j)} and the cluster center c_j, so that J is an indicator of the distance of the n data points from their respective cluster centers. The sum of distances, or the sum of squared Euclidean distances from the mean of each cluster, is a quite usual measure of scattering in all directions within a cluster and is used to test the suitability of the KM algorithm. Clusters are often computed using a fast heuristic method, which generally produces good (but not necessarily optimal) solutions.
X-means [24] is an improved variant of KM that optimizes the structure part of the algorithm: splitting of the centers is attempted within each region, and a decision between keeping a parent center or its children is made by comparing the two structures. Another improved variant of KM, called EM (expectation maximization), assigns to each point a probability distribution over the clusters rather than a hard membership. The number of clusters to set up is decided by EM using cross-validation.
Density-based algorithms regard clusters as dense areas of objects that are separated by less dense areas [25]. Because they are not limited to looking for clusters with spherical shape, they can produce clusters with arbitrary shapes. DBScan is a typical implementation of density-based algorithms, called density-based spatial clustering of applications with noise
Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.
[25]. The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density-connected to a particular center. Any object that falls beyond a cluster is considered noise.
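The density-reachability idea can be sketched with a toy 1-D DBScan; the `eps` and `min_pts` parameters below are illustrative, and real implementations index the neighbourhood queries instead of scanning all points:

```python
# Toy DBScan: clusters are grown from core points (points with at least
# min_pts neighbours within eps); unreachable points are labelled noise (-1).
def dbscan(points, eps, min_pts):
    labels = {}                            # point index -> cluster id
    def neighbours(i):
        return [j for j in range(len(points))
                if abs(points[i] - points[j]) <= eps]
    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:           # not a core point: noise (for now)
            labels[i] = -1
            continue
        cid += 1
        labels[i] = cid
        while seeds:                       # expand the cluster from core points
            j = seeds.pop()
            if labels.get(j, -1) == -1:
                labels[j] = cid            # border point or reclaimed noise
                nj = neighbours(j)
                if len(nj) >= min_pts:     # j is itself a core point
                    seeds.extend(k for k in nj if k not in labels)
    return labels

pts = [1.0, 1.1, 1.2, 5.0, 5.1, 9.9]
print(dbscan(pts, eps=0.3, min_pts=2))
```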
Ward proposed the clustering method called hierarchical clustering (HC) in 1963 [27]. It seeks the sequence of partitions P_n, P_{n−1}, ..., P_1 formed so as to minimize the information loss of each merge: at each analysis step, every possible pair of clusters is considered, and the two clusters whose union yields the smallest increase in "information loss" are combined. Ward defined information loss in terms of ESS (an error sum-of-squares criterion). The idea supporting Ward's proposal can be described most simply by a small numerical example. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0). Treating the ten scores as one unit, with mean 3.4, the loss of information is found by calculating the ESS as follows: ESS_one group = (2 − 3.4)² + (7 − 3.4)² + ··· + (0 − 3.4)² = 70.4. However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. Evaluating the ESS of this partition as a sum of squares, we obtain four independent error sums of squares, one per group; overall, dividing the 10 objects into these 4 clusters incurs no loss of information:

ESS_four groups = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0. (2)
The last method we adopt here is linear programming (LP), which consists of formulating and solving optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many options are possible in the answers. In spatial grouping over a large grid, many possible combinations of positioning the clusters exist. The problem here is to find a certain number of clusters of equal size over the area, while the chosen centers of the clusters must be a sufficient distance apart from each other so as to avoid overlapping. As an example, shown in Figure 7, three clusters have to be assigned over a spatial area in such a way that they cover certain resources, and the assignment of the clusters has to yield a maximum total value summed from the covered resources. In the example, the left diagram shows allocating three clusters over deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, clusters the high-altitude parts of the area. The last example tries to cover the maximum number of human inhabitants, who are concentrated at the coves. Given the many possible ways of setting up these clusters, LP is used to formulate this allocation problem with the objective of maximizing the values of the covered resources.

Assuming that the resources can be dynamic, for example, animal herds or moving targets whose positions may swarm and change over time, the optimization is a typical maximal-flow problem (or max-flow problem). This is a type of network flow problem in which the goal is to determine the maximum amount of flow that can occur over an arc that is limited by some capacity restriction. This type of network might be used to model the flow of oil in a pipeline (in which the amount of oil that can flow through a pipe in a unit of time is limited by the diameter of the pipe). Traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by the number of lanes in the streets and the speed limits [28].
For our spatial clustering, we consider each cell of the grid as a node. Each node is defined as a tuple m that contains the coordinates and the value of the resource held in the node, such that m(x_i, y_i, z_i) represents the ith node, where x_i and y_i represent the position and z_i represents the value of the resource in the node. For the clusters, each node
(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)   Repeat (through all coordinates of y)
(4)     If (boundary constraints and overlapping constraints are satisfied) Then
(5)       S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)     End-if
(7)   End-loop
(8) End-loop
(9) If size-of(S) ≥ K
(10)   Find top K clusters where max Σ z_i ⊕ C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i), ∀i ∈ C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i), ∀i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.
can potentially be the center of a cluster, and each cluster has a fixed radius of length r. The LP model for our problem is mathematically expressed as follows:

Total value = Σ_{selected clusters ⟨C_k | k=1,...,K⟩} Σ_{m_i ∈ C_k} z_i = argmax_{0 ≤ x_i ≤ X, 0 ≤ y_i ≤ Y} Σ_{k=1}^{K} Σ_{m_i ∈ c_k} z_i, (3)

subject to the boundary and nonoverlapping constraints 2r ≤ |x_i − x_j| and 2r ≤ |y_i − y_j| for all i and j with i ≠ j, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively, K is the maximum number of clusters, and c_k is the kth cluster under consideration in the optimization.

In order to implement the computation depicted in (3), for each node we sum the resources of the group in the shape of a diamond (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each current node in the combination is tested by considering it as the center of a cluster of radius r, and the resource values of the nodes from the potential clusters are stored in a temporary array buffer A(*, *, z_i). The results from those potential clusters that satisfy the boundary and nonoverlapping constraints are then copied to a candidate buffer S. Out of the clusters whose resource values are stored in the candidate buffer S, the combination of K clusters that has the greatest total resource value is selected, and its values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
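A brute-force sketch of this exhaustive search follows (our own illustration with made-up grid values, feasible only for toy sizes; we interpret "nonoverlapping" as requiring the L1 distance between chosen centres to exceed 2r, which is sufficient for diamond-shaped clusters):

```python
# Choose K non-overlapping diamond-shaped clusters of radius r over a resource
# grid so that the total covered value is maximal (exhaustive enumeration).
from itertools import combinations

def cluster_value(grid, cx, cy, r):
    # total resource within L1 (diamond) distance r of the centre (cx, cy)
    return sum(v for (x, y), v in grid.items() if abs(x - cx) + abs(y - cy) <= r)

def best_clusters(grid, K, r):
    best, best_val = None, -1
    for combo in combinations(list(grid), K):
        # reject centre combinations whose diamonds would overlap
        if any(abs(a[0] - b[0]) + abs(a[1] - b[1]) <= 2 * r
               for a, b in combinations(combo, 2)):
            continue
        val = sum(cluster_value(grid, x, y, r) for x, y in combo)
        if val > best_val:
            best, best_val = combo, val
    return best, best_val

# toy 4 x 4 grid; the resource value of cell (x, y) is simply x + y
grid = {(x, y): x + y for x in range(4) for y in range(4)}
centres, total = best_clusters(grid, K=2, r=1)
print(centres, total)
```

The enumeration over all K-subsets of cells grows combinatorially, which is why the paper contrasts this optimal-but-slow search with the faster heuristic clustering algorithms.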
5. Experimental Results and Analysis
In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place; the resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point
Table 1: Comparison between Bwmorph function and thinning algorithm.

                       Bwmorph function         Thinning algorithm
                       Dataset 1   Dataset 2    Dataset 1   Dataset 2
Degree of thinning     Incomplete               Complete
Elapsed time (secs)    20          38           100         198
Complexity             O(n)                     O(n^2)
of the roads, thereby a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.
5.1. Data Preprocessing. Two factual datasets are used for the experiments. The first dataset, published by the Maricopa Association of Governments in 2008, is a traffic volume map. Traffic volumes were derived from the national traffic recording devices, and seasonal variation is factored into the volumes. The second dataset is the annual average daily traffic of the Baltimore County Traffic Volume Map in 2011, USA, prepared by the Maryland Department of Transportation and published by March 19, 2012. The traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.
After using skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites where they can see the traffic volume data associated with our two datasets: (a) the representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (http://phoenix.gov/streets/trafficvolumemap); (b) the representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (http://www.marylandroads.com/Traffic_Volume_Maps/Traffic_Volume_Maps.pdf). And the corresponding result of skeleton extraction
8 International Journal of Distributed Sensor Networks
Table 2: Important statistics from the clustering and LP experiments.

Method   Cluster     Number of cells covered   Minimum   Maximum    Overlap
KM       Cluster 1   428                       0         3499327    0
         Cluster 2   468                       0         546896     0
         Cluster 3   448                       0         20503007   0
         Cluster 4   614                       0         6894667    0
         Cluster 5   618                       0         900908     0
XM       Cluster 1   615                       0         591265     0
         Cluster 2   457                       0         546896     0
         Cluster 3   609                       0         900908     0
         Cluster 4   465                       0         3499327    0
         Cluster 5   430                       0         20503007   0
EM       Cluster 1   1223                      0         2292       61817229
         Cluster 2   7                         141048    243705     313018
         Cluster 3   81                        0         3033733    131146577
         Cluster 4   64                        26752     546896     330881249
         Cluster 5   1201                      0         1300026    217950471
DB       Cluster 1   13                        23614     33146      327222911
         Cluster 2   11                        1686825   21001      363965818
         Cluster 3   13                        178888    2945283    196118393
         Cluster 4   11                        847733    211008     58940877
         Cluster 5   2528                      0         546896     20554176
HC       Cluster 1   291                       0         3499327    0
         Cluster 2   191                       0         20503007   96762283
         Cluster 3   294                       0         1590971    0
         Cluster 4   224                       0         189812     12673555
         Cluster 5   243                       0         546896     0
LP       Cluster 1   221                       0         3499327    0
         Cluster 2   221                       0         20503007   0
         Cluster 3   221                       0         1590971    0
         Cluster 4   221                       0         189812     0
         Cluster 5   221                       0         546896     0
Table 3: Comparison of running time (in seconds) of the first dataset.

Formats              KM     HC      DBScan   XM     EM     LP
Vector database      3.27   12.52   23.24    2.78   9.30   1.83
Raster database      3.42   15.36   28.20    2.84   9.84   2.01
RasterP (16 grids)   1.98   1.34    5.08     0.46   0.57   0.78
RasterP (25 grids)   0.09   0.14    1.15     0.21   0.12   0.53
in dataset 1 is shown in Figure 8, where (a) adopted a morphological operation method and (b) adopted the thinning algorithm, respectively. Likewise, the corresponding result of skeleton extraction in the second dataset is shown in Figure 9, where (a) adopted a morphological operation method and (b) adopted the thinning algorithm, respectively. The comparison result of the two datasets is shown in Table 1.

For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.

The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing; the clustering by grid can then be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.
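The exact thinning routine used in our experiment is not reproduced here; as an illustrative stand-in, a Zhang-Suen-style two-subiteration thinning (whose nested passes over the image match the two-layer iteration nesting noted above) can be sketched as follows. The image is a list of rows of 0/1 values; function and variable names are ours.

```python
def zhang_suen_thin(img):
    """Iterative thinning (Zhang-Suen style) of a binary image (list of rows of 0/1)."""
    img = [row[:] for row in img]  # work on a copy
    h, w = len(img), len(img[0])

    def neighbours(y, x):
        # P2..P9: N, NE, E, SE, S, SW, W, NW (clockwise)
        return [img[y-1][x], img[y-1][x+1], img[y][x+1], img[y+1][x+1],
                img[y+1][x], img[y+1][x-1], img[y][x-1], img[y-1][x-1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):          # the two sub-iterations
            to_del = []
            for y in range(1, h - 1):
                for x in range(1, w - 1):
                    if img[y][x] != 1:
                        continue
                    P = neighbours(y, x)
                    B = sum(P)       # number of foreground neighbours
                    A = sum(P[i] == 0 and P[(i + 1) % 8] == 1 for i in range(8))
                    if step == 0:
                        cond = P[0]*P[2]*P[4] == 0 and P[2]*P[4]*P[6] == 0
                    else:
                        cond = P[0]*P[2]*P[6] == 0 and P[0]*P[4]*P[6] == 0
                    if 2 <= B <= 6 and A == 1 and cond:
                        to_del.append((y, x))
            for y, x in to_del:      # delete in parallel after the scan
                img[y][x] = 0
            if to_del:
                changed = True
    return img
```

A thick bar is eroded down to a one-pixel-wide skeleton, which is the behavior exploited in the skeleton extraction step above.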
The choice of placing a grid on the image follows one principle: mesh segmentation should not fall on a concentrated position of traffic flow. Since there is no endpoint, the midpoint of two adjacent values was considered a demarcation point. Under this assumption, the traffic flow in each grid is calculated and stored digitally in an Excel file. The digital data for the traffic map serves as the initial data for the subsequent clustering process.
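As a concrete sketch of this aggregation step (the function name, cell size, and sample readings below are illustrative, not taken from the experiment), the per-cell traffic flow can be accumulated like this:

```python
def bin_to_grid(points, cell, width, height):
    """Aggregate (x, y, volume) sensor readings into square grid cells of side `cell`."""
    nx, ny = width // cell, height // cell
    grid = [[0.0] * ny for _ in range(nx)]
    for x, y, v in points:
        i = min(int(x // cell), nx - 1)  # clamp readings on the far boundary
        j = min(int(y // cell), ny - 1)
        grid[i][j] += v
    return grid
```

The resulting grid of summed volumes plays the role of the Excel table mentioned above and is the input to the clustering stage.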
5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, input variables were normalized, the number of clusters was set at five, and the maximum iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,
Table 4: Comparison of log-likelihood of the first dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -12.41868   -14.07265   -13.28599   -11.9533    -12.49562
Raster database      -13.42238   -15.02863   -13.78889   -12.9632    -13.39769
RasterP (16 grids)   -12.62264   -14.02266   -12.48583   -12.39419   -12.44993
RasterP (25 grids)   -12.41868   -13.19417   -11.22207   -12.48201   -11.62048
Figure 8: (a) Result of skeleton extraction in dataset 1 using Bwmorph function. (b) Result of skeleton extraction in dataset 1 using thinning algorithm.
Figure 9: (a) Result of skeleton extraction in dataset 2 using Bwmorph function. (b) Result of skeleton extraction in dataset 2 using thinning algorithm.
Table 5: Comparison of running time (in seconds) of the second dataset.

Formats              KM     HC      DBScan   XM     EM      LP
Vector database      1.39   1.34    15.53    1.53   10.05   3.37
Raster database      2.41   14.78   18.34    2.17   8.23    1.96
RasterP (16 grids)   0.47   8.01    12.74    0.45   3.77    1.44
RasterP (25 grids)   0.35   6.20    10.98    0.36   2.96    1.18
the weights for the corresponding three attributes (x, y, v) for each grid (g_i = (x_i, y_i, v_i)) could be varied (fine-tuned), and the sum of the weights must be equal to 1. We tested several variations searching for the best clustering results: (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes have the same weight at 33.3%; (7) weight of v is 0; (8) same weight except when g_i(v_i = 0); and (9) weights of x and y are both 0 except when g_i(v_i = 0).
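The effect of these weight settings can be illustrated with a weighted Euclidean distance over the (x, y, v) attributes. This is a sketch with our own function name; the weighting scheme inside XLMiner may differ in detail.

```python
import math

def weighted_distance(g1, g2, w):
    """Weighted Euclidean distance between two grids g = (x, y, v).

    w holds one weight per attribute; the weights must sum to 1."""
    assert abs(sum(w) - 1.0) < 1e-9
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, g1, g2)))
```

Under case (7), where the weight of v is 0, the traffic volume no longer influences the distance, so clustering degenerates into purely positional grouping.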
In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted to measure raw numeric data, while the other two options, Jaccard's coefficients and the matching coefficient, are activated only when the data is binary.
For the above nine cases, the results of cases (1) to (6) are similar within their separate methods, and the result of (9) is the worst, accomplishing no clustering at all. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.
For the distribution of clusters in the result of the KM clustering method, more than half of the data points are clamped into one oversized cluster; the result of this method is therefore not helpful for further operation. For the HC method, data on average are allocated into separate clusters. The result
Table 6: Comparison of log-likelihood of the second dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -17.35412   -19.62367   -17.53576   -17.21513   -16.57263
Raster database      -18.15926   -20.12568   -19.70756   -18.15791   -18.48209
RasterP (16 grids)   -15.51437   -17.24736   -16.37147   -17.01283   -15.66231
RasterP (25 grids)   -14.84761   -16.63789   -15.09146   -16.67312   -16.47823
Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%; top half uses KM clustering method and bottom half uses HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%; top half uses KM clustering method and bottom half uses HC method. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0; top half uses KM clustering method and bottom half uses HC method. (d) Clustering results for the first dataset with setting case (8), where all share the same weight except g_i(v_i = 0); top half uses KM clustering method and bottom half uses HC method.
Table 7: Comparison of running time (in seconds) of four different sizes of dataset.

Dataset size       KM      HC       DBScan   XM     EM       LP
100 grid cells     0.06    0.07     1.05     2.19   3.21     0.18
4600 grid cells    0.42    2.95     39.89    2.73   19.05    9.37
10000 grid cells   2.62    46.67    97.55    2.97   37.85    24.21
80000 grid cells   19.75   189.61   684      6.47   198.31   90.83
in Figure 10(c) is the best, showing only the one with distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; therefore, allocation of critical resources in each cluster may result in a waste of resources. The degree of overlap is the least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.
The clustering results of the second dataset by using the two methods, KM and HC, are shown in Figure 11.
From the results of the cluster distribution of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than that of the first dataset. And there is no overlap phenomenon in the KM results; this is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two clustering methods in consideration of even cluster distribution and overlap avoidance.
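For reference, the KM procedure applied here (random initial centroids, iteration until convergence or an iteration cap) can be sketched minimally in pure Python; the sample points and seed below are illustrative, and production runs would use a tool such as XLMiner or Weka as in our experiments.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's k-means on equal-length numeric tuples."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)          # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to nearest centroid
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                   # converged
            break
        centroids = new
    return centroids, clusters
```

Normalizing the input attributes before calling such a routine, as done above, keeps any single attribute from dominating the squared-distance computation.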
5.3. Results of Grouping. In this part we compare the colored maps of the Raster (x, y, v) data model for the two datasets using five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen at five. The
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.029436   0.003786   0.017902   0.075178   0.013153   0.028985
Cluster 1        0.301538   0.269602   0.208078   0.049761   0.026016   0.377034
Cluster 2        0.215277   0.001627   0.158439   0.084049   0.124360   0.080099
Cluster 3        0.046788   0.096221   0.079177   0.209390   0.001172   0.217204
Cluster 4        0.002712   0.161799   0.044197   0.043152   0.304300   0.007704
Total coverage   0.595751   0.533036   0.507793   0.461531   0.469000   0.711025
Figure 11: (a) Clustering results for the second dataset by using KM method. (b) Clustering results for the second dataset by using HC method.
The result of the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result; the second part (ii) visualizes the corresponding spatial groups by using (a) EM method, (b) KM method, (c) HC method, (d) XM method, and (e) DBScan method. The centers of the clusters are computed after clustering is done, and then the groups are visualized over the clustering results according to the computed centers.

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven; more than half of the dataset falls into one cluster. This result reveals that the technique cannot organize a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups have the overlap phenomenon too. For the result of (c), the sizes of the clusters are uneven too. For the results of (b) and (d), the sizes of the clusters seem to be similar to each other. There is also no overlap in the clustering result, but for the group result, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application such as information retrieval (several thematics for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (d) XM is so far the best choice of clustering algorithm as evidenced by the colored maps thereafter.
With the same experiment setup and operating environment, the spatial clustering experiments are performed over the second dataset. The results of the second dataset are shown in Figure 13, where (i) represents the spatial clustering result and (ii) represents the corresponding spatial group by using (a) EM method, (b) KM method, (c) HC method, (d) XM method, and (e) DBScan method.

In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM, however, avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we remove the empty cells at the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap and the clusters are balanced between each other. But there is still overlap in the spatial groups. Thus the LP method is adopted to solve this problem while keeping the same size of groups. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).

By visually comparing the clustering results of the two datasets, the clustering results seem to be similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to data distribution and balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little far apart when compared to those in the first dataset.
Based on the results generated from the clustering and LP experiments, some statistical information of dataset 2 is collected and shown in Table 2. The numeric results in
Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.
Table 2 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the cell numbers covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.
6. Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. To assess the quality of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion using the same software on the same computer. Balance is used to measure the sizes of groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests for goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of traffic volumes that are covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial group by the LP method on dataset 2.
traffic volumes that are covered by all the clusters minus the overlap, if any. The corresponding definitions are shown in the equations below:

$$\text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)},$$

$$\text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Traffic Volumes}},$$

$$\text{Total Coverage} = \sum_{i} \text{Coverage}(\text{cluster } i) - \text{Overlaps},$$

$$\text{Proportion of Cluster } i \text{ Size (Balance)} = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}}. \qquad (4)$$
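The definitions in (4) translate directly into code. The sketch below is ours: `volumes` holds one traffic volume per grid cell and `assignment` the cluster index per cell (any other value meaning unassigned).

```python
def cluster_metrics(volumes, assignment, k):
    """Per-cluster density, coverage and balance as defined in (4)."""
    total_volume = sum(volumes)
    n_cells = len(volumes)
    metrics = []
    for c in range(k):
        vols = [v for v, a in zip(volumes, assignment) if a == c]
        metrics.append({
            "density": sum(vols) / len(vols) if vols else 0.0,  # volume per grid cell
            "coverage": sum(vols) / total_volume,               # share of all traffic
            "balance": len(vols) / n_cells,                     # share of all cells
        })
    return metrics
```

Summing the coverage entries and subtracting any overlapping volume gives the total coverage figure reported in Tables 8 and 9.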
6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of datasets on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every 4 x 4 neighborhood of cells over a grid is merged into a single unit; and RasterP (25 grids) means every 5 x 5 neighborhood is merged into one. In the latter two formats, the data information is laid straightforwardly on a grid, and some noises such as outlier values are eliminated from the grid. We selected grids of sizes 16 and 25 for the two formats. The original datasets are then encoded by the four different data formatting types, and the four formatted datasets are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.
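The RasterP merging step can be sketched as block aggregation. This is an illustration under our reading of the format, namely square b x b blocks (b = 4 for 16 cells, b = 5 for 25 cells) whose values are summed:

```python
def coarsen(grid, b):
    """Merge each b-by-b block of grid cells into a single unit by summing."""
    nx, ny = len(grid), len(grid[0])
    return [[sum(grid[i + di][j + dj] for di in range(b) for dj in range(b))
             for j in range(0, ny - b + 1, b)]
            for i in range(0, nx - b + 1, b)]
```

Coarsening shrinks the dataset by a factor of b squared, which is why the RasterP runs in Tables 3 and 5 are so much faster than the full Raster runs.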
According to Table 3, we can see that KM spent the least running time for the four different kinds of data, and the runtime on the RasterP (25 grids) dataset is the fastest. Contrariwise, clustering of the vector dataset using the DBScan method spent the longest running time. Among the clustering methods, KM spent the least time for the different datasets and DBScan took the longest.
In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best one, but clustering of RasterP (25 grids) using DBScan is the worst one.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset by expanding the data map via duplication. The running time trends thereby produced are shown in Table 7, and the corresponding trend lines are shown in Figure 14.
According to Table 5, we can see that KM spent the shortest running time for the four different formats of data, and the time on the RasterP (25 grids) dataset is the fastest, which is expected because it abstracts every 25 cells into one. On
Figure 14: Comparison of running time (in seconds) of different sizes of dataset for K-means, Hierarchical, DBScan, XMean, EM, and LP, with exponential trend lines for each method.
the other hand, clustering of the Raster dataset using the DBScan method spent the most running time. Among the six methods, KM spent the shortest time for the different datasets and DBScan generally spent the longest.
In Table 6, we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best one, but clustering of RasterP (25 grids) using KM is the worst one.
In Table 7, we can see that the slowest is DBScan and the quickest is the KM method. In terms of time trend, DBScan increases in time consumption with a larger magnitude than the other methods, while the time trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.
From Figure 15, we can see that one cluster of DBScan dominates the biggest coverage among all clusters resulting from the six methods on the first dataset, but for the second dataset the LP method yields the biggest-coverage cluster. Generally, the individual coverage of each cluster in the second dataset is apparently larger than that resulting from the first dataset (Tables 8 and 9). This means that the second dataset is suitable for achieving spatial groups with the six methods due to its even data distribution. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method to determine spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.042721   0.001777   0.450720   0.022150   0.013153   0.165305
Cluster 1        0.094175   0.086211   0.008018   0.010064   0.026016   0.127705
Cluster 2        0.328026   0.032893   0.010517   0.126953   0.124360   0.095597
Cluster 3        0.022797   0.351221   0.000501   0.311761   0.001172   0.089008
Cluster 4        0.062281   0.101199   0.000244   0.112973   0.304300   0.122085
Total coverage   0.550000   0.573301   0.470000   0.583900   0.469000   0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a), we can see that one cluster of EM occupies the biggest density among all clusters of the six methods on the first dataset, but the LP method obtains the largest total density evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than that of the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best results on the second dataset; DBScan has an advantage in merging scattered data into dense groups as long as the data are well scattered.
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density         KM        EM        DBScan    XM        HC        LP
Cluster 0       5258648   0080823   4426289   3431892   2713810   1677869
Cluster 1       1161390   2329182   0994949   1375497   3501739   1296230
Cluster 2       7186556   2545750   0807500   1218667   2728017   9703279
Cluster 3       2572683   1232386   1062069   5171040   4265905   9034426
Cluster 4       5969350   142054    0170455   1510576   4088438   1239180
Total density   1204343   1400359   4729787   1146972   1030703   6087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density         KM        XM           EM           DBScan       HC           LP
Cluster 0       1925445   2476642081   396813638    1972394643   5323785326   331318
Cluster 1       1972395   1763496208   1502698729   1972394643   2140482869   166788
Cluster 2       1408149   106489095    1629795665   1437189548   1823821619   8097989
Cluster 3       3060449   6293956697   2015105986   1636350955   79912225     2474492
Cluster 4       1773937   1058346213   1275299493   1212317249   6856982634   156958
Total density   3896873   3486653421   6819713511   8230647036   5981503534   5440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell number between clusters. Meanwhile, we assign each of them a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is a priority and the others are of less concern, ω_c can take a relatively large value or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, which is the sum of all factors multiplied by the corresponding performance indicators, is a net indicator signifying how good a clustering process is by considering all the performance attributes:
$$G_l = \left| \frac{\text{Likelihood}}{\text{Time}} \right| \qquad (5)$$

$$G_b = \frac{\text{Difference of Balance}}{\text{Time}} \qquad (6)$$

$$G_d = \frac{\text{Density}}{\text{Time}} \qquad (7)$$

$$G_c = \frac{\text{Coverage}}{\text{Time}} \qquad (8)$$

$$G_o = \frac{\text{Overlap}}{\text{Time}} \qquad (9)$$

$$G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o \qquad (10)$$

$$\text{Constraint: } \omega_l + \omega_b + \omega_d + \omega_c + \omega_o = 1. \qquad (11)$$
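Computing (10) under constraint (11) is a weighted sum. The sketch below uses our own function name and illustrative indicator values; equal weights correspond to the setting used for Table 13.

```python
def g_net(indicators, weights):
    """Net performance score G_net: weighted sum of the factor indicators.

    The weights must satisfy constraint (11), i.e. sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[f] * indicators[f] for f in weights)
```

Raising one weight (e.g. the coverage weight ω_c) at the expense of the others shifts G_net toward methods that excel in that single factor, which is exactly the user-tunable trade-off discussed above.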
From the results of spatial grouping as experimented in the previous sections, we obtain some statistical information on each group based on the second dataset, as a range of indicators depicted in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best running time and no overlap, while the XM, DBScan, and HC methods demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods to the base value 1; the G_net values of the other methods are then scaled up accordingly. The comparison result is shown in Table 13.
According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This is tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the results are not so consistent, as LP would be outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which imply priorities or preferences on the performance aspects, should be chosen at the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently contain spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density   Time    Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3896873   0.41    -17.35           No        190
XM       0.533037   3486653   0.67    -17.22           No        185
EM       0.507794   6819714   1.23    -16.57           Yes       1216
DBScan   0.461531   8230647   15.67   -17.54           Yes       2517
HC       0.677124   5981504   14.78   -20.13           Yes       103
LP       0.711025   5440447   7.76    N/A              No        0
Table 13: Comparison of different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32
purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, with certain sizes and positions, using clustering algorithms or the equivalent, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have different usage demands; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory); the performance under proper factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, there has been no study reported in the literature, that the authors are aware of, which uses a linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to enhance the algorithm further, for example, by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It would be good if the advantages of one algorithm could carry over to the others in the new fusion algorithms to be developed.
References
[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.
[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.
[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.
[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.
[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.
[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.
[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.
[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.
[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.
[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.
[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.
[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.
[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.
[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.
[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.
[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.
[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.
[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.
[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.
[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.
[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.
[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.
[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.
[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.
Nevertheless, the spatial groups obtained by classic algorithms have certain limitations: overlaps cannot be controlled, and the maximum coverage by the resultant groups is not guaranteed. Overlaps lead to resource waste and potentially resource mismatch. Besides spatial clustering, this situation occurs in other fields of application, such as information retrieval (several themes for a single document), biological data (several metabolic functions for one gene), and military applications (discovering object-dense regions independently). However, there has been no study reported in the literature, that the authors are aware of, that applies an LP method to discover spatial groups free of the limitations inherited from clustering algorithms. Thus, this research provides an alternative method to achieve spatial groups for maximum coverage in a real environment. Maximum coverage in this context is defined as the greatest possible area of effect covered by the spatial groups, with no or minimum overlaps among the groups.
3. Spatial Data Representation
Two main categories of spatial data representation exist: spatial data and attribute data. Spatial data means georeferenced data, such as maps, photographs, and satellite imageries. Though these representation techniques originated from GIS, the underlying coding formats are common to those for wireless sensor networks, as long as the sensors are distributed over a wide spatial area. Generally, spatial data represents geographic features in complete and relative locations. Attribute data represents the characteristics of the spatial features, which can be quantitative and/or qualitative in the real world. Attribute data is often referred to as tabular data. In our experiments, we test both types of data models against different clustering algorithms for a thorough investigation.
3.1. Spatial Data Model. In the early days, spatial data was stored and represented in map format. There are three fundamental types of spatial data models for recording geographic data digitally: vector, raster, and image.
Figure 1 illustrates the encoding techniques of the two important spatial data models [19], raster and vector, over a sample aerial image of the Adriatic Sea and coast in Italy. The image type of encoding is very similar to raster data in terms of usable techniques, but it has limited internal formats when it comes to modeling and analysis of the data. Images represent photographs or pictures of the landscape as a coarse matrix of pixel values.
3.2. Vector Data Model. The three kinds of aforementioned spatial data models are used to store geographic locations with spatial features in a dataset. The vector data model uses x, y coordinates to define the locations of features, thereby marking points, lines, areas, or polygons. Vector data thus tend to define centers, edges, and outlines of features. It characterizes features by linear segments using sequential points or vertices, where a vertex consists of a pair of x and y coordinates. Each vertex defines the beginning or ending node of an arc segment. A single coordinate pair of vertices defines a feature point, and a group of coordinate pairs defines polygonal features. In vector representation, the storage of the vertices for each feature is important, as well as the connectivity between features and the sharing of common vertices where features connect.
By using equally sized polygons, we divide a complete map into small units based on the character of our database, each represented as (x, y, v), where x and y form a coordinate pair that references the spatial position and v represents something of interest, simply called a "feature," which could be a military target, a critical resource, or just an inhabitant clan, for example. The greater the v, the more valuable the feature is. In spatial grouping for maximum coverage, we opt to include those features that amount to the highest total value. A sample of the vector format that represents a spatial location in 2D is shown in Figure 2 [19].
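As a minimal sketch (the feature values below are invented for illustration, not taken from the paper's datasets), the (x, y, v) representation and the preference for high total value can be expressed as:

```python
# Hypothetical (x, y, v) features: a position plus a value of interest v.
features = [
    (0, 0, 12),  # e.g., an inhabitant clan valued at 12
    (3, 1, 80),  # e.g., a critical resource
    (5, 4, 45),
    (7, 2, 74),
]

def total_value(selected):
    """Sum the v component over selected (x, y, v) features."""
    return sum(v for _, _, v in selected)

# Prefer the features that amount to the highest total value.
best_two = sorted(features, key=lambda f: f[2], reverse=True)[:2]
print(total_value(best_two))  # 80 + 74 = 154
```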
3.3. Raster Data Model. Raster data models make use of a grid of squares to define where features are located. These squares, which are also called pixels or cells, typically are of uniform size.
From our dataset, we separate the whole image by imposing a grid on it, hence producing many individual features, one feature corresponding to each cell. We consider using the raster data model to represent the dataset, and we store the features in the following different encoding formats.
(1) Raster data are stored as an ordered list of cell values in pairs of (i, v), where i is the sequential number of the cell index and v is the value of the ith feature, for example, (1, 80), (2, 80), (3, 74), (4, 62), (5, 45), and so on, as shown in Figure 3.
(2) Raster data are stored as points (x, y, v), with x and y as position coordinates locating the corresponding spatial feature with value v, for example, (1, 1, 513), (1, 2, 514), (1, 3, 517), (2, 1, 512), (2, 2, 515), and so on, as shown in Figure 4. In this case, the value v refers to the center point of the grid cell. This encoding is useful for representing values measured at the center point of the cell, for example, a raster of elevation.
(3) During the experiment, the grid size is transformed for efficient operation, so we put i² cells together as one unit representing one new grid cell, as shown in Figure 5.
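The two raster encodings and the i²-cell aggregation can be sketched as follows; this is a rough illustration using the cell values of Figure 3, and the rule of summing merged cells is our assumption:

```python
import numpy as np

# Raster values from Figure 3.
raster = np.array([
    [80, 74, 62, 45, 45, 34, 39, 56],
    [80, 74, 74, 62, 45, 34, 39, 56],
    [74, 74, 62, 62, 45, 34, 39, 39],
    [62, 62, 45, 45, 34, 34, 34, 39],
    [45, 45, 45, 34, 34, 30, 34, 39],
])

# Encoding (1): ordered list of (i, v) pairs, indexing cells column by column.
ordered = [(i + 1, v) for i, v in enumerate(raster.flatten(order="F"))]

# Encoding (2): (x, y, v) points addressing each cell explicitly.
points = [(x + 1, y + 1, int(raster[x, y]))
          for x in range(raster.shape[0]) for y in range(raster.shape[1])]

def coarsen(grid, s):
    """Merge s x s cells into one new grid cell (aggregation by sum assumed)."""
    h, w = (grid.shape[0] // s) * s, (grid.shape[1] // s) * s
    g = grid[:h, :w]
    return g.reshape(h // s, s, w // s, s).sum(axis=(1, 3))

print(ordered[:5])         # [(1, 80), (2, 80), (3, 74), (4, 62), (5, 45)]
print(coarsen(raster, 2))  # a 2 x 4 grid of merged cell values
```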
In particular, the quadtree data structure is found to be useful for storing the data as an alternative encoding method for the raster data model. Raster embraces digital aerial photographs, imagery from satellites, digital pictures, and even scanned maps. Details on how different sorts of objects, like point, line, polygon, and terrain, are represented by the data models can be found in [19–21].
4. Proposed Methodology
The aim of the methodology is to determine a certain number of clusters and their corresponding locations from some
Figure 1: Representation of how a real-world spatial area is represented by vector and raster encoding formats.
Figure 2: Vector format (the marked point has x, y coordinates (9, 3)).
80 74 62 45 45 34 39 56
80 74 74 62 45 34 39 56
74 74 62 62 45 34 39 39
62 62 45 45 34 34 34 39
45 45 45 34 34 30 34 39

Figure 3: Raster format in ordered list.
collected spatial data. In this process, different methods are tested, and the one that covers the most area as well as the highest feature values among the suggested clusters is chosen. The flow of this process, including preprocessing of sensor data,
Figure 4: Raster data with center-point values.
Figure 5: Raster format with 2² and 3² grids.
data transformation, clustering, and finding cluster center points, is shown in Figure 6.
In the case of a satellite image, or an image captured by a fighter jet or other surveillance camera, image processing is needed to
Figure 6: Workflow of the proposed methodology. Preprocessing of image: load spatial image; convert the RGB image to a gray image; skeleton extraction (a morphological operation in MATLAB (bwmorph) and Zhang's algorithm are used for comparison); two-tone image. Data transformation: gridding/indexing the image; indexed grid image; numerical dataset (with normalization); 2D spatial data. Grouping: spatial grouping by hierarchical, K-means, DBScan, and LP methods; method comparison. Display: color map output.
extract the density information from the pictures. But in our case of a sensor network, we can safely assume that the data fed from a net of sensors would have the sensor ID attached. The sensor IDs are known, and so are their positions. From the locations of the sensors and their sensor IDs, we can relate the collected data to their corresponding locations in the x-y format of coordinates (assuming the terrain is 2D). In order to reduce the huge amount of calculation and storage space, a grid was used to divide the whole map into smaller pieces. The grid indexing operation is repeated for a range of coarseness layers, thereby providing different resolutions of data partitions. A similar technique, computed by Euclidean distance, is reported in [22]. Obviously, the method of grid indexing helps separate data into cells based on their geographic locations.
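A grid-indexing step of this kind can be sketched as follows; this is a simplified illustration in which the cell size and the sample readings are invented:

```python
from collections import defaultdict

def grid_index(readings, cell_size):
    """Bin sensor readings into grid cells by integer-dividing coordinates.

    readings: iterable of (x, y, value) tuples from identified sensors.
    Returns a dict mapping (column, row) cell indices to summed values."""
    cells = defaultdict(float)
    for x, y, v in readings:
        cells[(int(x // cell_size), int(y // cell_size))] += v
    return dict(cells)

readings = [(0.5, 0.5, 10), (1.4, 0.2, 5), (3.7, 3.9, 7)]
coarse = grid_index(readings, 2.0)  # one resolution layer
fine = grid_index(readings, 1.0)    # a finer layer over the same data
print(coarse)  # {(0, 0): 15.0, (1, 1): 7.0}
```

Repeating `grid_index` with different `cell_size` values yields the multiresolution partitions described above.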
To obtain a better result of spatial groups for maximum coverage and their corresponding cluster center points under certain constraints, this research adopts several popular clustering methods and a linear programming method, using software programs such as XLMiner (http://www.solver.com/xlminer-data-mining), MATLAB (http://www.mathworks.com/products/matlab), and Weka (http://www.cs.waikato.ac.nz/ml/weka).
The core purpose of cluster analysis is to comprehend and to distinguish the extent of similarity or dissimilarity among the independently clustered objects. There are five major methods of clustering: KM, EM, XM, HC, and DBScan.
K-means (KM), by MacQueen, 1967, is one of the simplest algorithms that solve the well-known clustering problem [23]. It is an easy and simple method to divide a dataset into a certain number of clusters; the main idea is to initially assume that the number of clusters is k, fixed a priori for each cluster. The random choice of the initial location of centroids leads to various results; a better choice is to place them as far away from each other as possible.
The KM algorithm aims at minimizing an objective function; in this case, a squared-error function:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2, \tag{1}$$

where $\|x_i^{(j)} - c_j\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster center $c_j$; it is an indicator of the distance of the $n$ data points from their respective cluster centers. The sum of distances, or sum of squared Euclidean distances, from the mean of each cluster is a quite usual measure of scattering in all directions in the cluster and is used to test the suitability of the KM algorithm. Clusters are often computed using a fast heuristic method, which generally produces good (but not necessarily optimal) solutions.
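A plain NumPy sketch of this heuristic minimization (Lloyd-style iterations; the sample points below are invented) could look like:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Heuristically minimize the squared-error objective J of equation (1)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every point to every center.
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    J = ((points - centers[labels]) ** 2).sum()
    return labels, centers, J

pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers, J = kmeans(pts, k=2)
print(J)  # 1.0: each point ends up 0.5 away from its cluster center
```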
X-means (XM) [24] is an optimized variant of KM that improves the structure part of the algorithm. Division of the centers is attempted within each region, and a decision is made between the root and the children of each center by comparing the two structures. Another improved variant of KM, called EM (expectation maximization), assigns to each point a probability distribution that represents its cluster membership probability. The number of clusters to be set up is decided by EM using cross-validation.
Density-based algorithms regard clusters as dense areas of objects that are separated by less dense areas [25]. Because they are not limited to looking for clusters of spherical shape, they can produce clusters of arbitrary shape. DBScan is a typical implementation of density-based algorithms, called density-based spatial clustering of applications with noise
Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.
[25]. The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density connected to a particular center. Any object that falls beyond a cluster is considered noise.
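A compact sketch of this density-connected expansion follows; it is not the authors' implementation, and the eps and min_pts values are arbitrary:

```python
import numpy as np

def dbscan(pts, eps, min_pts):
    """Tiny DBSCAN: grow clusters from core points via density reachability.

    Returns one label per point; -1 marks noise (objects beyond any cluster)."""
    n = len(pts)
    labels = np.full(n, -1)
    dist = np.sqrt(((pts[:, None] - pts[None, :]) ** 2).sum(axis=2))
    nbrs = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(nbrs[i]) < min_pts:
            continue  # already clustered, or not a core point
        labels[i] = cluster
        queue = list(nbrs[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(nbrs[j]) >= min_pts:  # j is a core point: keep growing
                    queue.extend(nbrs[j])
        cluster += 1
    return labels

pts = np.array([[0, 0], [0, 0.5], [0.5, 0], [5, 5], [5, 5.5], [5.5, 5], [10, 10]])
print(dbscan(pts, eps=1.0, min_pts=3))  # [0 0 0 1 1 1 -1]
```

The two dense groups become clusters of arbitrary shape, while the isolated point at (10, 10) is left as noise.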
Ward proposed a clustering method called hierarchical clustering (HC) in 1963 [27]. It seeks to form the partitions P_n, P_{n−1}, ..., P_1 in a way that minimizes the loss of information associated with each grouping. In each analysis step, it considers every possible pair of clusters in the group and combines the two clusters whose merger yields the smallest "information loss," which Ward defines in terms of ESS (an error sum-of-squares criterion). The idea that supports Ward's proposal can be described most simply by thinking of a single small dataset. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0) as an example. The loss of information incurred by treating the ten scores as one unit, with a mean of 3.4, is calculated as

ESS_one group = (2 − 3.4)² + (7 − 3.4)² + ⋯ + (0 − 3.4)² = 70.4.

However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. Evaluating the ESS as a sum of squares then gives four independent error sums, one per group. Overall, the result of dividing the 10 objects into 4 clusters has no loss of information:

ESS_four groups = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0. (2)
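Ward's information-loss computation can be checked in a few lines (using the scores from the example above):

```python
def ess(scores):
    """Error sum of squares: squared deviations from the group mean."""
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores)

scores = [2, 7, 6, 6, 7, 2, 2, 0, 2, 0]
one_group = ess(scores)  # 70.4 when all ten objects form a single group
groups = [[0, 0], [2, 2, 2, 2], [6, 6], [7, 7]]
four_groups = sum(ess(g) for g in groups)  # 0: identical scores in each group
print(one_group, four_groups)
```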
The last method we adopted is linear programming (LP), which consists of formulating and solving optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many options are possible in the answers. In spatial grouping over a large grid, many possible combinations of positioning the clusters exist. The problem here is to find a certain number of clusters of equal size over the area, while the chosen centers of the clusters must be sufficiently distant from each other so as to avoid overlapping. As an example, shown in Figure 7, three clusters have to be assigned over a spatial area in a way that covers certain resources, and the assignment has to yield a maximum total value summed from the covered resources. The left diagram shows allocating three clusters over deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, clusters the high-altitude parts of the area. The last example tries to cover the maximum number of human inhabitants, who are concentrated at the coves. Given the many possible ways of setting up these clusters, LP is used to formulate this allocation problem with the objective of maximizing the values of the covered resources.

Assuming that the resources could be dynamic, for example, animal herds or moving targets whose positions may swarm and change over time, the optimization is a typical maximal-flow problem (or max-flow problem). It is a type of network flow problem in which the goal is to determine the maximum amount of flow that can occur over an arc that is limited by some capacity restriction. This type of network might be used to model the flow of oil in a pipeline (in which the amount of oil that can flow through a pipe in a unit of time is limited by the diameter of the pipe). Traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by the number of lanes in the streets and speed limits [28].
For our spatial clustering, we consider each cell of the grid as a node; each node is defined as a tuple m that contains the coordinates and the value of the resource held in the node, such that m(x_i, y_i, z_i) represents the ith node, in which x_i, y_i represent the position and z_i represents the value of the resource in the node, respectively. For the clusters, each node
(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)   Repeat (through all coordinates of y)
(4)     If (boundary constraints and overlapping constraints are satisfied) Then
(5)       S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)     End-if
(7)   End-loop
(8) End-loop
(9) If size-of(S) ≥ K
(10)   Find top K clusters where max Σ z_i ⊕ C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i), ∀i ∈ C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i), ∀i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.
can potentially be the center of a cluster, and each cluster has a fixed radius of length r. The LP model for our problem is mathematically shown as follows:

$$\text{Total value} = \bigcup_{\text{selected clusters } \langle C_k \mid k = 1 \cdots K \rangle} \sum_{m_i \in C_k} m_i(\ast, \ast, z_i) = \arg\max_{X,Y} \sum_{\substack{0 \le x_i \le X \\ 0 \le y_j \le Y}} \sum_{k=1}^{K} z_l \ni m_l(x_i, y_j, z_l) \oplus c_k, \tag{3}$$

subject to the boundary constraints $2r \le |x_i - x_j|$ and $2r \le |y_i - y_j|$ for all $i$ and $j$, $i \ne j$, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively; K is the maximum number of clusters; and $c_k$ is the kth cluster under consideration in the optimization.

In order to implement the computation depicted in (3), for each node we sum the resources of the group in the shape of a diamond (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each node in the current combination is tested by considering it as the center of a cluster of radius r, and the resource values of the nodes of the potential clusters are stored in a temporary array buffer A(∗, ∗, z_i). The results from those potential clusters that satisfy the boundary and nonoverlapping constraints are then copied to a candidate buffer S. Out of the clusters whose resource values are stored in the candidate buffer S, the combination of K clusters that has the greatest total resource value is selected, and its values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
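A brute-force rendering of this search on a toy grid can be sketched as follows; the grid values, radius, and the Manhattan-distance non-overlap test are our assumptions:

```python
from itertools import combinations

# Toy grid of resource values z at each (x, y) node.
GRID = [
    [1, 2, 1, 0, 0, 4],
    [2, 9, 2, 0, 5, 9],
    [1, 2, 1, 0, 0, 4],
]

def diamond_value(grid, cx, cy, r):
    """Sum resources within Manhattan distance r of (cx, cy): a diamond that
    geometrically approximates a circle of radius r."""
    return sum(grid[x][y]
               for x in range(len(grid)) for y in range(len(grid[0]))
               if abs(x - cx) + abs(y - cy) <= r)

def best_clusters(grid, k, r):
    """Try every combination of k centers; keep the non-overlapping one with
    the greatest total value (diamonds are disjoint when their centers are
    more than 2r apart in Manhattan distance)."""
    cells = [(x, y) for x in range(len(grid)) for y in range(len(grid[0]))]
    best, best_val = None, -1
    for combo in combinations(cells, k):
        if all(abs(ax - bx) + abs(ay - by) > 2 * r
               for (ax, ay), (bx, by) in combinations(combo, 2)):
            val = sum(diamond_value(grid, x, y, r) for x, y in combo)
            if val > best_val:
                best, best_val = combo, val
    return best, best_val

centers, value = best_clusters(GRID, k=2, r=1)
print(centers, value)  # ((1, 1), (1, 5)) 39
```

This exhaustive scan grows combinatorially with the grid size and K, which is one reason the grid coarsening of Section 3 matters for keeping the computation tractable.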
5. Experimental Results and Analysis
In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place. The resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point
Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                      Bwmorph function          Thinning algorithm
                      Dataset 1   Dataset 2     Dataset 1   Dataset 2
Degree of thinning    Incomplete                Complete
Elapsed time (secs)   20          38            100         198
Complexity            O(n)                      O(n²)
of the roads; thereby, a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.
5.1. Data Preprocessing. Two factual datasets are used for the experiments. The first dataset, published by the Maricopa Association of Governments in 2008, is a traffic volume map. Traffic volumes were derived from the national traffic recording devices, and seasonal variation is factored into the volumes. The second dataset is the annual average daily traffic of the Baltimore County Traffic Volume Map in 2011, USA, prepared by the Maryland Department of Transportation and published on March 19, 2012. The traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.
After using skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites where they can see the traffic volume data associated with our two datasets: (a) representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (http://phoenix.gov/streets/trafficvolume-map); (b) representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (http://www.marylandroads.com/Traffic_Volume_Maps/Traffic_Volume_Maps.pdf). The corresponding result of skeleton extraction
Table 2: Important statistics from the clustering and LP experiments.

Method   Cluster     Number of cells covered   Minimum    Maximum     Overlap
KM       Cluster 1   428                       0          3499327     0
         Cluster 2   468                       0          546896      0
         Cluster 3   448                       0          20503007    0
         Cluster 4   614                       0          6894667     0
         Cluster 5   618                       0          900908      0
XM       Cluster 1   615                       0          591265      0
         Cluster 2   457                       0          546896      0
         Cluster 3   609                       0          900908      0
         Cluster 4   465                       0          3499327     0
         Cluster 5   430                       0          20503007    0
EM       Cluster 1   1223                      0          2292        61817229
         Cluster 2   7                         141048     243705      313018
         Cluster 3   81                        0          3033733     131146577
         Cluster 4   64                        26752      546896      330881249
         Cluster 5   1201                      0          1300026     217950471
DBScan   Cluster 1   13                        23614      33146       327222911
         Cluster 2   11                        1686825    21001       363965818
         Cluster 3   13                        178888     2945283     196118393
         Cluster 4   11                        847733     211008      58940877
         Cluster 5   2528                      0          546896      20554176
HC       Cluster 1   291                       0          3499327     0
         Cluster 2   191                       0          20503007    96762283
         Cluster 3   294                       0          1590971     0
         Cluster 4   224                       0          189812      12673555
         Cluster 5   243                       0          546896      0
LP       Cluster 1   221                       0          3499327     0
         Cluster 2   221                       0          20503007    0
         Cluster 3   221                       0          1590971     0
         Cluster 4   221                       0          189812      0
         Cluster 5   221                       0          546896      0
Table 3: Comparison of running time (in seconds) for the first dataset.

Formats              KM     HC      DBscan   XM     EM     LP
Vector database      3.27   12.52   23.24    2.78   9.30   1.83
Raster database      3.42   15.36   28.20    2.84   9.84   2.01
RasterP (16 grids)   1.98   1.34    5.08     0.46   0.57   0.78
RasterP (25 grids)   0.09   0.14    1.15     0.21   0.12   0.53
in dataset 1 is shown in Figure 8, where (a) adopted a morphological operation method and (b) adopted the thinning algorithm, respectively. Likewise, the corresponding result of skeleton extraction for the second dataset is shown in Figure 9, where (a) adopted a morphological operation method and (b) adopted the thinning algorithm, respectively. The comparison result for the two datasets is shown in Table 1.

For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.

The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing. The clustering by grid can then be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.

The choice of placing a grid on the image follows one principle: mesh segmentation should not fall on a concentrated position of traffic flow. Since there is no endpoint, the midpoint between two adjacent values was considered a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. The digital data for the traffic map serves as the initial data for the subsequent clustering process.
5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum number of iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,
International Journal of Distributed Sensor Networks 9
Table 4: Comparison of log-likelihood for the first dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     −12.41868  −14.07265  −13.28599  −11.9533   −12.49562
Raster database     −13.42238  −15.02863  −13.78889  −12.9632   −13.39769
RasterP (16 grids)  −12.62264  −14.02266  −12.48583  −12.39419  −12.44993
RasterP (25 grids)  −12.41868  −13.19417  −11.22207  −12.48201  −11.62048
Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.
Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.
Table 5: Comparison of running time (in seconds) for the second dataset.

Formats             KM    HC     DBScan  XM    EM     LP
Vector database     1.39  1.34   15.53   1.53  10.05  3.37
Raster database     2.41  14.78  18.34   2.17  8.23   1.96
RasterP (16 grids)  0.47  8.01   12.74   0.45  3.77   1.44
RasterP (25 grids)  0.35  6.20   10.98   0.36  2.96   1.18
the weights for the corresponding three attributes (x, y, v) of each grid (g_i = (x_i, y_i, v_i)) could be varied (fine-tuned), with the constraint that the weights must sum to 1. We tested several variations searching for the best clustering results: (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) weight of v is 0; (8) same weight except when g_i(v_i = 0); and (9) weights of x and y are both 0 except when g_i(v_i = 0).
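Attribute weighting of this kind can be emulated by scaling the normalized columns before running K-means, since scaling a feature by the square root of its weight turns squared Euclidean distance into the weighted distance. A minimal sketch with scikit-learn (the grid data here are synthetic, and the weighting trick is our illustration, not necessarily XLMiner's internal mechanism):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
grid = rng.uniform(0, 1, size=(100, 3))  # columns: x, y, v (already normalized)

def weighted_kmeans(data, weights, k=5, seed=0):
    """Scale each attribute by sqrt(weight) so that squared Euclidean
    distance becomes the weighted distance, then run K-means."""
    w = np.asarray(weights, dtype=float)
    assert abs(w.sum() - 1.0) < 1e-9  # weights must sum to 1
    scaled = data * np.sqrt(w)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)

labels = weighted_kmeans(grid, [0.3, 0.3, 0.4])  # case (2): weight of v is 40%
print(len(set(labels.tolist())))
```

Setting the weight of v to 0, as in case (7), makes the clustering purely positional.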
In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted to measure raw numeric data, while the other two options, Jaccard's coefficients and the matching coefficient, are activated only when the data are binary.
For the above nine cases, the results of cases (1) to (6) are similar within their respective methods, and the result of case (9) is the worst, failing to accomplish any clustering. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.
Regarding the distribution of clusters in the result of the KM clustering method, more than half of the data points are clamped into one oversized cluster; the result of this method is therefore not helpful for further operation. With the HC method, data are on average allocated into separate clusters. The result
Table 6: Comparison of log-likelihood for the second dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     −17.35412  −19.62367  −17.53576  −17.21513  −16.57263
Raster database     −18.15926  −20.12568  −19.70756  −18.15791  −18.48209
RasterP (16 grids)  −15.51437  −17.24736  −16.37147  −17.01283  −15.66231
RasterP (25 grids)  −14.84761  −16.63789  −15.09146  −16.67312  −16.47823
Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%; top half uses the KM clustering method and bottom half uses the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%; top half uses KM and bottom half uses HC. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0; top half uses KM and bottom half uses HC. (d) Clustering results for the first dataset with setting case (8), where all share the same weight except g_i(v_i = 0); top half uses KM and bottom half uses HC.
Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size      KM     HC      DBScan  XM    EM      LP
100 grid cells    0.06   0.07    1.05    2.19  3.21    0.18
4600 grid cells   0.42   2.95    39.89   2.73  19.05   9.37
10000 grid cells  2.62   46.67   97.55   2.97  37.85   24.21
80000 grid cells  19.75  189.61  684     6.47  198.31  90.83
in Figure 10(c) is the best, showing only the one with distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; therefore, allocation of critical resources in each cluster may result in a waste of resources. The degree of overlap is the least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.
The clustering results of the second dataset obtained by using the two methods, KM and HC, are shown in Figure 11.
From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than that of the first dataset. There is also no overlap phenomenon in the KM results, which is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better of the two clustering methods in consideration of even cluster distribution and overlap avoidance.
5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen at five. The
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.124360  0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.3043    0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469     0.711025
Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.
result of the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result; the second part (ii) visualizes the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and then the groups are visualized over the clustering results according to the computed centers.
In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This result reveals that the technique cannot organize a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups exhibit the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters seem similar to each other, and there is no overlap in the clustering result; but for the group result, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (d) XM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.
With the same experiment setup and operating environment, the spatial clustering experiments are performed over the second dataset. The results of the second dataset are shown in Figure 13, where (i) represents the spatial clustering result and (ii) represents the corresponding spatial group by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.
In Figures 13(a) and 13(e), it is noticeable that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM, however, avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we remove the empty cells on the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap and the clusters are balanced with each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
Visually comparing the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those in the first dataset.
Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected and shown in Table 2. The numeric results in
Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.
Table 3 supports the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the cell numbers covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.
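The LP requirement of equal-size, overlap-free groups can be approximated as a balanced assignment problem. The following is a minimal sketch, not the paper's actual LP formulation, using SciPy's linear_sum_assignment and assuming group centers are already known (e.g., from a prior clustering run); all coordinates here are synthetic.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
cells = rng.uniform(0, 10, size=(25, 2))    # 25 grid-cell centroids
centers = rng.uniform(0, 10, size=(5, 2))   # 5 tentative group centers
k, n = len(centers), len(cells)
slots_per_group = n // k                    # 5 cells per group -> perfect balance

# Replicate each group center into equal "slots"; assigning exactly one cell
# per slot forces every group to the same size and forbids overlap by construction.
slot_centers = np.repeat(centers, slots_per_group, axis=0)
cost = np.linalg.norm(cells[:, None, :] - slot_centers[None, :, :], axis=2)
row, col = linear_sum_assignment(cost)      # minimize total cell-to-slot distance
group_of_cell = col[np.argsort(row)] // slots_per_group

print(np.bincount(group_of_cell))   # [5 5 5 5 5]
```

Each cell belongs to exactly one group, so overlap is zero and the balance differences between groups vanish, mirroring the LP columns in the tables above.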
6. Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. For the purpose of assessing the quality of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (short for time), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion, using the same software on the same computer. Balance measures the sizes of the groups: if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality; the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of traffic volumes that are covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial groups by the LP method on dataset 2.
traffic volumes that are covered by all the clusters, minus the overlap, if any. The corresponding definitions are shown in the equations below:

Density(cluster i) = Σ Traffic Volumes(cluster i) / Grid Cell Number(cluster i),

Coverage(cluster i) = Σ Traffic Volumes(cluster i) / Σ Traffic Volumes,

Total Coverage = Σ Traffic Volumes (all clusters) − Overlaps,

Proportion of Cluster(i) Size (Balance) = Grid Cell Number(cluster i) / Σ Grid Cell Number.    (4)
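The definitions in (4) translate directly into code. A minimal sketch, assuming the per-cell traffic volumes and cluster labels are given as arrays and using the total-volume denominator for coverage (as the prose definition states); the six-cell example is hypothetical:

```python
import numpy as np

def cluster_metrics(volumes, labels, overlaps=0.0):
    """Density, coverage, and balance per cluster, following Eq. (4).
    volumes: traffic volume per grid cell; labels: cluster id per cell."""
    total_volume = volumes.sum()
    total_cells = len(volumes)
    stats = {}
    for c in np.unique(labels):
        mask = labels == c
        stats[int(c)] = {
            "density": volumes[mask].sum() / mask.sum(),
            "coverage": volumes[mask].sum() / total_volume,
            "balance": mask.sum() / total_cells,
        }
    total_coverage = sum(s["coverage"] for s in stats.values()) - overlaps
    return stats, total_coverage

volumes = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
labels = np.array([0, 0, 1, 1, 2, 2])
stats, total = cluster_metrics(volumes, labels)
print(round(total, 3))  # 1.0 when the clusters cover everything with no overlap
```

With balanced clusters of two cells each, every balance value is 1/3, matching the LP ideal of equal proportions.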
6.2. Comparison Experimental Result. After conducting a number of experiment runs, we selected four different formats of datasets on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighboring cells over a grid are merged into a single unit, and RasterP (25 grids) means every five neighboring cells over a grid are merged into one. In these two RasterP formats, the data are laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated from the grid; we selected grid sizes of 16 and 25 for the two formats. The original datasets are then encoded in the four different data formatting types. The four formatted datasets are subjected to the five clustering methods and the LP method, and we measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively.
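The RasterP coarsening can be sketched as block aggregation with NumPy. This is an illustrative sketch only: it assumes "16 grids" and "25 grids" correspond to 4x4 and 5x5 blocks of cells and that merged cells are summed, since the paper does not state the exact merge rule.

```python
import numpy as np

def merge_blocks(grid, b):
    """Merge every b-by-b neighborhood of cells into one unit by summing,
    e.g. b=4 for RasterP (16 grids), b=5 for RasterP (25 grids)."""
    h, w = grid.shape
    h2, w2 = h - h % b, w - w % b           # crop edges that do not fill a block
    g = grid[:h2, :w2].reshape(h2 // b, b, w2 // b, b)
    return g.sum(axis=(1, 3))

grid = np.arange(100.0).reshape(10, 10)     # hypothetical 10x10 traffic raster
coarse = merge_blocks(grid, 5)
print(coarse.shape)  # (2, 2)
```

Coarsening shrinks the input by a factor of b squared, which is consistent with the much shorter running times reported for the RasterP formats.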
According to Table 3, KM spent the least running time on the four different kinds of data, and the runtime on the RasterP (25 grids) dataset was the fastest. Contrariwise, clustering of the vector dataset using the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the different datasets and DBScan took the longest.
In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using DBScan is the worst.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarged the dataset by expanding the data map via duplication. Running time trends are thereby produced; the result is shown in Table 7, and the corresponding trend lines are shown in Figure 14.
According to Table 5, KM spent the shortest running time for the four different formats of data, and the time on the RasterP (25 grids) dataset was the fastest, which is expected because it abstracts every 25 cells into one. On
Figure 14: Comparison of running time (in seconds) for different sizes of dataset.
the other hand, clustering of the Raster dataset using the DBScan method took the most running time. Across the six methods, KM spent the shortest time on the different datasets and DBScan generally spent the longest.
In Table 6, we can see that the log-likelihood values of the different methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using KM is the worst.
In Table 7, we can see that the slowest is DBScan and the quickest is the KM method. In terms of time trend, DBScan's time consumption grows in larger magnitude than that of the other methods, while the time trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
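Exponential trend lines of the kind plotted in Figure 14 can be fitted by ordinary least squares on log-transformed times. A minimal sketch with NumPy; the timing values below are illustrative stand-ins for one method's column, not the paper's exact measurements:

```python
import numpy as np

# Hypothetical (dataset size, running time) pairs in the spirit of Table 7
sizes = np.array([100.0, 4600.0, 10000.0, 80000.0])
times = np.array([1.05, 39.89, 97.55, 684.0])

# Exponential trend t = a * exp(b * n), fitted by linear regression on log t
b, log_a = np.polyfit(sizes, np.log(times), 1)
predict = lambda n: np.exp(log_a + b * n)

print(b > 0)  # prints True: the fitted trend is increasing
```

Comparing the fitted slope b across methods locates the crossover point where one method's trend line overtakes another's, as with HC and EM above.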
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.
From Figure 15, we can see that one cluster of DBScan dominates the biggest coverage among all clusters resulting from the six methods in the first dataset, but for the second dataset the LP method yields the biggest-coverage cluster. Generally, the individual coverage of each cluster in the second dataset is apparently larger than that resulting from the first dataset (Tables 8 and 9). This means that the second dataset is suitable for achieving spatial groups with the six methods due to its even data distribution. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method to determine spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a), we can see that one cluster of EM occupies the biggest density among all clusters of the six methods in the first dataset, but the LP method obtains the largest total density, drawn evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than that of the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer in the first dataset, but DBScan achieves the best result in the second dataset; DBScan has an advantage in merging scattered data into dense groups as long as the data are well scattered.
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely absolute balance for the spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators (from (5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference of grid cell numbers between the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is of priority and the others are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, which is the sum of the performance indicators multiplied by their corresponding factor weights, is a net indicator signifying how good a clustering process is, considering all the performance attributes.
G_l = |Likelihood / Time|,    (5)

G_b = Difference of Balance / Time,    (6)

G_d = Density / Time,    (7)

G_c = Coverage / Time,    (8)

G_o = Overlap / Time,    (9)

G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o,    (10)

Constraint: ω_l + ω_d + ω_b + ω_c + ω_o = 1.    (11)
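Indicators (5) to (11) can be computed as a simple weighted sum. A minimal sketch; the per-method measurements below are illustrative placeholders (the paper's Table 12 records overlap only as yes/no, so a numeric overlap amount is an assumption here):

```python
import numpy as np

def g_net(likelihood, balance_diff, density, coverage, overlap, time, w):
    """Net performance indicator G_net from Eqs. (5)-(11).
    w = (w_l, w_b, w_d, w_c, w_o) and must sum to 1."""
    assert abs(sum(w) - 1.0) < 1e-9          # constraint (11)
    g_l = abs(likelihood / time)             # (5)
    g_b = balance_diff / time                # (6)
    g_d = density / time                     # (7)
    g_c = coverage / time                    # (8)
    g_o = overlap / time                     # (9)
    return float(np.dot(w, [g_l, g_b, g_d, g_c, g_o]))  # (10)

# Equal weights; hypothetical measurements for two methods
w = (0.2, 0.2, 0.2, 0.2, 0.2)
scores = {
    "KM": g_net(-17.35, 1.90, 3.9, 0.60, 0.0, 0.41, w),
    "LP": g_net(-17.00, 0.00, 5.4, 0.71, 0.0, 7.76, w),
}

# Normalize so that the lowest score becomes the base value 1
base = min(scores.values())
normalized = {m: s / base for m, s in scores.items()}
print(sorted(normalized))
```

Because every indicator is divided by the running time, a fast method gains on all five terms at once; the weights then decide how the remaining trade-offs are ranked.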
From the results of spatial grouping as experimented in the previous sections, we obtain some statistical information on each group based on the second dataset, expressed as the range of indicators depicted in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best run time and no overlap. The XM method, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as the base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This was tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the result is not so consistent, as LP would be outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under overall consideration of the six performance factors. The weights, which imply priorities or preferences on the performance aspects, should be chosen at the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. When viewed afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method  Coverage  Density  Time   Log-likelihood  Overlap  Diff of balance
KM      0.595751  3896873  0.41   −17.35          No       190
XM      0.533037  3486653  0.67   −17.22          No       185
EM      0.507794  6819714  1.23   −16.57          Yes      1216
DBScan  0.461531  8230647  15.67  −17.54          Yes      2517
HC      0.677124  5981504  14.78  −20.13          Yes      103
LP      0.711025  5440447  7.76   N/A             No       0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32
purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, with certain sizes and positions, using clustering algorithms or their equivalent for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial clustering algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, there has been no study reported in the literature, as far as the authors are aware, that uses the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups yielding maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to further enhance the algorithm, for example by combining the LP method with existing spatial grouping algorithms to form new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. It would be good if, in the fusion algorithms to be developed, the advantages of one algorithm could carry over to the others.
[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963
[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006
Figure 1: Representation of how a real-world spatial area is represented by vector and raster encoding formats.
Figure 2: Vector format; on a width-height grid indexed by column and row, the example point has (x, y) coordinates (9, 3).
80 74 62 45 45 34 39 56
80 74 74 62 45 34 39 56
74 74 62 62 45 34 39 39
62 62 45 45 34 34 34 39
45 45 45 34 34 30 34 39
Figure 3: Raster format in ordered list.
collected spatial data. In this process, different methods are tested for choosing the one which covers the most area as well as the highest feature values from the suggested clusters. The flow of this process, including the preprocessing of sensor data,
Figure 4: Raster data with center point.
Figure 5: Raster format with 2×2 and 3×2 grids.
data transformation, clustering, and finding cluster center-points, is shown in Figure 6.
In the case of a satellite image, or an image captured by a fighter jet or another surveillance camera, image processing is needed to
Figure 6: Workflow of the proposed methodology. Preprocessing of the image: load the spatial image, convert the RGB image to a gray image, and perform skeleton extraction (MATLAB's morphological operation Bwmorph and Zhang's algorithm are used for comparison) to produce a two-tone image. Data transformation: grid and index the image to obtain a numerical dataset (with normalization). Grouping: spatial grouping by the clustering algorithms (hierarchical, K-means, DBScan) and LP. Display: output as a color map.
extract the density information from the pictures. But in our case of a sensor network, we can safely assume that the data fed from a net of sensors would have the sensor ID attached. The sensor IDs are known, and so are their positions. From the locations of the sensors and their sensor IDs, we can relate the data that were collected to their corresponding locations in the x-y format of coordinates (assuming the terrain is 2D). In order to reduce the huge amount of calculation and storage space, a grid was used to divide the whole map into smaller pieces. The grid indexing operation is repeated for a range of different coarse layers, thereby providing different resolutions of data partitions. A similar technique, computed by Euclidean distance, is reported in [22]. Obviously, the method of grid indexing helps separate data into cells based on their geographic locations.
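The grid indexing step can be sketched as follows (an illustrative sketch; the cell size and sample readings are our own assumptions, not values from the paper):

```python
from collections import defaultdict

def grid_index(readings, cell_size):
    """Aggregate (x, y, value) sensor readings into grid cells.

    Each reading is assigned to the cell containing its coordinates;
    values falling into the same cell are summed.
    """
    cells = defaultdict(float)
    for x, y, value in readings:
        cell = (int(x // cell_size), int(y // cell_size))
        cells[cell] += value
    return dict(cells)

# Hypothetical readings: (x, y, sensed traffic volume).
readings = [(1.0, 1.5, 10.0), (1.8, 1.1, 5.0), (7.2, 3.9, 8.0)]

# Repeating the indexing with a coarser cell size gives a
# lower-resolution partition of the same data.
fine = grid_index(readings, cell_size=2.0)    # {(0, 0): 15.0, (3, 1): 8.0}
coarse = grid_index(readings, cell_size=8.0)  # {(0, 0): 23.0}
```

Re-running the same function with different cell sizes yields the different resolution layers described above.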
To obtain a better result of spatial groups for maximum coverage and their corresponding cluster center points under certain constraints, this research adopts several popular clustering methods and a linear programming method, using software programs such as XLMiner (http://www.solver.com/xlminer-data-mining), MATLAB (http://www.mathworks.com/products/matlab), and Weka (http://www.cs.waikato.ac.nz/ml/weka).
The core purpose of cluster analysis is to comprehend and to distinguish the extent of similarity or dissimilarity among the independently clustered objects. Five major clustering methods are considered here: KM, EM, XM, HC, and DBScan.
K-means (KM), by MacQueen, 1967, is one of the simplest algorithms that solve the well-known clustering problem [23]. It is an easy and simple method to divide a dataset into a certain number of clusters, the main idea being that the number of clusters k is initially assumed to be fixed a priori. The random choice of the initial location of centroids leads to various results; a better choice is to place them as far away from each other as possible.
The KM algorithm aims at minimizing an objective function; in this case, a squared error function:

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2,   (1)
where j ranges from 1 to k, i ranges from 1 to n, and \| x_i^{(j)} - c_j \|^2 is a chosen distance measure between a data point x_i^{(j)} and the cluster center c_j; it is an indicator of the distance of the n data points from their respective cluster centers. The sum of squared Euclidean distances from the mean of each cluster is the usual measure of scattering in all directions within a cluster, used to test the suitability of the KM algorithm. Clusters are often computed using a fast heuristic method, which generally produces good (but not necessarily optimal) solutions.
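A minimal sketch of the KM iteration that minimizes objective (1) may look like this (the toy points, k, and the handling of empty clusters are our own illustrative choices, not the implementation used in the experiments):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means: random initial centroids, then alternating
    assignment and centroid-update steps that reduce the squared error."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [mean(cl) if cl else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def mean(cl):
    return (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))

# Two well-separated blobs; K-means recovers them as two clusters.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```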
X-means (XM) [24] is an optimized variant of KM which improves the structure part of the algorithm: splitting of a center is attempted within its region, and the decision between keeping the parent center and adopting its children is made by comparing the two structures. Another improved variant of KM, called EM (expectation maximization), assigns to each point a probability distribution over the clusters rather than a hard membership. The number of clusters to be set up is decided by EM using cross-validation.
Density-based algorithms regard clusters as dense areas of objects that are separated by less dense areas [25]. Because they are not limited to looking for clusters with spherical shape, they can produce clusters with arbitrary shapes. DBScan (density-based spatial clustering of applications with noise) is a typical implementation of density-based algorithms
Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.
[25]. The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density-connected to a particular center. Any object that falls beyond a cluster is considered noise.
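The density-reachability idea can be illustrated with a compact sketch (not the implementation used in the experiments; the eps and min_pts values below are illustrative):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBScan: grow clusters from density-reachable core points;
    points reachable from no core point are labeled noise (-1)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = -1           # noise (may become a border point later)
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: density-connected, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neighbours = region(points, j, eps)
            if len(neighbours) >= min_pts:
                queue.extend(neighbours)  # only core points keep expanding
    return labels

def region(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

# Two dense runs of points plus one isolated (noisy) point.
points = [(0, 0), (0.5, 0), (1, 0), (10, 10), (10.5, 10), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.0, min_pts=3)  # [0, 0, 0, 1, 1, 1, -1]
```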
Ward proposed a clustering method called hierarchical clustering (HC) in 1963 [27]. It tries to find how to form a sequence of partitions P_n, P_{n-1}, ..., P_1 in a way that minimizes the loss of information associated with each grouping. In each analysis step, it considers every possible cluster pair in the group and combines the two clusters whose merger results in the smallest increase in "information loss", which Ward defined in terms of ESS (an error sum-of-squares criterion). The idea that supports Ward's proposal can be described most simply by thinking of a little single-variable dataset. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0) as an example. Treating the ten scores as one unit with mean 3.4, the loss of information measured by ESS is

ESS_one group = (2 - 3.4)^2 + (7 - 3.4)^2 + ... + (0 - 3.4)^2 = 70.4.

However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. Evaluating the ESS of each group as a sum of squares, we obtain four independent error sums of squares, each equal to zero. Overall, the result that divides the 10 objects into 4 clusters has no loss of information:

ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.   (2)
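The information-loss computation in this example can be checked directly (a small sketch; ess is our own helper name):

```python
def ess(scores):
    """Error sum of squares of one group about its own mean."""
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores)

scores = [2, 7, 6, 6, 7, 2, 2, 0, 2, 0]
one_group = ess(scores)  # 70.4: all ten scores kept as a single group

# Splitting by score value gives four internally identical groups.
groups = [[0, 0], [2, 2, 2, 2], [6, 6], [7, 7]]
four_groups = sum(ess(g) for g in groups)  # 0.0: no information loss
```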
The last method we adopted here is linear programming (LP), which consists of formulating and producing an answer to optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many options are possible in the answers. In spatial grouping over a large grid, many possible combinations of positioning the clusters exist. The problem here is to find a certain number of clusters of equal size over the area; meanwhile, the chosen centers of the clusters must be a sufficient distance apart from each other so as to avoid overlapping. As an example shown in Figure 7, three clusters would have to be assigned over a spatial area in a way that they cover certain resources, and the assignment would have to yield the maximum total value summed from the covered resources. In the example, the left diagram shows allocating three clusters over the deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, clusters the high-altitude parts of the area. The last example tries to cover the maximum number of human inhabitants, who are concentrated at the coves. Given many possible ways of setting up these clusters, LP is used to formulate this allocation problem with the objective of maximizing the values of the covered resources.
Assuming that the resources could be dynamic, for example, animal herds or moving targets whose positions may swarm and change over time, the optimization is a typical maximal flow problem (or max-flow problem). The optimization is a type of network flow problem in which the goal is to determine the maximum amount of flow that can occur over an arc, which is limited by some capacity restriction. This type of network might be used to model the flow of oil in a pipeline (in which the amount of oil that can flow through a pipe in a unit of time is limited by the diameter of the pipe). Traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by the number of lanes in the streets and speed limits [28].
For our spatial clustering, we consider each cell of the grid as a node; each node is defined as a tuple m that contains the coordinates and the value of the resource held in the node, such that m(x_i, y_i, z_i) represents the ith node, in which (x_i, y_i) represents the position and z_i represents the value of the resource in the node, respectively. For the clusters, each node
(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)   Repeat (through all coordinates of y)
(4)     If (boundary constraints and overlapping constraints are satisfied) Then
(5)       S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)     End-if
(7)   End-loop
(8) End-loop
(9) If size-of(S) >= K
(10)   Find top K clusters where max sum z_i ⊕ C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i), for all i in C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i), for all i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.
can potentially be the center of a cluster, and each cluster has a fixed radius of length r. The LP model for our problem is mathematically shown as follows:

Total value = \bigcup_{selected clusters \langle C_k | k = 1, ..., K \rangle} \sum_{m_i \in C_k} m_i(*, *, z_i)
            = argmax_{X,Y} \sum_{0 \le x_i \le X, 0 \le y_j \le Y} \sum_{k=1}^{K} z_l \ni m_l(x_i, y_j, z_l) \oplus c_k,   (3)

subject to the boundary constraints 2r \le |x_i - x_j| and 2r \le |y_i - y_j| for all i and j with i \ne j, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively, K is the maximum number of clusters, and c_k is the kth cluster under consideration in the optimization.

In order to implement the computation as depicted in (3), for each node we sum the resources of the group in the shape of a diamond (which geometrically approximates a circle). By iterating through every combination of K nodes in the grid of size X by Y, each current node in the combination is tested by considering it as the center of a cluster that has a radius of r, hence storing the resource values of the nodes from the potential clusters into a temporary array buffer A(*, *, z_i). The results from those potential clusters which do satisfy the boundary and nonoverlapping constraints are then copied to a candidate buffer S. Out of the clusters whose resource values are stored in the candidate buffer S, the combination of K clusters that has the greatest total resource value is selected, and their values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
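The exhaustive search of Pseudocode 1 can be sketched as follows for small grids (a brute-force illustration with our own toy data; the Manhattan-distance test is one simple way to enforce the non-overlap constraint for diamond-shaped clusters):

```python
from itertools import combinations

def cluster_value(grid, cx, cy, r):
    """Sum resource values inside a diamond |dx| + |dy| <= r around (cx, cy)."""
    total = 0.0
    for x in range(max(0, cx - r), min(len(grid), cx + r + 1)):
        for y in range(max(0, cy - r), min(len(grid[0]), cy + r + 1)):
            if abs(x - cx) + abs(y - cy) <= r:
                total += grid[x][y]
    return total

def best_clusters(grid, k, r):
    """Exhaustively choose k non-overlapping diamond clusters that
    maximize the total covered resource value."""
    cells = [(x, y) for x in range(len(grid)) for y in range(len(grid[0]))]
    best_value, best_centers = -1.0, None
    for centers in combinations(cells, k):
        # Diamonds of radius r share no cell iff the Manhattan
        # distance between their centers exceeds 2r.
        if any(abs(a[0] - b[0]) + abs(a[1] - b[1]) <= 2 * r
               for a, b in combinations(centers, 2)):
            continue
        value = sum(cluster_value(grid, x, y, r) for x, y in centers)
        if value > best_value:
            best_value, best_centers = value, centers
    return best_centers, best_value

# Toy 5x5 grid: uniform background with two resource hot spots.
grid = [[1.0] * 5 for _ in range(5)]
grid[0][0] = 10.0
grid[4][4] = 10.0
centers, value = best_clusters(grid, k=2, r=1)  # best total value: 26.0
```

The combinatorial loop makes this feasible only for small grids, which is why the paper resorts to an LP formulation for the full-size datasets.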
5. Experimental Results and Analysis

In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place. The resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point
Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                       Bwmorph function          Thinning algorithm
                       Dataset 1    Dataset 2    Dataset 1    Dataset 2
Degree of thinning     Incomplete                Complete
Elapsed time (secs)    20           38           100          198
Complexity             O(n)                      O(n^2)
of the roads, whereby a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.
5.1. Data Preprocessing. Two different factual datasets are used for the experiments. The first dataset is a traffic volume map published by the Maricopa Association of Governments in 2008. Traffic volumes were derived from the national traffic recording devices, and seasonal variation is factored into the volumes. The second dataset is an annual average daily traffic of the Baltimore County Traffic Volume Map in 2011 in the USA, prepared by the Maryland Department of Transportation and published by March 19, 2012. The traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.
After using skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites where they can see the traffic volume data that are associated with our two datasets: (a) representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (http://phoenix.gov/streets/trafficvolume-map); (b) representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (http://www.marylandroads.com/Traffic_Volume_Maps/Traffic_Volume_Maps.pdf). The corresponding result of skeleton extraction
Table 2: Important statistics from the clustering and LP experiments.

Method   Cluster      Number of cells covered   Minimum    Maximum     Overlap
KM       Cluster 1    428                       0          3499327     0
         Cluster 2    468                       0          546896      0
         Cluster 3    448                       0          20503007    0
         Cluster 4    614                       0          6894667     0
         Cluster 5    618                       0          900908      0
XM       Cluster 1    615                       0          591265      0
         Cluster 2    457                       0          546896      0
         Cluster 3    609                       0          900908      0
         Cluster 4    465                       0          3499327     0
         Cluster 5    430                       0          20503007    0
EM       Cluster 1    1223                      0          2292        61817229
         Cluster 2    7                         141048     243705      313018
         Cluster 3    81                        0          3033733     131146577
         Cluster 4    64                        26752      546896      330881249
         Cluster 5    1201                      0          1300026     217950471
DB       Cluster 1    13                        23614      33146       327222911
         Cluster 2    11                        1686825    21001       363965818
         Cluster 3    13                        178888     2945283     196118393
         Cluster 4    11                        847733     211008      58940877
         Cluster 5    2528                      0          546896      20554176
HC       Cluster 1    291                       0          3499327     0
         Cluster 2    191                       0          20503007    96762283
         Cluster 3    294                       0          1590971     0
         Cluster 4    224                       0          189812      12673555
         Cluster 5    243                       0          546896      0
LP       Cluster 1    221                       0          3499327     0
         Cluster 2    221                       0          20503007    0
         Cluster 3    221                       0          1590971     0
         Cluster 4    221                       0          189812     0
         Cluster 5    221                       0          546896      0
Table 3: Comparison of running time (in seconds) for the first dataset.

Formats              KM     HC      DBScan   XM     EM     LP
Vector database      3.27   12.52   23.24    2.78   9.30   1.83
Raster database      3.42   15.36   28.20    2.84   9.84   2.01
RasterP (16 grids)   1.98   1.34    5.08     0.46   0.57   0.78
RasterP (25 grids)   0.09   0.14    1.15     0.21   0.12   0.53
in dataset 1 is shown in Figure 8, where (a) adopted a morphological operation method and (b) adopted the thinning algorithm, respectively. Likewise, the corresponding result of skeleton extraction in the second dataset is shown in Figure 9, where (a) adopted a morphological operation method and (b) adopted the thinning algorithm, respectively. The comparison result for the two datasets is shown in Table 1.
For the raw dataset, we first perform image preprocessing over it to obtain the numerical database.

The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing. The clustering by grid can then be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer nested iteration in the program code.

The choice of placing a grid on the image follows one principle: mesh segmentation should not fall on a concentrated position of traffic flow. Since there is no endpoint, the midpoint between two adjacent values was considered a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. This digital data for the traffic map serves as the initial data for the subsequent clustering process.
5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, with the number of clusters set at five and the maximum iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,
Table 4: Comparison of log-likelihood for the first dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -12.41868   -14.07265   -13.28599   -11.9533    -12.49562
Raster database      -13.42238   -15.02863   -13.78889   -12.9632    -13.39769
RasterP (16 grids)   -12.62264   -14.02266   -12.48583   -12.39419   -12.44993
RasterP (25 grids)   -12.41868   -13.19417   -11.22207   -12.48201   -11.62048
Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.
Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.
Table 5: Comparison of running time (in seconds) for the second dataset.

Formats              KM     HC      DBScan   XM     EM      LP
Vector database      1.39   1.34    15.53    1.53   10.05   3.37
Raster database      2.41   14.78   18.34    2.17   8.23    1.96
RasterP (16 grids)   0.47   8.01    12.74    0.45   3.77    1.44
RasterP (25 grids)   0.35   6.20    10.98    0.36   2.96    1.18
the weights for the three attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)) can be varied (fine-tuned), subject to the constraint that the sum of the weights must equal 1. We tested several variations, searching for the best clustering results: (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) weight of v is 0; (8) same weights except when g_i(v_i = 0); and (9) weights of x and y are both 0 except when g_i(v_i = 0).
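The attribute weighting can be sketched as a simple rescaling applied before any distance-based clustering; the records and the 40% weighting of v (case (2)) below are illustrative assumptions:

```python
def weight_features(records, wx, wy, wv):
    """Scale normalized (x, y, v) attributes by weights that sum to 1,
    so that distance-based clustering reflects the chosen emphasis."""
    assert abs(wx + wy + wv - 1.0) < 1e-9
    return [(wx * x, wy * y, wv * v) for x, y, v in records]

# Hypothetical normalized grid records (x, y, v); case (2) puts
# 40% of the weight on v, the rest shared between x and y.
records = [(0.1, 0.2, 0.9), (0.8, 0.7, 0.1)]
weighted = weight_features(records, wx=0.3, wy=0.3, wv=0.4)
```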
In the HC method, normalization of the input data was chosen. Another option available is the similarity measure: Euclidean distance is adopted to measure raw numeric data, while the other two options, Jaccard's coefficients and the matching coefficient, are activated only when the data are binary.
For the above nine cases, the results of cases (1) to (6) are similar within their respective methods, and the result of case (9) is the worst, accomplishing no useful clustering at all. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.
In the cluster distribution resulting from the KM clustering method, more than half of the data points are clamped into one oversized cluster. The result of this method is therefore not helpful for further operation. With the HC method, data on average are allocated into separate clusters. The result
Table 6: Comparison of log-likelihood for the second dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -17.35412   -19.62367   -17.53576   -17.21513   -16.57263
Raster database      -18.15926   -20.12568   -19.70756   -18.15791   -18.48209
RasterP (16 grids)   -15.51437   -17.24736   -16.37147   -17.01283   -15.66231
RasterP (25 grids)   -14.84761   -16.63789   -15.09146   -16.67312   -16.47823
Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%; the top half uses the KM clustering method and the bottom half uses the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%; top half KM, bottom half HC. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0; top half KM, bottom half HC. (d) Clustering results for the first dataset with setting case (8), where all attributes share the same weight except g_i(v_i = 0); top half KM, bottom half HC.
Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size        KM      HC       DBScan   XM     EM       LP
100 grid cells      0.06    0.07     1.05     2.19   3.21     0.18
4600 grid cells     0.42    2.95     39.89    2.73   19.05    9.37
10000 grid cells    2.62    46.67    97.55    2.97   37.85    24.21
80000 grid cells    19.75   189.61   684      6.47   198.31   90.83
in Figure 10(c) is the best, showing only the one with distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; therefore allocation of critical resources, for example, to each cluster may result in a waste of resources. The degree of overlap is the least in the result of Figure 10(b). If only location is being considered, the result in Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.
The clustering results of the second dataset obtained by the two methods, KM and HC, are shown in Figure 11.
From the results of the cluster distribution of the second dataset obtained by both clustering methods, the sizes of the clusters are more or less similar, which is better than in the first dataset, and there is no overlap phenomenon in the KM results. This is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two clustering methods for consideration of even cluster distribution and overlap avoidance.
5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is that there be no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen at five. The
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.029436   0.003786   0.017902   0.075178   0.013153   0.028985
Cluster 1        0.301538   0.269602   0.208078   0.049761   0.026016   0.377034
Cluster 2        0.215277   0.001627   0.158439   0.084049   0.12436    0.080099
Cluster 3        0.046788   0.096221   0.079177   0.209390   0.001172   0.217204
Cluster 4        0.002712   0.161799   0.044197   0.043152   0.3043     0.007704
Total coverage   0.595751   0.533036   0.507793   0.461531   0.469      0.711025
Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.
result for the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result, and the second part (ii) visualizes the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and then the groups are visualized over the clustering results according to the computed centers.
In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven; more than half of the dataset falls into one cluster. This result reveals that the technique cannot organize a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups have the overlap phenomenon too. For the result of (c), the sizes of the clusters are uneven too. For the results of (b) and (d), the sizes of the clusters seem similar to each other, and there is also no overlap in the clustering result; but for the group result, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of one cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (b) KM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.
With the same experiment setup and operating environment, the spatial clustering experiments were performed over the second dataset. The results of the second dataset are shown in Figure 13, where (i) represents the spatial clustering result and (ii) represents the corresponding spatial group by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.
In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM, however, avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we remove the empty cells on the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap, and the clusters are balanced with each other. But there is still overlap in the spatial groups. Thus the LP method is adopted to solve this problem while keeping the groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
By visually comparing the clustering results of the two datasets, the clustering results seem to be similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between clusters. In the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little far apart compared to those in the first dataset.
Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected and shown in Table 2. The numeric results in
12 International Journal of Distributed Sensor Networks
Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.
Table 3 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the number of cells covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.
6. Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (short for time), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion, using the same software on the same computer. Balance is used to measure the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests for goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of traffic volumes that are covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial group by the LP method on dataset 2.
traffic volumes that are covered by all the clusters minus the overlaps, if any. The corresponding definitions are shown in the equations below:

Density(cluster i) = ∑ Traffic Volumes(cluster i) / Grid Cell Number(cluster i),

Coverage(cluster i) = ∑ Traffic Volumes(cluster i) / ∑ Traffic Volumes,

Total Coverage = ∑ Traffic Volumes − Overlaps,

Proportion of Cluster i Size (Balance) = Grid Cell Number(cluster i) / ∑ Grid Cell Number.
(4)
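As a concrete sketch of the definitions in (4), the per-cluster metrics can be computed from a flat array of grid-cell traffic volumes and a cluster label per cell. The function and variable names here are illustrative, not from the paper, and coverage follows the textual definition (the cluster's share of the total traffic volume):

```python
import numpy as np

def cluster_metrics(volumes, labels, k):
    """Per-cluster density, coverage, and balance in the sense of Eq. (4).

    volumes: 1-D array with one traffic volume per grid cell.
    labels:  cluster index (0..k-1) assigned to each grid cell.
    """
    total_volume = volumes.sum()
    total_cells = len(volumes)
    stats = {}
    for i in range(k):
        mask = labels == i
        cells = int(mask.sum())
        vol = volumes[mask].sum()
        stats[i] = {
            "density": vol / cells,          # average volume per grid cell
            "coverage": vol / total_volume,  # share of all traffic covered
            "balance": cells / total_cells,  # proportion of cluster size
        }
    return stats
```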
6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of the first dataset on which to run the clustering algorithms. Vector (n, v) represents sequence number n and traffic volume v. Raster (x, y, v) represents coordinates (x, y) and traffic volume v. RasterP (16 grids) means that every four neighboring cells of the grid are merged into a single unit, and RasterP (25 grids) means that every five neighboring cells are merged into one. In the latter two formats, the data are laid directly on a grid, and some noise, such as outlier values, is eliminated from the grid; we selected grid sizes of 16 and 25 for these two formats. The original datasets are then encoded in the four data formats, and the four formatted datasets are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.
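The data-format conversions described above can be sketched as follows. The function names and the square-block reading of RasterP (4×4 = 16 cells, 5×5 = 25 cells merged into one) are assumptions for illustration:

```python
import numpy as np

def to_raster(vector, width):
    """Vector (n, v) -> Raster layout: lay sequential readings on a grid."""
    return np.asarray(vector, dtype=float).reshape(-1, width)

def raster_p(grid, block):
    """RasterP: merge each block x block neighborhood into one cell (sum).

    block=4 and block=5 would correspond to the paper's 16- and 25-cell
    variants under the square-block assumption."""
    h, w = grid.shape
    h2, w2 = h - h % block, w - w % block  # drop boundary cells that do not fill a block
    g = grid[:h2, :w2]
    return g.reshape(h2 // block, block, w2 // block, block).sum(axis=(1, 3))
```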
According to Table 3, KM spent the least running time on the four different kinds of data, and the runtime on the RasterP (25 grids) dataset is the fastest. Contrariwise, clustering the Vector dataset using the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the datasets and DBScan took the longest.
In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main metric for quantitatively assessing the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best case, while clustering RasterP (25 grids) using DBScan is the worst.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset by duplicating the data map to larger sizes. Running time trends are thereby produced; the results are shown in Table 7, and the corresponding trend lines are shown in Figure 14.
According to Table 5, KM spent the shortest running time on the four different formats of data, and the time on the RasterP (25 grids) dataset is the fastest, which is expected because it abstracts every 25 cells into one. On
Figure 14: Comparison of running time (in seconds) for different sizes of dataset.
the other hand, clustering the Raster dataset using the DBScan method spent the most running time. Across the six methods, KM spent the shortest time on the different datasets, and DBScan generally spent the longest.
In Table 6, we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best case, while clustering RasterP (25 grids) using KM is the worst.
In Table 7, we can see that the slowest method is DBScan and the quickest is KM. In terms of time trends, DBScan's time consumption increases in larger magnitude than the other methods, whereas the trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.
From Figure 15, we can see that one cluster of DBScan takes the biggest coverage among all clusters resulting from the six methods on the first dataset, but for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is apparently larger than in the first dataset (Tables 8 and 9). This means that the second dataset, with its even data distribution, is well suited for forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Balance test on dataset 1 (cluster-size shares, %): (a) KM: 4, 51, 36, 8, 1; (b) XM: 1, 50, 1, 18, 30; (c) EM: 6, 22, 24, 30, 18; (d) DBScan: 24, 24, 17, 20, 15; (e) HC: 18, 17, 22, 19, 25; (f) LP: 20, 20, 20, 20, 20.
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a), we can see that one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, but the LP method obtains the largest total density, evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than that of the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset: DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.
Balance test on dataset 2 (cluster-size shares, %): (a) KM: 17, 18, 17, 24, 24; (b) XM: 24, 18, 24, 18, 17; (c) EM: 47, 0.32, 47 (remaining slices unlabeled); (d) DBScan: 1.0, 1.0, 98 (remaining slices unlabeled); (e) HC: 23, 15, 24, 18, 20; (f) LP: 20, 20, 20, 20, 20.
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators (from (5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049

(Values are reproduced as printed; the decimal points were lost in extraction.)
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447

(Values are reproduced as printed; the decimal points were lost in extraction.)
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference of grid cell numbers between the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the others are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is, considering all the performance attributes:
G_l = |Likelihood / Time|, (5)
G_b = Difference of Balance / Time, (6)
G_d = Density / Time, (7)
G_c = Coverage / Time, (8)
G_o = Overlap / Time, (9)
G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o, (10)
subject to the constraint ω_l + ω_b + ω_d + ω_c + ω_o = 1. (11)
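The indicator equations (5) to (10) can be sketched directly in code. This is an illustrative helper (not from the paper); the overlap and balance-difference terms are passed in as precomputed scalars:

```python
def g_net(time, likelihood, balance_diff, density, coverage, overlap, weights):
    """Net performance indicator G_net per Eqs. (5)-(11).

    weights = (w_l, w_b, w_d, w_c, w_o); must sum to 1 (Eq. (11))."""
    w_l, w_b, w_d, w_c, w_o = weights
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must satisfy Eq. (11)"
    g_l = abs(likelihood / time)   # Eq. (5)
    g_b = balance_diff / time      # Eq. (6)
    g_d = density / time           # Eq. (7)
    g_c = coverage / time          # Eq. (8)
    g_o = overlap / time           # Eq. (9)
    return w_l * g_l + w_b * g_b + w_d * g_d + w_c * g_c + w_o * g_o  # Eq. (10)
```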
From the results of the spatial grouping experiments in the previous sections, we obtain some statistical information on each group based on the second dataset, expressed as the indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best running time and no overlap. The XM method, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between the clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This is tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the result is not so consistent, as LP would be outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which imply priorities or preferences on the performance aspects, should be chosen at the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method  Coverage  Density  Time   Log-likelihood  Overlap  Diff. of balance
KM      0.595751  3896873  0.41   -17.35          No       1.90
XM      0.533037  3486653  0.67   -17.22          No       1.85
EM      0.507794  6819714  1.23   -16.57          Yes      12.16
DBScan  0.461531  8230647  15.67  -17.54          Yes      25.17
HC      0.677124  5981504  14.78  -20.13          Yes      1.03
LP      0.711025  5440447  7.76   N/A             No       0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32
purposes such as resource allocation, distribution evaluations, or summarizing the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, of certain sizes and positions, using clustering algorithms or their equivalent, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory); the performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limitations, such as overlaps, which may cause resources to be wasted and even false grouping. To the best of the authors' knowledge, however, no study reported in the literature has used the linear programming (LP) method to discover spatial groups and to overcome this limitation of overlapping. Thus, in this research, we implemented this new LP method to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to further enhance the algorithm, for example, by combining the LP method with existing spatial grouping algorithms to achieve new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It would be good if, in the new fusion algorithms to be developed, the advantages of one algorithm could carry over to the others.
References
[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.
[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.
[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.
[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.
[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.
[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.
[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.
[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.
[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.
[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.
[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.
[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.
[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.
[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.
[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.
[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.
[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.
[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.
[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.
[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.
[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale, event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.
[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.
[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.
[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.
Figure 6: Workflow of the proposed methodology. Preprocessing of the image: load spatial image (RGB), convert to gray image, and extract the skeleton (MATLAB morphological operation, Bwmorph; Zhang's algorithm is used for comparison) to obtain a two-tone image. Data transformation: grid and index the image to produce 2D spatial data and a numerical dataset (with normalization). Grouping: spatial grouping by Hierarchical, K-means, DBScan, and other clustering algorithms, plus LP. Display: output as a color map.
extract the density information from pictures. But in our case of a sensor network, we can safely assume that the data fed from a net of sensors would have the sensor ID attached. The sensor IDs are known, and so are their positions. From the locations of the sensors and their IDs, we can relate the collected data to their corresponding locations in x-y coordinates (assuming the terrain is 2D). In order to reduce the huge amount of calculation and storage space, a grid is used to divide the whole map into smaller pieces. The grid indexing operation is repeated for a range of coarser layers, thereby providing different resolutions of data partitions; a similar technique, computed by Euclidean distance, is reported in [22]. Obviously, the method of grid indexing helps separate the data into cells based on their geographic locations.
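A minimal sketch of this grid-indexing step, assuming a uniform cell size and known sensor coordinates (the function names are illustrative, not from the paper):

```python
from collections import defaultdict

def grid_index(x, y, x_min, y_min, cell):
    """Map a sensor position (x, y) to its (col, row) cell in a uniform grid."""
    return int((x - x_min) // cell), int((y - y_min) // cell)

def bin_readings(readings, x_min, y_min, cell):
    """Aggregate (x, y, volume) sensor readings into grid cells by summation."""
    cells = defaultdict(float)
    for x, y, v in readings:
        cells[grid_index(x, y, x_min, y_min, cell)] += v
    return dict(cells)
```

Calling `bin_readings` with a coarser `cell` value reproduces the coarser layers mentioned above, each giving a lower-resolution partition of the same data.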
To obtain a better result of spatial groups for maximum coverage and the corresponding cluster center points under certain constraints, this research adopts several popular clustering methods and the linear programming method, using software programs such as XLMiner (http://www.solver.com/xlminer-data-mining), MATLAB (http://www.mathworks.com/products/matlab), and Weka (http://www.cs.waikato.ac.nz/ml/weka).
The core purpose of cluster analysis is to comprehend and to distinguish the extent of similarity or dissimilarity among the independently clustered objects. There are five major methods of clustering: KM, EM, XM, HC, and DBScan.
K-means (KM), by MacQueen (1967), is one of the simplest algorithms that solve the well-known clustering problem [23]. It is an easy and simple method to divide a dataset into a certain number of clusters; the main idea is that the number of clusters, k, is fixed a priori. The random choice of the initial locations of the centroids leads to various results; a better choice is to place them as far away from each other as possible.
The KM algorithm aims at minimizing an objective function; in this case, a squared error function:

J = ∑_{j=1}^{k} ∑_{i=1}^{n} ‖x_i^{(j)} − c_j‖², (1)

where ‖x_i^{(j)} − c_j‖² is a chosen distance measure between a data point x_i^{(j)} and the cluster center c_j, so that J is an indicator of the distance of the n data points from their respective cluster centers. The sum of squared Euclidean distances from the mean of each cluster is the usual measure of scattering in all directions within the cluster, used to test the suitability of the KM algorithm. Clusters are often computed using a fast heuristic method, which generally produces good (but not necessarily optimal) solutions.
X-means (XM) [24] is an optimized variant of KM that improves the structure part of the algorithm: the division of each center is attempted within its region, and a decision is made between the root and the children of each center by comparing the two structures. Another improved variant of KM, called EM (expectation maximization), assigns to each point a probability distribution representing its membership in each cluster. The number of clusters to be set up is decided by EM using cross-validation.
Density-based algorithms regard clusters as dense areas of objects that are separated by less dense areas [25]. Because they are not limited to looking for clusters of spherical shape, they can produce clusters of arbitrary shape. DBScan is a typical implementation of density-based algorithms, called density-based spatial clustering of applications with noise
Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.
[25]. The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density-connected to a particular center. Any object that falls beyond a cluster is considered noise.
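The density-reachability expansion that DBScan performs can be sketched as follows. This is a minimal illustration, not the original implementation; `eps` and `min_pts` are the usual neighborhood radius and core-point threshold:

```python
import numpy as np

def dbscan(pts, eps, min_pts):
    """Minimal DBSCAN: clusters are density-connected regions; -1 marks noise."""
    n = len(pts)
    labels = np.full(n, -1)
    dist = np.sqrt(((pts[:, None] - pts[None, :]) ** 2).sum(-1))
    neigh = [np.where(dist[i] <= eps)[0] for i in range(n)]
    cid = 0
    for i in range(n):
        if labels[i] != -1 or len(neigh[i]) < min_pts:
            continue  # already claimed, or not a core point
        # grow a new cluster from core point i via density reachability
        stack, labels[i] = [i], cid
        while stack:
            j = stack.pop()
            for q in neigh[j]:
                if labels[q] == -1:
                    labels[q] = cid
                    if len(neigh[q]) >= min_pts:  # only core points keep expanding
                        stack.append(q)
        cid += 1
    return labels
```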
Ward proposed a clustering method called hierarchical clustering (HC) in 1963 [27]. It seeks the partitions P_n, P_{n-1}, ..., P_1 in a way that minimizes the loss associated with each grouping. In each analysis step, it considers every possible pair of clusters and combines the two clusters whose merger results in the smallest "information loss", which Ward defined in terms of ESS (an error sum-of-squares criterion). The idea behind Ward's proposal can be described most simply with a small set of data. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0) as an example. Treating the ten scores as one unit with mean 3.4, the information loss is the ESS computed as follows:

ESS one group = (2 - 3.4)^2 + (7 - 3.4)^2 + ... + (0 - 3.4)^2 = 70.4. (1)

However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. Evaluating the ESS as a sum of squares, we obtain four independent error sums, one per group. Since the scores within each group are identical, dividing the 10 objects into these 4 clusters incurs no loss of information:
ESS four groups = ESS group 1 + ESS group 2 + ESS group 3 + ESS group 4 = 0. (2)
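The ESS arithmetic above is easy to verify in a few lines; since the four groups each contain identical scores, their within-group sums vanish.

```python
def ess(scores):
    """Error sum of squares of a group about its own mean."""
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores)

scores = [2, 7, 6, 6, 7, 2, 2, 0, 2, 0]
one_group = ess(scores)                    # all ten scores treated as a single unit
groups = [[0, 0], [2, 2, 2, 2], [6, 6], [7, 7]]
four_groups = sum(ess(g) for g in groups)  # identical members, so each ESS is zero
```

Running this gives one_group = 70.4 and four_groups = 0, matching (1) and (2).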
The last method we adopt here is linear programming (LP), which consists of formulating and solving optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many options are possible in the answers. In spatial grouping over a large grid, many possible combinations of positioning the clusters exist. The problem here is to find a certain number of clusters of equal size over the area; meanwhile, the chosen centers of the clusters must be sufficiently distant from each other so as to avoid overlapping. As an example, shown in Figure 7, three clusters would have to be assigned over a spatial area in such a way that they cover certain resources, and the assignment would have to yield the maximum total value summed from the covered resources. In the example, the left diagram shows allocating three clusters over the deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, clusters the high-altitude parts of the area. The last example tries to cover the maximum number of human inhabitants, who are concentrated at the coves. Given many possible ways of setting up these clusters, LP is used to formulate this allocation problem with the objective of maximizing the value of the covered resources.
Assuming that the resources can be dynamic, for example animal herds or moving targets whose positions may swarm and change over time, the optimization is a typical maximal flow problem (or max-flow problem). It is a type of network flow problem in which the goal is to determine the maximum amount of flow that can occur over an arc that is limited by some capacity restriction. This type of network might be used to model the flow of oil in a pipeline (in which the amount of oil that can flow through a pipe in a unit of time is limited by the diameter of the pipe). Traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by the number of lanes and the speed limits [28].
For our spatial clustering, we consider each cell of the grid as a node; each node is defined as a tuple m that contains the coordinates and the value of the resource held in the node, such that m(x_i, y_i, z_i) represents the i-th node, in which x_i, y_i represent the position and z_i represents the value of the resource in the node, respectively. For the clusters, each node
(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)   Repeat (through all coordinates of y)
(4)     If (boundary constraints and overlapping constraints are satisfied) Then
(5)       S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)     End-if
(7)   End-loop
(8) End-loop
(9) If size-of(S) >= K
(10)   Find top K clusters where max sum z_i (+) C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i) for all i in C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i) for all i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.
can potentially be a center of a cluster, and each cluster has a fixed radius of length r. The LP model for our problem is expressed mathematically as follows:
Total value = Union over selected clusters <C_k | k = 1, ..., K> of sum over m_i in C_k of m_i(*, *, z_i)
            = argmax_{X,Y} sum over 0 <= x_i <= X, 0 <= y_j <= Y of sum_{k=1}^{K} z_l such that m_l(x_i, y_j, z_l) (+) c_k. (3)
This is subject to the boundary constraints 2r <= |x_i - x_j| and 2r <= |y_i - y_j| for all i and j with i != j, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively; K is the maximum number of clusters, and c_k is the k-th cluster under consideration in the optimization. In order to implement the computation as depicted in
(3), for each node we sum the resources of the group in the shape of a diamond (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each node in the current combination is tested by considering it as the center of a cluster of radius r, and the resource values of the nodes from the potential clusters are stored in a temporary array buffer A(*, *, z_i). The results from those potential clusters that satisfy the boundary and non-overlapping constraints are then copied to a candidate buffer S. Out of the clusters whose resource values are stored in the candidate buffer S, the combination of K clusters that has the greatest total resource value is selected, and its values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
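For a small grid, the search that Pseudocode 1 describes can be sketched by brute force. The sketch below assumes diamond-shaped (L1-ball) clusters of radius r and enforces non-overlap by requiring the L1 distance between centers to exceed 2r (a slight strengthening of the printed axis-wise constraint, chosen so that diamonds cannot share a cell); the exhaustive enumeration and all names are illustrative, and a real LP solver would replace it for larger grids.

```python
from itertools import combinations

def cluster_value(grid, cx, cy, r):
    """Sum of resource values inside the diamond (L1 ball) of radius r at (cx, cy)."""
    return sum(grid[x][y]
               for x in range(len(grid))
               for y in range(len(grid[0]))
               if abs(x - cx) + abs(y - cy) <= r)

def non_overlapping(centers, r):
    """True if no two diamond clusters of radius r can share a grid cell."""
    return all(abs(a[0] - b[0]) + abs(a[1] - b[1]) > 2 * r
               for a, b in combinations(centers, 2))

def best_k_clusters(grid, k, r):
    """Exhaustively pick K feasible centers maximizing the total covered value."""
    cells = [(x, y) for x in range(len(grid)) for y in range(len(grid[0]))]
    best, best_centers = -1, None
    for centers in combinations(cells, k):
        if not non_overlapping(centers, r):
            continue                     # violates the 2r separation constraint
        value = sum(cluster_value(grid, cx, cy, r) for cx, cy in centers)
        if value > best:
            best, best_centers = value, list(centers)
    return best_centers, best
```

On a toy 5x5 grid with two hotspots of value 10, the search places the two clusters over the hotspots, giving a total covered value of 20.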
5. Experimental Results and Analysis
In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place. The resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point
Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                     Bwmorph function        Thinning algorithm
                     Dataset 1  Dataset 2    Dataset 1  Dataset 2
Degree of thinning   Incomplete              Complete
Elapsed time (secs)  20         38           100        198
Complexity           O(n)                    O(n^2)
of the roads, thereby a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.
5.1. Data Preprocessing. Two different factual datasets are used for the experiments. The first dataset, published by the Maricopa Association of Governments in 2008, is a traffic volume map. Traffic volumes were derived from the national traffic recording devices, and seasonal variation is factored into the volumes. The second dataset is the annual average daily traffic of the Baltimore County Traffic Volume Map in 2011 in the USA, prepared by the Maryland Department of Transportation and published on March 19, 2012. The traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.
After using skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites, where they can see the traffic volume data associated with our two datasets: (a) the representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (http://phoenix.gov/streets/trafficvolumemap); (b) the representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (http://www.marylandroads.com/Traffic_Volume_Maps/Traffic_Volume_Maps.pdf). The corresponding result of skeleton extraction
Table 2: Important statistics from the clustering and LP experiments.

Method  Cluster    Number of cells covered  Minimum   Maximum     Overlap
KM      Cluster 1  428                      0         3499327     0
KM      Cluster 2  468                      0         546896      0
KM      Cluster 3  448                      0         20503007    0
KM      Cluster 4  614                      0         6894667     0
KM      Cluster 5  618                      0         900908      0
XM      Cluster 1  615                      0         591265      0
XM      Cluster 2  457                      0         546896      0
XM      Cluster 3  609                      0         900908      0
XM      Cluster 4  465                      0         3499327     0
XM      Cluster 5  430                      0         20503007    0
EM      Cluster 1  1223                     0         2292        61817229
EM      Cluster 2  7                        141048    243705      313018
EM      Cluster 3  81                       0         3033733     131146577
EM      Cluster 4  64                       26752     546896      330881249
EM      Cluster 5  1201                     0         1300026     217950471
DB      Cluster 1  13                       23614     33146       327222911
DB      Cluster 2  11                       1686825   21001       363965818
DB      Cluster 3  13                       178888    2945283     196118393
DB      Cluster 4  11                       847733    211008      58940877
DB      Cluster 5  2528                     0         546896      20554176
HC      Cluster 1  291                      0         3499327     0
HC      Cluster 2  191                      0         20503007    96762283
HC      Cluster 3  294                      0         1590971     0
HC      Cluster 4  224                      0         189812      12673555
HC      Cluster 5  243                      0         546896      0
LP      Cluster 1  221                      0         3499327     0
LP      Cluster 2  221                      0         20503007    0
LP      Cluster 3  221                      0         1590971     0
LP      Cluster 4  221                      0         189812      0
LP      Cluster 5  221                      0         546896      0
Table 3: Comparison of running time (in seconds) for the first dataset.

Formats             KM     HC     DBScan  XM     EM     LP
Vector database     3.27   12.52  23.24   2.78   9.30   1.83
Raster database     3.42   15.36  28.20   2.84   9.84   2.01
RasterP (16 grids)  1.98   1.34   5.08    0.46   0.57   0.78
RasterP (25 grids)  0.09   0.14   1.15    0.21   0.12   0.53
in dataset 1 is shown in Figure 8, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm, respectively. Likewise, the corresponding result of skeleton extraction for the second dataset is shown in Figure 9, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm, respectively. The comparison result for the two datasets is shown in Table 1.
For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.
The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing. The clustering by grid can then be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.
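The paper does not spell out the thinning algorithm beyond its two nested sub-iterations; the classic Zhang-Suen two-subiteration scheme matches that description and is sketched below as an illustrative stand-in, not the authors' exact code. The binary image is a list of lists with 1 marking a road pixel.

```python
def neighbours(img, x, y):
    """8-neighbourhood of (x, y), clockwise from north: p2..p9."""
    return [img[x-1][y], img[x-1][y+1], img[x][y+1], img[x+1][y+1],
            img[x+1][y], img[x+1][y-1], img[x][y-1], img[x-1][y-1]]

def zhang_suen_thinning(image):
    """Two-subiteration thinning of a binary image (1 = foreground)."""
    img = [row[:] for row in image]
    changed = True
    while changed:
        changed = False
        for step in (0, 1):             # the two sub-iterations of one pass
            to_clear = []
            for x in range(1, len(img) - 1):
                for y in range(1, len(img[0]) - 1):
                    if img[x][y] != 1:
                        continue
                    p = neighbours(img, x, y)
                    b = sum(p)          # number of foreground neighbours
                    # number of 0 -> 1 transitions around the ring p2..p9, p2
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    if step == 0:
                        cond = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        cond = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_clear.append((x, y))
            for x, y in to_clear:       # delete marked pixels after the scan
                img[x][y] = 0
                changed = True
    return img
```

Applied to a thick bar of foreground pixels, the procedure peels boundary pixels from alternating sides until a one-pixel-wide skeleton remains.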
The choice of placing a grid on the image follows one principle: mesh segmentation should not fall on a concentrated position of traffic flow. Since there is no endpoint, the midpoint between two adjacent values was considered a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. The digital data for the traffic map serve as the initial data for the subsequent clustering process.
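The gridding step described above, summing the traffic flow that falls in each mesh cell, can be sketched as follows. The point records of the form (x, y, volume) and the fixed square cell size are assumptions for illustration; the names are not from the paper.

```python
def grid_traffic(records, cell_size, width, height):
    """Aggregate point traffic volumes into a regular grid of square cells."""
    nx = (width + cell_size - 1) // cell_size   # cells per axis; last may be partial
    ny = (height + cell_size - 1) // cell_size
    grid = [[0] * ny for _ in range(nx)]
    for x, y, volume in records:
        grid[x // cell_size][y // cell_size] += volume  # drop point into its cell
    return grid
```

Each cell of the returned grid then plays the role of one (x, y, v) tuple fed to the clustering methods.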
5.2. Comparison Results of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum number of iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,
Table 4: Comparison of log-likelihood for the first dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     -12.41868  -14.07265  -13.28599  -11.9533   -12.49562
Raster database     -13.42238  -15.02863  -13.78889  -12.9632   -13.39769
RasterP (16 grids)  -12.62264  -14.02266  -12.48583  -12.39419  -12.44993
RasterP (25 grids)  -12.41868  -13.19417  -11.22207  -12.48201  -11.62048
Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.
Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.
Table 5: Comparison of running time (in seconds) for the second dataset.

Formats             KM     HC     DBScan  XM     EM     LP
Vector database     1.39   1.34   15.53   1.53   10.05  3.37
Raster database     2.41   14.78  18.34   2.17   8.23   1.96
RasterP (16 grids)  0.47   8.01   12.74   0.45   3.77   1.44
RasterP (25 grids)  0.35   6.20   10.98   0.36   2.96   1.18
the weights for the corresponding three attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)) could be varied (fine-tuned), with the sum of the weights equal to 1. We tested several variations in search of the best clustering results: (1) the weight of v is 20%; (2) the weight of v is 40%; (3) the weight of v is 50%; (4) the weight of v is 60%; (5) the weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) the weight of v is 0; (8) the same weights except when g_i(v_i = 0); and (9) the weights of x and y are both 0 except when g_i(v_i = 0).
In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted to measure raw numeric data, while the other two options, Jaccard's coefficient and the matching coefficient, are activated only when the data are binary.
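The three similarity options just mentioned (Euclidean distance for raw numeric data, Jaccard's coefficient and the matching coefficient for binary data) can be written out explicitly; a minimal sketch:

```python
import math

def euclidean(a, b):
    """Euclidean distance for raw numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard(a, b):
    """Jaccard's coefficient for binary vectors: 1-1 matches over non-(0,0) pairs."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 1.0

def matching(a, b):
    """Simple matching coefficient: agreeing positions (including 0-0) over all."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

The two binary coefficients differ only in whether joint absences (0-0 pairs) count as agreement, which is why they apply only to binary data.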
For the above nine cases, the results of cases (1) to (6) are similar within their respective methods, and the result of case (9) is the worst, failing to accomplish any clustering. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.
Regarding the distribution of clusters in the result of the KM clustering method, more than half of the data points are clamped into one oversized cluster; the result of this method is therefore not helpful for further operation. With the HC method, data on average are allocated into separate clusters. The result
Table 6: Comparison of log-likelihood for the second dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     -17.35412  -19.62367  -17.53576  -17.21513  -16.57263
Raster database     -18.15926  -20.12568  -19.70756  -18.15791  -18.48209
RasterP (16 grids)  -15.51437  -17.24736  -16.37147  -17.01283  -15.66231
RasterP (25 grids)  -14.84761  -16.63789  -15.09146  -16.67312  -16.47823
Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%; top half uses the KM clustering method and bottom half uses the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%; top half uses KM and bottom half uses HC. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0; top half uses KM and bottom half uses HC. (d) Clustering results for the first dataset with setting case (8), where all attributes share the same weight except g_i(v_i = 0); top half uses KM and bottom half uses HC.
Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size      KM     HC      DBScan  XM    EM      LP
100 grid cells    0.06   0.07    1.05    2.19  3.21    0.18
4600 grid cells   0.42   2.95    39.89   2.73  19.05   9.37
10000 grid cells  2.62   46.67   97.55   2.97  37.85   24.21
80000 grid cells  19.75  189.61  684     6.47  198.31  90.83
in Figure 10(c) is the best, being the only one with distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; therefore, allocating a critical resource per cluster, for example, may result in a waste of resources. The degree of overlap is least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.
The clustering results of the second dataset obtained by using the two methods, KM and HC, are shown in Figure 11.
From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than in the first dataset. There is no overlap phenomenon in the KM results; this is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two clustering methods for the sake of even cluster distribution and overlap avoidance.
5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen at five. The
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.12436   0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.3043    0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469     0.711025
Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.
result for the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result, and the second part (ii) visualizes the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and then the groups are visualized over the clustering results according to the computed centers.
In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This result reveals that the technique cannot organize a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups exhibit the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters seem similar to each other, and there is no overlap in the clustering result; in the group result, however, the groups in (d) have far more overlaps than those in (b). Overlap means some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application, such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (d) XM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.
With the same experiment setup and operating environment, the spatial clustering experiments are performed over the second dataset. The results for the second dataset are shown
in Figure 13, where (i) represents the spatial clustering result and (ii) the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.
In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM, however, avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we remove the empty cells at the boundary to reduce the size of the dataset; the clustering result is perfect: there is no overlap, and the clusters are balanced with each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the same size for all groups. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
By visually comparing the results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those in the first dataset.
Based on the results generated from the clustering and LP experiments, some statistical information about dataset 2 is collected and shown in Table 2. The numeric results in
Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.
Table 2 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the number of cells covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.
6. Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups produced by clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion, using the same software on the same computer. Balance measures the sizes of the groups: if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of traffic volumes covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial groups on dataset 2 by the LP method.
the traffic volumes covered by all the clusters minus the overlap, if any. The corresponding definitions are shown in the equations below:

Density(cluster i) = Sum of Traffic Volumes(cluster i) / Grid Cell Number(cluster i),

Coverage(cluster i) = Sum of Traffic Volumes(cluster i) / Sum of Traffic Volumes,

Total Coverage = Sum of Traffic Volumes - Overlaps,

Proportion of Cluster(i) Size (Balance) = Grid Cell Number(cluster i) / Sum of Grid Cell Number. (4)
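The definitions in (4) translate directly into code; a small sketch, with each cluster represented by the list of its per-cell traffic volumes:

```python
def density(cluster_volumes):
    """Average traffic volume per grid cell in one cluster."""
    return sum(cluster_volumes) / len(cluster_volumes)

def coverage(cluster_volumes, total_volume):
    """Share of the whole dataset's traffic volume captured by one cluster."""
    return sum(cluster_volumes) / total_volume

def balance(cluster_sizes):
    """Proportion of grid cells held by each cluster (sums to 1)."""
    total = sum(cluster_sizes)
    return [s / total for s in cluster_sizes]
```

A cluster of cells holding volumes [2, 4, 6] in a dataset of total volume 24 thus has density 4 and coverage 0.5, and the balance of cluster sizes [1, 1, 2] is [0.25, 0.25, 0.5].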
6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of the data on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighboring cells of the grid are merged into a single unit; and RasterP (25 grids) means every five neighboring cells are merged into one. In the latter two formats, the data information is laid directly on a grid, and noises such as outlier values are eliminated from the grid. We selected grids of sizes 16 and 25 for these two formats. The original datasets are then encoded in the four different data formats. The four formatted datasets are subjected to the five clustering methods and the LP method, and we measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively.
According to Table 3, KM spent the least running time across the four different kinds of data, and the RasterP (25 grids) dataset was processed the fastest. Conversely, clustering of the vector dataset using the DBScan method took the longest running time. Among the clustering methods, KM spent the least time on the different datasets and DBScan the longest.
In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using DBScan is the worst.
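As a rough stand-in for Weka's goodness-of-fit measure, the average log-likelihood of a one-dimensional Gaussian fitted to each cluster can be computed as follows. This is a simplification for illustration: Weka's EM evaluates full mixture densities, so the absolute numbers would differ, but tighter clusters still score higher (less negative).

```python
import math

def gaussian_log_likelihood(clusters):
    """Average per-point log-likelihood under one Gaussian fitted per cluster."""
    total, n = 0.0, 0
    for points in clusters:
        mu = sum(points) / len(points)
        var = sum((p - mu) ** 2 for p in points) / len(points)
        var = max(var, 1e-6)            # guard against degenerate (constant) clusters
        for p in points:
            total += -0.5 * (math.log(2 * math.pi * var) + (p - mu) ** 2 / var)
            n += 1
    return total / n
```

Comparing a tightly concentrated pair of clusters with a loose pair shows the expected ordering of the scores.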
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset to larger sizes by expanding the data map via duplication. Running time trends are thereby produced; the results are shown in Table 7, and the corresponding trend lines are shown in Figure 14.
According to Table 5, KM spent the shortest running time for the four different formats of data, and the RasterP (25 grids) dataset was processed the fastest, which is expected because it abstracts every 25 cells into one. On
Figure 14: Comparison of running time (in seconds) for different sizes of dataset, with exponential trend lines fitted for K-means, Hierarchical, DBScan, XMean, EM, and LP.
the other hand, clustering of the Raster dataset using the DBScan method took the most running time. Across the six methods, KM spent the shortest time on the different datasets and DBScan generally the longest.
In Table 6, we can see that the log-likelihood values of the different methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using KM is the worst.
In Table 7, we can see that the slowest is DBScan and the quickest is the KM method. In terms of time trend, DBScan's time consumption increases in larger magnitude than the other methods', while the time trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.
From Figure 15, we can see that one cluster of DBScan dominates the biggest coverage among all clusters resulting from the six methods on the first dataset, but for the second dataset, the LP method yields the biggest-coverage cluster. Generally, the individual coverage of each cluster in the second dataset is apparently larger than that in the first dataset (Tables 8 and 9). This means that the second dataset is suitable for achieving spatial groups with the six methods due to its even data distribution. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a), we can see that one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, but the LP method obtains the largest total density, drawn evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than that in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset: DBScan has the advantage of merging scattered data into dense groups, as long as the data are well scattered.
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell numbers across the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the other factors are of less concern, the coverage weight ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be made larger than the others. Overall, G_net, the sum of all performance factors multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered.

Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density         KM        EM        DBScan    XM        HC        LP
Cluster 0       5258648   0080823   4426289   3431892   2713810   1677869
Cluster 1       1161390   2329182   0994949   1375497   3501739   1296230
Cluster 2       7186556   2545750   0807500   1218667   2728017   9703279
Cluster 3       2572683   1232386   1062069   5171040   4265905   9034426
Cluster 4       5969350   142054    0170455   1510576   4088438   1239180
Total density   1204343   1400359   4729787   1146972   1030703   6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density         KM        XM           EM           DBScan       HC           LP
Cluster 0       1925445   2476642081   396813638    1972394643   5323785326   331318
Cluster 1       1972395   1763496208   1502698729   1972394643   2140482869   166788
Cluster 2       1408149   106489095    1629795665   1437189548   1823821619   8097989
Cluster 3       3060449   6293956697   2015105986   1636350955   79912225     2474492
Cluster 4       1773937   1058346213   1275299493   1212317249   6856982634   156958
Total density   3896873   3486653421   6819713511   8230647036   5981503534   5440447
G_l = |Likelihood / Time|,  (5)

G_b = Difference of Balance / Time,  (6)

G_d = Density / Time,  (7)

G_c = Coverage / Time,  (8)

G_o = Overlap / Time,  (9)

G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o,  (10)

Constraint: ω_l + ω_b + ω_d + ω_c + ω_o = 1.  (11)
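The weighted-utility computation in (10)-(11), together with the base-1 normalization the paper applies to produce Table 13, can be sketched as follows; the indicator values below are invented for illustration, not taken from the experiments.

```python
def g_net(factors, weights):
    """G_net per (10): sum of indicator values times their weights.

    Both arguments map factor names to numbers; per (11) the weights sum to 1.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[f] * factors[f] for f in factors)

def normalize(scores):
    """Scale so the lowest G_net becomes the base value 1 (as done for Table 13)."""
    base = min(scores.values())
    return {m: round(s / base, 2) for m, s in scores.items()}

# Illustrative (made-up) indicator values for two methods, with equal weights:
weights = {"l": 0.2, "b": 0.2, "d": 0.2, "c": 0.2, "o": 0.2}
factors_km = {"l": 0.5, "b": 0.4, "d": 0.6, "c": 0.55, "o": 1.0}
factors_lp = {"l": 0.5, "b": 1.0, "d": 0.7, "c": 0.71, "o": 1.0}
scores = {"KM": g_net(factors_km, weights), "LP": g_net(factors_lp, weights)}
print(normalize(scores))  # → {'KM': 1.0, 'LP': 1.28}
```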
From the results of the spatial grouping experiments in the previous sections, we obtain statistical information on each group based on the second dataset, expressed as the indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best run time and no overlap, while XM, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance among clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed for obtaining the net performance value G_net of each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This was tested across different datasets, formats, and dataset sizes. However, for density and log-likelihood the results are less consistent, as LP is outperformed by DBScan at times. Finally, by the net result G_net, LP is the better choice under the overall consideration of the six performance factors. The weights, which imply priorities or preferences among the performance aspects, should be chosen at the user's discretion.
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density   Time    Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3896873   0.41    -17.35           No        190
XM       0.533037   3486653   0.67    -17.22           No        185
EM       0.507794   6819714   1.23    -16.57           Yes       1216
DBScan   0.461531   8230647   15.67   -17.54           Yes       2517
HC       0.677124   5981504   14.78   -20.13           Yes       103
LP       0.711025   5440447   7.76    N/A              No        0

Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, with certain sizes and positions, using clustering algorithms or their equivalent, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by using different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, there has been no study reported in the literature, that the authors are aware of, using the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups yielding maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to further enhance the algorithm, for example by combining the LP method with existing spatial grouping algorithms to form new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. Ideally, the new fusion algorithms to be developed would let the advantages of one algorithm carry over to the others.
References
[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51-58, 2000.
[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.
[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.
[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.
[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.
[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303-312, 2003.
[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016-1032, 2006.
[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127-143, 2002.
[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1-17, 2008.
[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188-1205, 2006.
[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691-699, 2004.
[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332-338, 1998.
[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.
[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214-1219, August 2004.
[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405-408, Hong Kong, China, December 2005.
[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1-7, July 2010.
[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011-1015, 1998.
[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58-65, 1998.
[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.
[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112-120, 1999.
[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507-516, 2001.
[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996-3001, San Antonio, Tex, USA, October 2009.
[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433-439, 1999.
[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727-734, 2000.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226-231, 1996.
[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011-1015, 1998.
[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236-244, 1963.
[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281-286, Pisa, Italy, September 2006.
Figure 7: Illustration of possible ways of assigning clusters for maximum (a) fish population, (b) altitude of terrain, and (c) human inhabitant population.
[25]. The notions of density reachability and density connectivity are used as performance indicators for the quality of clustering [26]. A cluster is composed of the group of objects in a dataset that are density-connected to a particular center. Any object that falls beyond a cluster is considered noise.
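For illustration, density reachability can be seen in a compact, unoptimized DBSCAN sketch; this is a toy implementation for exposition, not the Weka version used in the experiments. `eps` is the neighborhood radius and `min_pts` the core-point threshold.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise.

    A point is a core point if at least min_pts points (itself included) lie
    within distance eps; clusters grow by density reachability from cores.
    """
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
        if len(seeds) < min_pts:
            labels[i] = -1              # noise (may later join a cluster as a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point: density-connected, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neighbours = [k for k in range(len(points)) if dist(points[j], points[k]) <= eps]
            if len(neighbours) >= min_pts:
                queue.extend(neighbours)  # j is a core point; keep expanding
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=3))  # → [0, 0, 0, 1, 1, 1, -1]
```

The two tight triangles form two clusters; the isolated point (5, 5) has no dense neighborhood and is labeled noise.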
Ward proposed a clustering method called hierarchical clustering (HC) in 1963 [27]. It seeks the succession of partitions P_n, P_{n-1}, ..., P_1 that minimizes the loss associated with each grouping. In each analysis step, it considers every possible pair of clusters and combines the two clusters whose merger yields the smallest "information loss", which Ward defined in terms of ESS (an error sum-of-squares criterion). The idea behind Ward's proposal can be described most simply with a small univariate example. Take ten objects with scores (2, 7, 6, 6, 7, 2, 2, 0, 2, 0). The loss of information incurred by treating the ten scores as one unit, with mean 3.4, is

ESS_One group = (2 - 3.4)^2 + (7 - 3.4)^2 + ... + (0 - 3.4)^2 = 70.4.

However, those 10 objects can also be separated into four groups according to their scores: {0, 0}, {2, 2, 2, 2}, {6, 6}, and {7, 7}. Evaluating the ESS within each of these four groups gives four independent error sums of squares, each of which is zero because the members of each group are identical. Overall, dividing the 10 objects into these 4 clusters therefore incurs no loss of information:

ESS_Four groups = ESS_group1 + ESS_group2 + ESS_group3 + ESS_group4 = 0.  (2)
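The ESS bookkeeping in this example can be checked numerically; the small sketch below mirrors the ten scores and the four homogeneous groups above.

```python
def ess(scores):
    """Error sum of squares: squared deviations from the group mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

scores = [2, 7, 6, 6, 7, 2, 2, 0, 2, 0]
groups = [[0, 0], [2, 2, 2, 2], [6, 6], [7, 7]]

one_group = ess(scores)                    # 70.4: all ten scores treated as one unit
four_groups = sum(ess(g) for g in groups)  # 0.0: homogeneous groups lose no information
```

Ward's method repeats exactly this comparison at every merge step, choosing the pair of clusters whose union increases the total ESS the least.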
The last method we adopted here is linear programming (LP), which formulates and solves optimization problems with linear objective functions and linear constraints. This powerful tool can be used in many fields, especially where many candidate answers are possible. In spatial grouping over a large grid, many possible combinations of positioning the clusters exist. The problem here is to find a certain number of clusters of equal size over the area; meanwhile, the chosen centers of the clusters must be sufficiently distant from each other so as to avoid overlapping. As an example, shown in Figure 7, three clusters would have to be assigned over a spatial area in a way that they cover certain resources. The assignment of the clusters, however, would have to yield a maximum total value summed from the covered resources. In the example, the left diagram shows allocating three clusters over the deep water, assuming that the resources are fish, hence maximizing the harvest. The second example, in the middle of Figure 7, clusters the high-altitude parts of the area. The last example tries to cover the maximum number of human inhabitants, who are concentrated at the coves. Given the many possible ways of setting up these clusters, LP is used to formulate this allocation problem with the objective of maximizing the value of the covered resources.
Assuming that the resources could be dynamic, for example, animal herds or moving targets whose positions may swarm and change over time, the optimization is a typical maximal flow problem (or max-flow problem). It is a type of network flow problem in which the goal is to determine the maximum amount of flow that can occur over an arc, which is limited by some capacity restriction. This type of network might be used to model the flow of oil in a pipeline (in which the amount of oil that can flow through a pipe in a unit of time is limited by the diameter of the pipe). Traffic engineers also use this type of network to determine the maximum number of cars that can travel through a collection of streets with different capacities imposed by the number of lanes and the speed limits [28].
For our spatial clustering, we consider each cell of the grid as a node; each node is defined as a tuple m that contains the coordinates and the value of the resource held in the node, such that m(x_i, y_i, z_i) represents the ith node, in which x_i, y_i represent the position and z_i represents the value of the resource in the node, respectively. For the clusters, each node
(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)   Repeat (through all coordinates of y)
(4)     If (boundary constraints and overlapping constraints are satisfied) Then
(5)       S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)     End-if
(7)   End-loop
(8) End-loop
(9) If size-of(S) >= K
(10)   Find top K clusters where max Σ z_i ⊕ C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i), ∀ i ∈ C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i), ∀ i
(13) End-if

Pseudocode 1: Pseudocode of the proposed LP model for spatial clustering.
can potentially be a center of a cluster, and the cluster has a fixed radius of length r. The LP model for our problem is mathematically stated as follows:

Total value = ⋃_{selected clusters ⟨C_k | k=1,...,K⟩} Σ_{m_i ∈ C_k} m_i(∗, ∗, z_i)
            = argmax_{X,Y} Σ_{0 ≤ x_i ≤ X, 0 ≤ y_j ≤ Y} Σ_{k=1}^{K} z_l ∋ m_l(x_i, y_j, z) ⊕ c_k,  (3)

subject to the boundary constraints 2r ≤ |x_i − x_j| and 2r ≤ |y_i − y_j| for all i and j with i ≠ j, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively; K is the maximum number of clusters; and c_k is the kth cluster under consideration in the optimization.

In order to implement the computation depicted in (3), for each node we sum the group resources in the shape of a diamond (which geometrically approximates a circle). Iterating through every combination of K nodes in the grid of size X by Y, each current node in the combination is tested by considering it as the center of a cluster of radius r, storing the resource values of the nodes from the potential clusters into a temporary array buffer A(∗, ∗, z_i). The results from those potential clusters that
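The search described above can be sketched in a few lines of Python; this is a minimal exhaustive-search reading of Pseudocode 1, assuming diamond-shaped (Manhattan-ball) clusters and interpreting non-overlap as centers lying more than 2r apart in Manhattan distance. The toy grid and its values are invented for illustration.

```python
from itertools import combinations

def diamond_value(grid, cx, cy, r):
    """Sum the resource values inside a diamond (Manhattan ball) of radius r at (cx, cy)."""
    rows, cols = len(grid), len(grid[0])
    return sum(grid[x][y]
               for x in range(max(0, cx - r), min(rows, cx + r + 1))
               for y in range(max(0, cy - r), min(cols, cy + r + 1))
               if abs(x - cx) + abs(y - cy) <= r)

def best_k_clusters(grid, k, r):
    """Exhaustively choose k non-overlapping diamond clusters of maximum total value."""
    rows, cols = len(grid), len(grid[0])
    centers = [(x, y) for x in range(rows) for y in range(cols)]
    best_total, best_combo = float("-inf"), None
    for combo in combinations(centers, k):
        # Two radius-r diamonds are disjoint iff their centers are more than
        # 2r apart in Manhattan distance (our reading of the 2r <= |x_i - x_j| constraint).
        if any(abs(ax - bx) + abs(ay - by) <= 2 * r
               for (ax, ay), (bx, by) in combinations(combo, 2)):
            continue
        total = sum(diamond_value(grid, x, y, r) for x, y in combo)
        if total > best_total:
            best_total, best_combo = total, combo
    return best_combo, best_total

# Toy 5x5 grid with three isolated resource spots worth 9, 8, and 5:
grid = [[0, 0, 0, 0, 0],
        [0, 9, 0, 0, 0],
        [0, 0, 0, 0, 8],
        [0, 0, 0, 0, 0],
        [5, 0, 0, 0, 0]]
centers, total = best_k_clusters(grid, k=2, r=1)
print(total)  # 17: one diamond covers the 9, another the 8
```

The brute-force enumeration makes the combinatorial cost of the problem explicit: the number of candidate center combinations grows as C(XY, K), which is why the experiments later report much longer LP run times on larger grids.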
5. Experimental Results and Analysis

In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place. The resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point
Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                      Bwmorph function          Thinning algorithm
                      Dataset 1   Dataset 2     Dataset 1   Dataset 2
Degree of thinning    Incomplete                Complete
Elapsed time (secs)   20          38            100         198
Complexity            O(n)                      O(n^2)
of the roads, whereby a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation; for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.
5.1. Data Preprocessing. Two different factual datasets are used for the experiments. The first dataset, published by the Maricopa Association of Governments in 2008, is a traffic volume map. Traffic volumes were derived from the national traffic recording devices, with seasonal variation factored into the volumes. The second dataset is an annual average daily traffic map, the Baltimore County Traffic Volume Map of 2011, USA, prepared by the Maryland Department of Transportation and published by March 19, 2012. Its traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.

After using skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites where the traffic volume data associated with our two datasets can be seen: (a) the representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (httpphoenixgovstreetstrafficvolume-map), and (b) the representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (httpwwwmarylandroadscomTraffic Volume MapsTraffic Volume Mapspdf). The corresponding result of skeleton extraction
Table 2: Important statistics from the clustering and LP experiments.

Method   Cluster     Number of cells covered   Minimum   Maximum    Overlap
KM       Cluster 1   428                       0         3499327    0
KM       Cluster 2   468                       0         546896     0
KM       Cluster 3   448                       0         20503007   0
KM       Cluster 4   614                       0         6894667    0
KM       Cluster 5   618                       0         900908     0
XM       Cluster 1   615                       0         591265     0
XM       Cluster 2   457                       0         546896     0
XM       Cluster 3   609                       0         900908     0
XM       Cluster 4   465                       0         3499327    0
XM       Cluster 5   430                       0         20503007   0
EM       Cluster 1   1223                      0         2292       61817229
EM       Cluster 2   7                         141048    243705     313018
EM       Cluster 3   81                        0         3033733    131146577
EM       Cluster 4   64                        26752     546896     330881249
EM       Cluster 5   1201                      0         1300026    217950471
DB       Cluster 1   13                        23614     33146      327222911
DB       Cluster 2   11                        1686825   21001      363965818
DB       Cluster 3   13                        178888    2945283    196118393
DB       Cluster 4   11                        847733    211008     58940877
DB       Cluster 5   2528                      0         546896     20554176
HC       Cluster 1   291                       0         3499327    0
HC       Cluster 2   191                       0         20503007   96762283
HC       Cluster 3   294                       0         1590971    0
HC       Cluster 4   224                       0         189812     12673555
HC       Cluster 5   243                       0         546896     0
LP       Cluster 1   221                       0         3499327    0
LP       Cluster 2   221                       0         20503007   0
LP       Cluster 3   221                       0         1590971    0
LP       Cluster 4   221                       0         189812     0
LP       Cluster 5   221                       0         546896     0
Table 3: Comparison of running time (in seconds) for the first dataset.

Formats              KM     HC      DBScan   XM     EM     LP
Vector database      3.27   12.52   23.24    2.78   9.30   1.83
Raster database      3.42   15.36   28.20    2.84   9.84   2.01
RasterP (16 grids)   1.98   1.34    5.08     0.46   0.57   0.78
RasterP (25 grids)   0.09   0.14    1.15     0.21   0.12   0.53
in dataset 1 is shown in Figure 8, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm, respectively. Likewise, the corresponding result of skeleton extraction for the second dataset is shown in Figure 9, where (a) adopts a morphological operation method and (b) adopts the thinning algorithm, respectively. The comparison result for the two datasets is shown in Table 1.

For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.

The results of the skeleton extraction shown in Figures 8(b) and 9(b) are clearer and more useful for the subsequent processing, from which the clustering by grid can be readily obtained. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.

The choice of placing a grid on the image follows one principle: mesh boundaries should not fall on concentrated positions of traffic flow. Since there is no endpoint, the midpoint between two adjacent values is taken as a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. The digital data for the traffic map serve as the initial data for the subsequent clustering process.
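The per-cell aggregation of traffic readings described above can be sketched as follows; the cell size and the sample readings are illustrative, not taken from the datasets.

```python
from collections import defaultdict

def grid_aggregate(points, cell_w, cell_h):
    """Sum the traffic volume v of every (x, y, v) reading falling in each grid cell."""
    cells = defaultdict(float)
    for x, y, v in points:
        # Integer division maps a coordinate to its grid cell index.
        cells[(int(x // cell_w), int(y // cell_h))] += v
    return dict(cells)

readings = [(0.5, 0.5, 120.0), (0.8, 0.2, 80.0), (1.5, 0.4, 60.0)]
print(grid_aggregate(readings, 1.0, 1.0))  # → {(0, 0): 200.0, (1, 0): 60.0}
```

The resulting cell-indexed dictionary plays the role of the Excel sheet in the text: one aggregated traffic value per grid cell, ready to feed into the clustering step.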
5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, with the number of clusters set at five and the maximum iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,
Table 4: Comparison of log-likelihood for the first dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -12.41868   -14.07265   -13.28599   -11.9533    -12.49562
Raster database      -13.42238   -15.02863   -13.78889   -12.9632    -13.39769
RasterP (16 grids)   -12.62264   -14.02266   -12.48583   -12.39419   -12.44993
RasterP (25 grids)   -12.41868   -13.19417   -11.22207   -12.48201   -11.62048
Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.
Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.
Table 5: Comparison of running time (in seconds) for the second dataset.

Formats              KM     HC      DBScan   XM     EM      LP
Vector database      1.39   1.34    15.53    1.53   10.05   3.37
Raster database      2.41   14.78   18.34    2.17   8.23    1.96
RasterP (16 grids)   0.47   8.01    12.74    0.45   3.77    1.44
RasterP (25 grids)   0.35   6.20    10.98    0.36   2.96    1.18
the weights for the three attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)) can be varied (fine-tuned), with the constraint that the weights sum to 1. We tested several variations in search of the best clustering results: (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) weight of v is 0; (8) same weight except when g_i(v_i = 0); and (9) weights of x and y are both 0 except when g_i(v_i = 0).

In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted for raw numeric data, while the other two options, Jaccard's coefficients and the matching coefficient, are activated only when the data are binary.

Among the above nine cases, the results of cases (1) to (6) are similar within their respective methods, and the result of case (9) is the worst, accomplishing no clustering at all. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.
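The attribute weighting behind these cases can be reproduced with min-max normalization followed by a weighted distance; the helpers below are a hypothetical sketch of that scheme, not XLMiner's internal formula, and the sample cells are invented.

```python
def min_max_normalize(grids):
    """Scale each attribute of (x, y, v) tuples to [0, 1] independently."""
    cols = list(zip(*grids))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((g[a] - lo[a]) / (hi[a] - lo[a]) if hi[a] > lo[a] else 0.0
                  for a in range(3)) for g in grids]

def weighted_dist2(a, b, w):
    """Squared Euclidean distance with attribute weights w = (w_x, w_y, w_v), sum(w) == 1."""
    return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b))

cells = [(0, 0, 0), (10, 10, 100), (5, 5, 50)]
norm = min_max_normalize(cells)
# Case "weight of v is 60%" from the text corresponds to w = (0.2, 0.2, 0.6):
d = weighted_dist2(norm[0], norm[1], (0.2, 0.2, 0.6))
```

Setting w_v = 0 reduces the distance to a purely positional one (case (7)), while w_x = w_y = 0 collapses all positions and clusters by traffic value alone, which matches the degenerate behavior reported for case (9).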
In the cluster distribution produced by the KM clustering method, more than half of the data points are clamped into one oversized cluster; the result of this method is therefore not helpful for further operation. In the HC method, data are on average allocated into separate clusters. The result
Table 6: Comparison of log-likelihood for the second dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      -17.35412   -19.62367   -17.53576   -17.21513   -16.57263
Raster database      -18.15926   -20.12568   -19.70756   -18.15791   -18.48209
RasterP (16 grids)   -15.51437   -17.24736   -16.37147   -17.01283   -15.66231
RasterP (25 grids)   -14.84761   -16.63789   -15.09146   -16.67312   -16.47823
Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%; the top half uses the KM clustering method and the bottom half uses the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%; top half KM, bottom half HC. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0; top half KM, bottom half HC. (d) Clustering results for the first dataset with setting case (8), where all attributes share the same weight except g_i(v_i = 0); top half KM, bottom half HC.
Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size       KM      HC       DBScan   XM     EM       LP
100 grid cells     0.06    0.07     1.05     2.19   3.21     0.18
4600 grid cells    0.42    2.95     39.89    2.73   19.05    9.37
10000 grid cells   2.62    46.67    97.55    2.97   37.85    24.21
80000 grid cells   19.75   189.61   684      6.47   198.31   90.83
in Figure 10(c) is the best showing only the one with distinctposition attributes (119909 and 119910) The other three results (Figures10(a) 10(b) and 10(d)) are stained with cluster overlapsTherefore allocation of critical resource for example in eachcluster may result in a waste of resources The degree ofoverlap is the least in the result of Figure 10(b) If only locationis being considered the result of Figure 10(c) is the bestchoice Otherwise the result in Figure 10(b) is better than theother two for the sake of cluster distribution
The clustering performance results of the second dataset by using the two methods, KM and HC, are shown in Figure 11.
From the results of the cluster distribution of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than that of the first dataset. And there is no overlap phenomenon in the KM results. This is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better of the two clustering methods in consideration of even cluster distribution and overlap avoidance.
5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen at five. The
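The experiments above use Weka's implementations; purely as an illustrative sketch, a comparable setup can be reproduced with scikit-learn analogues (KMeans for KM, GaussianMixture for EM, AgglomerativeClustering for HC, and DBSCAN), clustering synthetic Raster (x, y, v) records into five groups. XM has no direct scikit-learn counterpart and is omitted here; the data and parameters are hypothetical, not those of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Hypothetical Raster (x, y, v) records: grid coordinates plus a traffic volume.
rng = np.random.default_rng(0)
data = np.column_stack([rng.uniform(0, 100, 500),   # x coordinate
                        rng.uniform(0, 100, 500),   # y coordinate
                        rng.exponential(50, 500)])  # traffic volume v

K = 5  # number of clusters, fixed at five as in the experiments
models = {
    "KM": KMeans(n_clusters=K, n_init=10, random_state=0),
    "EM": GaussianMixture(n_components=K, random_state=0),
    "HC": AgglomerativeClustering(n_clusters=K),
    "DBScan": DBSCAN(eps=8.0, min_samples=5),  # density-based: K is not fixed
}
labels = {name: m.fit_predict(data) for name, m in models.items()}
for name, lab in labels.items():
    print(name, "clusters found:", len(set(lab) - {-1}))  # -1 = DBSCAN noise
```

Note that overlap cannot occur at this cell-assignment level (each cell receives one label); the overlaps discussed in the text arise later, when fixed-radius groups are drawn around the computed cluster centers.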
International Journal of Distributed Sensor Networks 11
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.124360  0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.304300  0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469000  0.711025
Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.
result of the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result; the second part (ii) visualizes the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and then the groups are visualized over the clustering results according to the computed centers.
In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. Thus, this result reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups have the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters seem similar to each other, and there is no overlap in the clustering result; but for the group result, the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of applications such as information retrieval (several thematics for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (d) XM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.
With the same experiment setup and operating environment, the spatial clustering experiments are performed over the second dataset. The results of the second dataset are shown in Figure 13, where (i) represents the spatial clustering result and (ii) represents the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.
In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM, however, avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we removed the empty cells at the boundary to reduce the size of the dataset; the clustering result is perfect: there is no overlap and the clusters are balanced between each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the same size for all groups. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
By visually comparing the clustering results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. Occurrence of overlaps in spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those in the first dataset.
Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected and shown in Table 2. The numeric results in
Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.
Table 3 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the cell numbers covered by the clusters; also, the amount of overlap in HC is the highest of all. By the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.
6. Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups from clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion using the same software on the same computer. Balance measures the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests for goodness of fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of the traffic volumes covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial groups by the LP method on dataset 2.
the traffic volumes that are covered by all the clusters minus the overlap, if any. The corresponding definitions are shown in the equations below:
Density (cluster i) = Σ Traffic Volumes (cluster i) / Grid Cell Number (cluster i),

Coverage (cluster i) = Σ Traffic Volumes (cluster i) / Σ Traffic Volumes,

Total Coverage = Σ_i Coverage (cluster i) − Overlaps,

Proportion of Cluster (i) Size (Balance) = Grid Cell Number (cluster i) / Σ Grid Cell Number.
(4)
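The definitions in (4) can be sketched as follows, assuming the input is simply a list of per-cell traffic volumes and a parallel list of cluster labels (an illustrative helper, not the paper's toolchain):

```python
import numpy as np

def cluster_metrics(volumes, labels):
    """Per-cluster density, coverage, and balance in the spirit of (4).

    volumes: traffic volume of each grid cell
    labels:  cluster id of each grid cell
    """
    volumes = np.asarray(volumes, dtype=float)
    labels = np.asarray(labels)
    total_volume = volumes.sum()
    n_cells = len(volumes)
    metrics = {}
    for c in np.unique(labels):
        mask = labels == c
        metrics[int(c)] = {
            "density": volumes[mask].sum() / mask.sum(),     # volume per cell
            "coverage": volumes[mask].sum() / total_volume,  # share of all volume
            "balance": mask.sum() / n_cells,                 # share of all cells
        }
    return metrics

m = cluster_metrics([10, 20, 30, 40], [0, 0, 1, 1])
# cluster 0 covers (10 + 20) / 100 = 0.3 of the total volume
```

With hard (non-overlapping) labels like these, the per-cluster coverages sum to the total coverage, which matches the KM column of Table 8 (0.595751 with no overlap).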
6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of datasets on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighborhood cells over a grid are merged into a single unit; and RasterP (25 grids) means every five neighborhood cells over a grid are merged into one. In the latter two formats, the data information is straightforwardly laid on a grid, and some noises such as outlier values are eliminated from the grid. We selected grids of sizes 16 and 25 for these two formats. The original datasets are then encoded in the four different data formatting types. The four formatted datasets are subject to the five clustering methods and the LP method, and we measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively.
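The RasterP formats can be produced by block-aggregating the raster grid. The following is a generic sketch of merging neighborhood cells into coarser units (the function name and zero-padding behavior are my own assumptions):

```python
import numpy as np

def coarsen(grid, block):
    """Aggregate the traffic volumes of block x block neighbouring cells
    into one unit, in the spirit of the RasterP formats (e.g. 2x2 = 4 cells
    merged, 5x5 = 25 cells merged).  Pads the grid with zeros if its sides
    are not multiples of the block size."""
    h, w = grid.shape
    H, W = -(-h // block) * block, -(-w // block) * block  # round up
    padded = np.zeros((H, W))
    padded[:h, :w] = grid
    # Reshape into (rows, block, cols, block) and sum over each block.
    return padded.reshape(H // block, block, W // block, block).sum(axis=(1, 3))

g = np.arange(16, dtype=float).reshape(4, 4)
coarsen(g, 2)  # 2x2 blocks summed -> array([[10., 18.], [42., 50.]])
```

Summing (rather than averaging) preserves the total traffic volume, so the coverage definition in (4) is unchanged by the coarsening.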
According to Table 3, we can see that KM spent the least running time on all four kinds of data, and the runtime on the RasterP (25 grids) dataset is the fastest. Contrariwise, clustering of the Vector dataset using the DBScan method spent the longest running time. Among the clustering methods, KM spent the least time across the different datasets, and DBScan took the longest.
In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using DBScan is the worst.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset by expanding the data map via duplication. Running time trends are thereby produced; the results are shown in Table 7, and the corresponding trend lines are shown in Figure 14.
According to Table 5, we can see that KM spent the shortest running time for the four different formats of data, and the time on the RasterP (25 grids) dataset is the fastest, which is expected because it abstracts every 25 cells into one. On
Figure 14: Comparison of running time (in seconds) of different sizes of dataset.
the other hand, clustering of the Raster dataset using the DBScan method spent the most running time. Across the six methods, KM spent the shortest time for the different datasets, and DBScan generally spent the longest.
In Table 6, we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using KM is the worst.
In Table 7, we can see that the slowest is DBScan and the quickest is the KM method. In terms of the time trend, DBScan increases in a larger magnitude of time consumption than the other methods, whereas the time trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.
From Figure 15, we can see that one cluster of DBScan dominates the biggest coverage among all clusters resulting from the six methods in the first dataset, but for the second dataset, the LP method yields the biggest-coverage cluster. Generally, the individual coverage of each cluster in the second dataset is apparently larger than those resulting from the first dataset (Tables 8 and 9). This means that the second dataset is suitable for achieving spatial groups with the six methods due to its even data distribution. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method to determine spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
[Figure 17: pie charts of cluster-size proportions (%) for dataset 1 — (a) KM: 4/51/36/8/1; (b) XM: 1/50/1/18/30; (c) EM: 6/22/24/30/18; (d) DBScan: 24/24/17/20/15; (e) HC: 18/17/22/19/25; (f) LP: 20/20/20/20/20.]
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a), we can see that one cluster of EM occupies the biggest density among all clusters of the six methods in the first dataset, but the LP method obtains the largest total density evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than that of the first dataset (Tables 10 and 11). Again, it means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer in the first dataset, but DBScan achieves the best result in the second dataset; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.
[Figure 18: pie charts of cluster-size proportions (%) for dataset 2 — (a) KM: 17/18/17/24/24; (b) XM: 24/18/24/18/17; (c) EM: 47/0/3/2/47; (d) DBScan: 1/0/1/0/98; (e) HC: 23/15/24/18/20; (f) LP: 20/20/20/20/20.]
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method can achieve completely absolute balance for spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can be an individual measure to decide whether a method is good or not in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM        EM        DBScan    XM        HC        LP
Cluster 0      525.8648  0.080823  442.6289  343.1892  271.3810  167.7869
Cluster 1      116.1390  232.9182  0.994949  137.5497  350.1739  129.6230
Cluster 2      718.6556  254.5750  0.807500  121.8667  272.8017  970.3279
Cluster 3      257.2683  123.2386  106.2069  517.1040  426.5905  903.4426
Cluster 4      596.9350  1420.540  0.170455  151.0576  408.8438  123.9180
Total density  1204.343  1400.359  472.9787  1146.972  1030.703  608.7049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM        XM           EM           DBScan       HC           LP
Cluster 0      19254.45  24766.42081  3968.13638   19723.94643  53237.85326  3313.18
Cluster 1      19723.95  17634.96208  15026.98729  19723.94643  21404.82869  1667.88
Cluster 2      14081.49  1064.89095   16297.95665  14371.89548  18238.21619  80979.89
Cluster 3      30604.49  62939.56697  20151.05986  16363.50955  799.12225    24744.92
Cluster 4      17739.37  10583.46213  12752.99493  12123.17249  68569.82634  1569.58
Total density  38968.73  34866.53421  68197.13511  82306.47036  59815.03534  54404.47
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference of the grid cell number in each cluster. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if a very wide coverage is of priority and the others are of less concern, ω_c can take a relatively large value or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is by considering all the performance attributes:

G_l = |Likelihood / Time|,  (5)

G_b = Difference of Balance / Time,  (6)

G_d = Density / Time,  (7)

G_c = Coverage / Time,  (8)

G_o = Overlap / Time,  (9)

G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o,  (10)

subject to the constraint ω_l + ω_b + ω_d + ω_c + ω_o = 1.  (11)
From the results of the spatial grouping experiments in the previous sections, we obtain some statistical information on each group based on the second dataset, as the range of indicators depicted in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best running time and no overlap. XM, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance with the other clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed for obtaining the net performance values G_net, assuming equal weights for each method. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
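The mechanics of this computation can be sketched as below. The statistics dictionary is only a hypothetical stand-in shaped like Table 12 (LP's log-likelihood is N/A in the paper, so a placeholder value is assumed), and equal weights are used, so the resulting ratios illustrate the procedure rather than reproduce Table 13:

```python
# Hypothetical per-method statistics shaped like Table 12:
# (coverage, density, time, |log-likelihood|, overlap, diff. of balance).
# LP's log-likelihood is N/A in the paper; a placeholder value is assumed.
stats = {
    "KM": (0.5958, 38968.73, 0.41, 17.35, 0.0, 190.0),
    "LP": (0.7110, 54404.47, 7.76, 17.35, 0.0, 0.0),
}

def g_net(cov, dens, time, ll, overlap, bal_diff,
          w=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Net indicator of (10): weighted sum of the per-factor rates (5)-(9)."""
    wl, wb, wd, wc, wo = w
    assert abs(sum(w) - 1.0) < 1e-9            # constraint (11)
    g_l = abs(ll) / time                       # (5)
    g_b = bal_diff / time                      # (6)
    g_d = dens / time                          # (7)
    g_c = cov / time                           # (8)
    g_o = overlap / time                       # (9)
    return wl * g_l + wb * g_b + wd * g_d + wc * g_c + wo * g_o

raw = {m: g_net(*s) for m, s in stats.items()}
base = min(raw.values())
normalised = {m: v / base for m, v in raw.items()}  # lowest method scaled to 1
```

With equal weights and raw (unscaled) factor values, the density term dominates the sum; the paper's Table 13 presumably rescales or weights the factors differently before combining them.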
According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This is tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the results are not so consistent, as LP would be outperformed by DBScan at times. Finally, by the net result of G_net, LP is a better choice under the overall consideration of the six performance factors. The choice of weights, which imply priorities or preferences on the performance aspects, is left to the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently have spatial information. When they are viewed afar, the localizations of the data form densities spatially distributed over a terrain, and the collected data from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method  Coverage  Density   Time   Log-likelihood  Overlap  Diff. of balance
KM      0.595751  38968.73  0.41   -17.35          No       190
XM      0.533037  34866.53  0.67   -17.22          No       185
EM      0.507794  68197.14  1.23   -16.57          Yes      1216
DBScan  0.461531  82306.47  15.67  -17.54          Yes      2517
HC      0.677124  59815.04  14.78  -20.13          Yes      103
LP      0.711025  54404.47  7.76   N/A             No       0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32
purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, with certain sizes and positions, using clustering algorithms or the equivalent, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by using different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial clustering algorithms are discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors. Weights were also formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under the chosen factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, there has been no study reported in the literature, that the authors are aware of, using the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research, we implemented this new method (LP) to obtain spatial groups yielding maximum coverage and completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For a future extended study, we want to further enhance the algorithm, such as by combining the LP method with existing spatial grouping algorithms to achieve new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It will be good if the advantages of one algorithm can ride over the others in the new fusion algorithms to be developed.
(1) Load the grid-based spatial information into array A(x, y, z); A is a three-dimensional array
(2) Repeat (through all coordinates of x)
(3)    Repeat (through all coordinates of y)
(4)       If (boundary constraints and overlapping constraints are satisfied) Then
(5)          S(x_i, y_i, z_i) = A(x_i, y_i, z_i)
(6)       End-if
(7)    End-loop
(8) End-loop
(9) If size-of(S) >= K
(10)   Find top K clusters where max Σ z_i ⊕ C_k; copy S(x_i, y_i, z_i) to new array C(x_i, y_i, z_i), for all i ∈ C_k
(11) Else
(12)   C(x_i, y_i, z_i) = S(x_i, y_i, z_i), for all i
(13) End-if
Pseudocode 1 Pseudocode of the proposed LP model for spatial clustering
can potentially be a center of a cluster, and the cluster has a fixed radius of length r. The LP model for our problem is mathematically expressed as follows:

    Total value = ⋃_{selected clusters <C_k | k=1,...,K>} Σ_{m_i ∈ C_k} m_i(*, *, z_i)
                = argmax_{X,Y} Σ_{0<=x_i<=X, 0<=y_j<=Y} Σ_{k=1}^{K} z_l ∋ m_l(x_i, y_j, z) ⊕ c_k    (3)

subject to the boundary constraints 2r <= |x_i - x_j| and 2r <= |y_i - y_j| for all i and j with i ≠ j, where X is the maximum width and Y is the maximum length of the 2D spatial area, respectively; K is the maximum number of clusters; and c_k is the kth cluster under consideration in the optimization.

In order to implement the computation depicted in (3), for each node we sum the group resources within a shape of diamond (which geometrically approximates a circle). By iterating through every combination of K nodes in the grid of size X by Y, each current node in the combination is tested as the center of a cluster of radius r, and the resource values of the nodes of the potential cluster are stored in a temporary array buffer A(*, *, z_i). The results from those potential clusters that satisfy the boundary and nonoverlapping constraints are then copied to a candidate buffer S. Out of the clusters whose resource values are stored in the candidate buffer S, the combination of K clusters that has the greatest total resource value is selected, and their values are placed in the final buffer C. The corresponding pseudocode is shown in Pseudocode 1.
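The per-node computation just described can be sketched in code. The following is a minimal Python sketch, not the authors' implementation: it sums the resource values inside a diamond (L1 ball) of radius r around each candidate center, then greedily keeps the top-K candidates whose centers are at least 2r apart in L1 distance. The greedy selection and the L1 separation test are our own simplifying assumptions; the model above enumerates combinations of K centers exactly and uses per-axis separation constraints.

```python
def diamond_sum(grid, cx, cy, r):
    """Sum resource values within L1 distance r of (cx, cy)."""
    h, w = len(grid), len(grid[0])
    total = 0
    for x in range(max(0, cx - r), min(h, cx + r + 1)):
        for y in range(max(0, cy - r), min(w, cy + r + 1)):
            if abs(x - cx) + abs(y - cy) <= r:
                total += grid[x][y]
    return total

def top_k_clusters(grid, k, r):
    """Greedy variant of Pseudocode 1: rank every cell as a candidate
    center by the resource value its diamond covers, then keep the best
    K whose centers are >= 2r apart (the non-overlap constraint)."""
    h, w = len(grid), len(grid[0])
    candidates = sorted(
        ((diamond_sum(grid, x, y, r), x, y)
         for x in range(h) for y in range(w)),
        reverse=True)
    chosen = []
    for value, x, y in candidates:
        if all(abs(x - cx) + abs(y - cy) >= 2 * r for _, cx, cy in chosen):
            chosen.append((value, x, y))
            if len(chosen) == k:
                break
    return chosen
```

On a small grid this returns K well-separated diamond clusters together with their covered resource totals.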
5. Experimental Results and Analysis
In this section, the performance of the proposed methodology is shown by presenting both numerical and visualized results for all performance aspects over the various algorithms. A case study of road traffic is used in the experiment. The spatial area is a metropolitan traffic map with roads and streets spanning all over the place; the resource value in this case is the concentration or density of vehicle traffic flows. Sensors are assumed to have been deployed at every appropriate point of the roads; thereby, a typical traffic volume at each of these points is known. The optimization of spatial clustering in this case can be thought of as optimal resource allocation: for example, cost-effective police patrols, gas stations, or environment-pollution controls are needed among those dense traffic spots.

Table 1: Comparison between the Bwmorph function and the thinning algorithm.

                     Bwmorph function          Thinning algorithm
                     Dataset 1   Dataset 2     Dataset 1   Dataset 2
Degree of thinning   Incomplete                Complete
Elapsed time (secs)  20          38            100         198
Complexity           O(n)                      O(n^2)
5.1. Data Preprocessing. Two factual datasets are used in the experiments. The first dataset, published by the Maricopa Association of Governments in 2008, is a traffic volume map. Traffic volumes were derived from the national traffic recording devices, and seasonal variation is factored into the volumes. The second dataset is the annual average daily traffic of the Baltimore County Traffic Volume Map in 2011, USA, prepared by the Maryland Department of Transportation and published on March 19, 2012. The traffic count estimates are derived by taking 48-hour machine count data and applying factors from permanent count stations. The traffic counts represent the resource values in a general sense.
After applying skeleton extraction, a two-tone image was obtained from the original map. Readers are referred to the respective websites where they can see the traffic volume data associated with our two datasets: (a) representative traffic volume map of dataset 1, the Traffic Volume Map of Phoenix, AZ, USA (httpphoenixgovstreetstrafficvolume-map); (b) representative traffic volume map of dataset 2, the Traffic Volume Map of Baltimore, MD, USA (httpwwwmarylandroadscomTraffic Volume MapsTraffic VolumeMapspdf). The corresponding result of skeleton extraction
Table 2: Important statistics from the clustering and LP experiments.

Method  Cluster    Number of cells covered  Minimum   Maximum    Overlap
KM      Cluster 1  428                      0         3499327    0
        Cluster 2  468                      0         546896     0
        Cluster 3  448                      0         20503007   0
        Cluster 4  614                      0         6894667    0
        Cluster 5  618                      0         900908     0
XM      Cluster 1  615                      0         591265     0
        Cluster 2  457                      0         546896     0
        Cluster 3  609                      0         900908     0
        Cluster 4  465                      0         3499327    0
        Cluster 5  430                      0         20503007   0
EM      Cluster 1  1223                     0         2292       61817229
        Cluster 2  7                        141048    243705     313018
        Cluster 3  81                       0         3033733    131146577
        Cluster 4  64                       26752     546896     330881249
        Cluster 5  1201                     0         1300026    217950471
DB      Cluster 1  13                       23614     33146      327222911
        Cluster 2  11                       1686825   21001      363965818
        Cluster 3  13                       178888    2945283    196118393
        Cluster 4  11                       847733    211008     58940877
        Cluster 5  2528                     0         546896     20554176
HC      Cluster 1  291                      0         3499327    0
        Cluster 2  191                      0         20503007   96762283
        Cluster 3  294                      0         1590971    0
        Cluster 4  224                      0         189812     12673555
        Cluster 5  243                      0         546896     0
LP      Cluster 1  221                      0         3499327    0
        Cluster 2  221                      0         20503007   0
        Cluster 3  221                      0         1590971    0
        Cluster 4  221                      0         189812     0
        Cluster 5  221                      0         546896     0
Table 3: Comparison of running time (in seconds) for the first dataset.

Format              KM     HC     DBScan  XM    EM    LP
Vector database     3.27   12.52  23.24   2.78  9.30  1.83
Raster database     3.42   15.36  28.20   2.84  9.84  2.01
RasterP (16 grids)  1.98   1.34   5.08    0.46  0.57  0.78
RasterP (25 grids)  0.09   0.14   1.15    0.21  0.12  0.53
in dataset 1 is shown in Figure 8, where (a) uses a morphological operation method and (b) uses the thinning algorithm. Likewise, the corresponding result of skeleton extraction for the second dataset is shown in Figure 9, where (a) uses a morphological operation method and (b) uses the thinning algorithm. The comparison result for the two datasets is shown in Table 1.

For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.

The results of skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing, from which the clustering by grid can be readily obtained. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.
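A standard choice for such a thinning step is the Zhang-Suen algorithm, whose two sub-iterations per pass are consistent with the "two-layer iteration nesting" cost noted above. The sketch below is a generic Zhang-Suen implementation for a binary image stored as nested lists; it is an illustrative stand-in, not the authors' MATLAB code.

```python
def zhang_suen_thin(img):
    """Zhang-Suen thinning of a binary image (nested lists of 0/1).
    Each pass peels removable boundary pixels in two sub-iterations
    until no pixel changes."""
    img = [row[:] for row in img]          # work on a copy
    h, w = len(img), len(img[0])

    def neighbours(x, y):
        # P2..P9, clockwise starting from the pixel directly above
        return [img[x-1][y], img[x-1][y+1], img[x][y+1], img[x+1][y+1],
                img[x+1][y], img[x+1][y-1], img[x][y-1], img[x-1][y-1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for x in range(1, h - 1):
                for y in range(1, w - 1):
                    if img[x][y] != 1:
                        continue
                    p = neighbours(x, y)
                    b = sum(p)                                   # non-zero neighbours
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1    # 0 -> 1 transitions
                            for i in range(8))
                    if step == 0:
                        ok = p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0
                    else:
                        ok = p[0] * p[2] * p[6] == 0 and p[0] * p[4] * p[6] == 0
                    if 2 <= b <= 6 and a == 1 and ok:
                        to_clear.append((x, y))
            for x, y in to_clear:
                img[x][y] = 0
                changed = True
    return img
```

Applied to a thick bar of foreground pixels, this reduces it to a one-pixel-wide skeleton while never adding pixels.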
The choice of placing a grid on the image follows one principle: mesh boundaries should not fall on concentrated positions of traffic flow. Since there is no natural endpoint, the midpoint between two adjacent values is taken as a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. This digital data for the traffic map serves as the initial data for the subsequent clustering process.
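The binning rule described above (midpoints between adjacent measurement positions serve as demarcation points) can be sketched as follows; the function name and the cut-point representation are illustrative assumptions, not the authors' code.

```python
from bisect import bisect_right

def grid_totals(points, x_cuts, y_cuts):
    """Aggregate (x, y, volume) sensor readings into grid cells whose
    borders are the given demarcation points (midpoints between adjacent
    measurement positions). Cell (i, j) collects every reading that
    falls between consecutive cut values."""
    grid = [[0.0] * (len(y_cuts) + 1) for _ in range(len(x_cuts) + 1)]
    for x, y, v in points:
        # bisect_right finds the cell index on each axis
        grid[bisect_right(x_cuts, x)][bisect_right(y_cuts, y)] += v
    return grid
```

The resulting grid of summed volumes plays the role of the digital traffic map used as input to clustering.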
5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum number of iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,
Table 4: Comparison of log-likelihood for the first dataset.

Format              KM         HC         DBScan     XM         EM
Vector database     -12.41868  -14.07265  -13.28599  -11.9533   -12.49562
Raster database     -13.42238  -15.02863  -13.78889  -12.9632   -13.39769
RasterP (16 grids)  -12.62264  -14.02266  -12.48583  -12.39419  -12.44993
RasterP (25 grids)  -12.41868  -13.19417  -11.22207  -12.48201  -11.62048
Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.

Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.
Table 5: Comparison of running time (in seconds) for the second dataset.

Format              KM    HC     DBScan  XM    EM     LP
Vector database     1.39  1.34   15.53   1.53  10.05  3.37
Raster database     2.41  14.78  18.34   2.17  8.23   1.96
RasterP (16 grids)  0.47  8.01   12.74   0.45  3.77   1.44
RasterP (25 grids)  0.35  6.20   10.98   0.36  2.96   1.18
the weights for the three attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)) can be varied (fine-tuned), provided that the weights sum to 1. We tested several variations in search of the best clustering results: (1) the weight of v is 20%; (2) the weight of v is 40%; (3) the weight of v is 50%; (4) the weight of v is 60%; (5) the weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) the weight of v is 0; (8) all weights are the same except when g_i(v_i = 0); and (9) the weights of x and y are both 0 except when g_i(v_i = 0).

In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted for raw numeric data, while the other two options, Jaccard's coefficients and the matching coefficient, are activated only when the data are binary.
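The normalize-then-weight preparation described above can be sketched as follows, assuming min-max normalization to [0, 1]; the exact normalization used by XLMiner may differ, and the function name is our own.

```python
def weighted_features(cells, w_x, w_y, w_v):
    """Min-max normalize the (x, y, v) attributes of each grid cell to
    [0, 1], then scale each attribute by its weight (weights sum to 1),
    so that Euclidean distances on the result reflect the chosen
    weighting scheme."""
    assert abs(w_x + w_y + w_v - 1.0) < 1e-9
    cols = list(zip(*cells))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]   # avoid divide-by-zero
    weights = (w_x, w_y, w_v)
    return [tuple(w * (value - lo) / span
                  for value, lo, span, w in zip(row, lows, spans, weights))
            for row in cells]
```

Setting, say, w_v = 0.5 and w_x = w_y = 0.25 reproduces weighting case (3) above before the features are handed to KM or HC.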
For the above nine cases, the results of cases (1) to (6) are similar within each method, and the result of case (9) is the worst, accomplishing no useful clustering at all. The results of cases (2), (3), (7), and (8) are shown in Figure 10.

In the distribution of clusters produced by the KM method, more than half of the data points are crammed into one oversized cluster; the result of this method is therefore not helpful for further operation. With the HC method, the data are on average allocated into separate clusters. The result
Table 6: Comparison of log-likelihood for the second dataset.

Format              KM         HC         DBScan     XM         EM
Vector database     -17.35412  -19.62367  -17.53576  -17.21513  -16.57263
Raster database     -18.15926  -20.12568  -19.70756  -18.15791  -18.48209
RasterP (16 grids)  -15.51437  -17.24736  -16.37147  -17.01283  -15.66231
RasterP (25 grids)  -14.84761  -16.63789  -15.09146  -16.67312  -16.47823
Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%; the top half uses the KM clustering method and the bottom half uses the HC method. (b) Clustering results with setting case (3), where the weight of v is 50%; top half KM, bottom half HC. (c) Clustering results with setting case (7), where the weight of v is 0; top half KM, bottom half HC. (d) Clustering results with setting case (8), where all attributes share the same weight except g_i(v_i = 0); top half KM, bottom half HC.
Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size      KM     HC      DBScan  XM    EM      LP
100 grid cells    0.06   0.07    1.05    2.19  3.21    0.18
4600 grid cells   0.42   2.95    39.89   2.73  19.05   9.37
10000 grid cells  2.62   46.67   97.55   2.97  37.85   24.21
80000 grid cells  19.75  189.61  684     6.47  198.31  90.83
in Figure 10(c) is the best, as it is the only one based on distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) suffer from cluster overlaps; allocating critical resources per cluster may therefore result in a waste of resources. The degree of overlap is least in Figure 10(b). If only location is considered, Figure 10(c) is the best choice; otherwise, Figure 10(b) is better than the other two for the sake of cluster distribution.
The clustering performance of the two methods, KM and HC, on the second dataset is shown in Figure 11.

From the cluster distributions of the second dataset obtained by both clustering methods, the sizes of the clusters are more or less similar, which is better than for the first dataset. There is no overlap phenomenon in the KM results, a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters take somewhat irregular shapes. Overall, for the second dataset, KM is the better choice in consideration of even cluster distribution and overlap avoidance.
5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen to be five. The
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.124360  0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.304300  0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469000  0.711025
Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.
result of the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result, and the second part (ii) visualizes the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.

In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure; the corresponding groups show the overlap phenomenon as well. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters are similar to each other, and there is no overlap in the clustering result; in the group results, however, the groups in (d) have far more overlaps than those in (b). Overlap means that part of one cluster gets in the way of another, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and the corresponding groups, (d) XM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.
With the same experimental setup and operating environment, the spatial clustering experiments are performed on the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) the corresponding spatial groups by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.
In Figures 13(a) and 13(e), it is noticeable that the clusters are imbalanced and that there are overlaps in the corresponding spatial groups produced by (a) EM and (e) DBScan. The results of (b) KM and (d) XM avoid these shortcomings, though they still have slight overlaps. For (c) HC, we remove the empty cells at the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap, and the clusters are balanced with each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the groups the same size; it yields perfectly balanced groups without any overlap, as shown in Figure 13(f).

By visually comparing the results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area; as seen from its visual spatial groups, the cluster positions are a little farther apart than those in the first dataset.

Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 is collected and shown in Table 2. The numeric results in
Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups from the results of using DBScan.
Table 2 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the numbers of cells covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.
6. Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. For the purpose of assessing the quality of the spatial groups obtained from clustering, several evaluation factors are defined here: running time (short: time), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for the six different methods. Running time is the time taken to run each method to completion using the same software on the same computer. Balance measures the sizes of the groups: if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of the traffic volumes that are covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups from the results of using DBScan. (f) Spatial groups from the LP method on dataset 2.
the traffic volumes that are covered by all the clusters minus the overlap, if any. The corresponding definitions are shown in the equations below:

    Density(cluster i) = Σ Traffic Volumes(cluster i) / Grid Cell Number(cluster i),

    Coverage(cluster i) = Σ Traffic Volumes(cluster i) / Σ Traffic Volumes,

    Total Coverage = Σ Traffic Volumes covered by all clusters - Overlaps,

    Proportion of Cluster(i) Size (Balance) = Grid Cell Number(cluster i) / Σ Grid Cell Number.
    (4)
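The per-cluster indicators in (4) reduce to simple ratios once the volumes and cell counts have been tallied per cluster; a minimal sketch, with names of our own choosing:

```python
def cluster_metrics(volumes, cells):
    """Per-cluster density, coverage, and balance as defined in (4).
    volumes[i] is the total traffic volume inside cluster i and
    cells[i] its grid-cell count; coverage and balance are measured
    against the totals of the whole dataset."""
    total_volume = sum(volumes)
    total_cells = sum(cells)
    density = [v / c for v, c in zip(volumes, cells)]       # volume per cell
    coverage = [v / total_volume for v in volumes]          # share of all volume
    balance = [c / total_cells for c in cells]              # share of all cells
    return density, coverage, balance
```

Summing the coverage entries and subtracting any overlapping volume gives the total coverage figure reported in Tables 8 and 9.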
6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of the dataset on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means that neighborhood cells over the grid are merged into single units of 16 cells each; and RasterP (25 grids) means that neighborhood cells are merged into units of 25 cells each. In the latter two formats, the data information is laid directly on a grid, and some noise, such as outlier values, is eliminated from the grid. We selected grids of sizes 16 and 25 for the two formats. The original datasets are then encoded in the four different data formats, and the four formatted datasets are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.
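One plausible reading of the RasterP preprocessing is block aggregation: each b-by-b neighborhood of grid cells is summed into a single coarser unit. This interpretation, and the function below, are our own assumptions rather than the paper's stated procedure.

```python
def merge_blocks(grid, b):
    """Coarsen a raster by summing each b-by-b block of cells into one
    unit; under this reading, b = 4 would give the 'RasterP (16 grids)'
    format and b = 5 the 'RasterP (25 grids)' format. Ragged edges that
    do not fill a whole block are truncated."""
    h, w = len(grid) // b, len(grid[0]) // b
    return [[sum(grid[b * i + di][b * j + dj]
                 for di in range(b) for dj in range(b))
             for j in range(w)]
            for i in range(h)]
```

Coarsening shrinks the input by a factor of b squared, which is consistent with the much lower running times reported for the RasterP formats.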
According to Table 3, KM spent the least running time on all four kinds of data, and the RasterP (25 grids) dataset was the fastest to process. Conversely, clustering the vector dataset using the DBScan method took the longest running time. Among the clustering methods, KM spent the least time on the different datasets and DBScan the longest.
In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best, while clustering RasterP (25 grids) using DBScan is the worst.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we also enlarge the dataset by expanding the data map via duplication. The resulting running time trends are shown in Table 7, and the corresponding trend lines in Figure 14.
According to Table 5, KM spent the shortest running time on the four different formats of data, and the RasterP (25 grids) dataset was the fastest to process, which is expected because it abstracts every 25 cells into one. On
Figure 14: Comparison of running time (in seconds) for different sizes of dataset, with exponential trend lines fitted for K-means, Hierarchical, DBScan, XMean, EM, and LP.
the other hand, clustering the Raster dataset using the DBScan method took the most running time. Across the six methods, KM generally spent the shortest time on the different datasets and DBScan the longest.

In Table 6, we can see that the log-likelihood values of the methods are quite similar. Among them, clustering the Raster dataset using the HC method is the best, while clustering RasterP (25 grids) using KM is the worst.

In Table 7, we can see that the slowest method is DBScan and the quickest is KM. In terms of time trends, DBScan's time consumption grows in larger magnitude than the other methods', whereas the trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.

From Figure 15, we can see that for the first dataset one cluster from DBScan has the largest coverage among all clusters produced by the six methods, whereas for the second dataset the LP method yields the cluster with the largest coverage. Generally, the individual coverage of each cluster in the second dataset is noticeably larger than in the first dataset (Tables 8 and 9). This means that the second dataset is well suited to achieving spatial groups with the six methods, owing to its even data distribution. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a), we can see that one cluster from EM has the highest density among all clusters of the six methods on the first dataset, while the LP method obtains the largest total density evenly from all its clusters. Generally, the individual density of each cluster in the second dataset is much higher than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset: DBScan has an advantage in merging scattered data into dense groups as long as the data are well scattered.
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell numbers between the clusters. We assign each factor a proportional weight ω to adjust the evaluation result G_net; the ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the other factors are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be made larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is under consideration of all the performance attributes:

    G_l = |Likelihood / Time|,    (5)

    G_b = Difference of Balance / Time,    (6)

    G_d = Density / Time,    (7)

    G_c = Coverage / Time,    (8)

    G_o = Overlap / Time,    (9)

    G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o,    (10)

    subject to the constraint ω_l + ω_b + ω_d + ω_c + ω_o = 1.    (11)
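Computing G_net from (5) to (11) is then a weighted sum of per-second factors; a sketch with illustrative arguments (in practice the factor inputs come from measurements such as those in Table 12, and the function name is our own):

```python
def g_net(time, likelihood, balance_diff, density, coverage, overlap, weights):
    """Net indicator G_net from (5)-(11): each factor is divided by the
    running time, weighted, and summed; the five weights must add up
    to 1 per constraint (11)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    factors = [abs(likelihood) / time,   # G_l, (5)
               balance_diff / time,      # G_b, (6)
               density / time,           # G_d, (7)
               coverage / time,          # G_c, (8)
               overlap / time]           # G_o, (9)
    return sum(w * g for w, g in zip(weights, factors))
```

With equal weights of 0.2, this reproduces the "equal weights" setting used for the comparison in Table 13.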
From the results of the spatial grouping experiments in the previous sections, we obtain statistics for each group based on the second dataset, expressed as the range of indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best running time and no overlap, while the XM method, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between the clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by setting the lowest G_net among the six methods to the base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
According to the experimental results conducted so far, LP appears to be the best candidate in almost all aspects, such as coverage and balance. This was tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the result is not so consistent, as LP can be outperformed by DBScan at times. Finally, by the net result G_net, LP is the better choice under the overall consideration of the six performance factors. The weights, which imply priorities or preferences on the performance aspects, should be chosen at the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. When viewed from afar, the locations of the data form densities spatially distributed over a terrain, and the values collected from the sensors indicate how important the data are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
International Journal of Distributed Sensor Networks 19
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density    Time    Log-likelihood   Overlap   Diff. of balance
KM       0.595751   38968.73   0.41    −17.35           No        190
XM       0.533037   34866.53   0.67    −17.22           No        185
EM       0.507794   68197.14   1.23    −16.57           Yes       1216
DBScan   0.461531   82306.47   15.67   −17.54           Yes       2517
HC       0.677124   59815.04   14.78   −20.13           Yes       103
LP       0.711025   54404.47   7.76    N/A              No        0
Table 13: Comparison of different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32
purposes such as resource allocation, distribution evaluations, or summarizing the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, of certain sizes and positions, using clustering algorithms or their equivalents, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by different methods are sufficiently efficient for achieving the optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups, with different values of data resources, were then assessed via six performance factors. Weights were also formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under given factors and weights may vary, as the factors can be chosen arbitrarily by users.
The spatial groups obtained by classic clustering algorithms have some limitations, such as overlaps, which may waste resources and even cause false grouping. However, to the best of the authors' knowledge, no study reported in the literature has used the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups yielding maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to further enhance the approach, for example by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. It would be good if the advantages of one algorithm could carry over to the others in the fusion algorithms to be developed.
Table 2: Important statistics from the clustering and LP experiments.

Method   Cluster     Number of cells covered   Minimum    Maximum     Overlap
KM       Cluster 1   428                       0          3499327     0
KM       Cluster 2   468                       0          546896      0
KM       Cluster 3   448                       0          20503007    0
KM       Cluster 4   614                       0          6894667     0
KM       Cluster 5   618                       0          900908      0
XM       Cluster 1   615                       0          591265      0
XM       Cluster 2   457                       0          546896      0
XM       Cluster 3   609                       0          900908      0
XM       Cluster 4   465                       0          3499327     0
XM       Cluster 5   430                       0          20503007    0
EM       Cluster 1   1223                      0          2292        61817229
EM       Cluster 2   7                         141048     243705      313018
EM       Cluster 3   81                        0          3033733     131146577
EM       Cluster 4   64                        26752      546896      330881249
EM       Cluster 5   1201                      0          1300026     217950471
DB       Cluster 1   13                        23614      33146       327222911
DB       Cluster 2   11                        1686825    21001       363965818
DB       Cluster 3   13                        178888     2945283     196118393
DB       Cluster 4   11                        847733     211008      58940877
DB       Cluster 5   2528                      0          546896      20554176
HC       Cluster 1   291                       0          3499327     0
HC       Cluster 2   191                       0          20503007    96762283
HC       Cluster 3   294                       0          1590971     0
HC       Cluster 4   224                       0          189812      12673555
HC       Cluster 5   243                       0          546896      0
LP       Cluster 1   221                       0          3499327     0
LP       Cluster 2   221                       0          20503007    0
LP       Cluster 3   221                       0          1590971     0
LP       Cluster 4   221                       0          189812      0
LP       Cluster 5   221                       0          546896      0
Table 3: Comparison of running time (in seconds) for the first dataset.

Formats              KM     HC      DBScan   XM     EM     LP
Vector database      3.27   12.52   23.24    2.78   9.30   1.83
Raster database      3.42   15.36   28.20    2.84   9.84   2.01
RasterP (16 grids)   1.98   1.34    5.08     0.46   0.57   0.78
RasterP (25 grids)   0.09   0.14    1.15     0.21   0.12   0.53
in dataset 1 is shown in Figure 8, where (a) adopted a morphological operation method and (b) adopted the thinning algorithm, respectively. Likewise, the corresponding result of skeleton extraction for the second dataset is shown in Figure 9, where (a) adopted the morphological operation method and (b) the thinning algorithm, respectively. The comparison result of the two datasets is shown in Table 1.
For the raw dataset, we first perform image preprocessing over it to obtain a numerical database.
The results of the skeleton extraction, as shown in Figures 8(b) and 9(b), are clearer and more useful for the subsequent processing. The clustering by grid can then be readily obtained from the preprocessed images. The extent of image thinning is better and more complete with the thinning algorithm than with the Bwmorph function in MATLAB, but the elapsed time is longer due to a two-layer iteration nesting procedure in the program code.
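The paper's own thinning code is not given; as an illustrative stand-in, the classic Zhang-Suen thinning algorithm, which likewise uses a two-layer iteration nesting, can be sketched in Python:

```python
# Illustrative stand-in for the paper's thinning step: the classic
# Zhang-Suen algorithm (the paper's own routine is not reproduced).
import numpy as np

def zhang_suen_thin(img):
    """Thin a binary image (0/1) down to a ~1-pixel-wide skeleton."""
    img = img.astype(np.uint8).copy()
    changed = True
    while changed:                        # outer layer of the iteration nesting
        changed = False
        for step in (0, 1):               # the two Zhang-Suen subiterations
            to_delete = []
            for r in range(1, img.shape[0] - 1):
                for c in range(1, img.shape[1] - 1):
                    if img[r, c] == 0:
                        continue
                    # 8-neighbours clockwise from north: P2..P9
                    p = [img[r-1, c], img[r-1, c+1], img[r, c+1],
                         img[r+1, c+1], img[r+1, c], img[r+1, c-1],
                         img[r, c-1], img[r-1, c-1]]
                    b = sum(p)            # number of non-zero neighbours
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    if step == 0:
                        cond = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                    else:
                        cond = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((r, c))
            for r, c in to_delete:
                img[r, c] = 0
                changed = True
    return img

# Toy binary "road": a 4-pixel-thick bar thins down to a thin line.
image = np.zeros((20, 20), dtype=np.uint8)
image[8:12, 2:18] = 1
thinned = zhang_suen_thin(image)
```

The nested cell-by-cell scan inside a repeat-until-stable loop is what makes such thinning slower than a built-in morphological operator, matching the runtime observation above.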
The choice of placing a grid on the image follows one principle: the mesh segmentation should not fall on a concentrated position of traffic flow. Since there is no natural endpoint, the midpoint between two adjacent values was taken as a demarcation point. Under this assumption, the traffic flow in each grid cell is calculated and stored digitally in an Excel file. The digitized traffic map serves as the initial data for the subsequent clustering process.
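The digitization of traffic readings onto a grid can be sketched as below; the coordinates, volumes, and 4 x 4 grid are synthetic assumptions standing in for the paper's Excel workflow.

```python
# Digitising point traffic readings onto a coarse grid, mirroring the
# "RasterP (16 grids)" format. Coordinates and volumes are synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=500)         # sensor x-coordinates
y = rng.uniform(0, 100, size=500)         # sensor y-coordinates
volume = rng.integers(1, 50, size=500)    # traffic volume per reading

# Sum the volume of all readings falling in each of the 4 x 4 cells.
H, xedges, yedges = np.histogram2d(x, y, bins=(4, 4), weights=volume)

# H[i, j] is the total traffic volume of grid cell (i, j); this per-cell
# table plays the role of the Excel file fed to the clustering step.
```

No volume is lost in the binning: the cell totals add up to the sum of all readings.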
5.2. Comparison Result of KM and HC Clustering. In XLMiner, two methods were used to perform clustering: KM and HC. In order to compare the two methods on the two datasets, the input variables were normalized, the number of clusters was set at five, and the maximum number of iterations at 100. The initial centroids are chosen randomly at the start. Furthermore,
Table 4: Comparison of log-likelihood for the first dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      −12.41868   −14.07265   −13.28599   −11.9533    −12.49562
Raster database      −13.42238   −15.02863   −13.78889   −12.9632    −13.39769
RasterP (16 grids)   −12.62264   −14.02266   −12.48583   −12.39419   −12.44993
RasterP (25 grids)   −12.41868   −13.19417   −11.22207   −12.48201   −11.62048
Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function. (b) Result of skeleton extraction in dataset 1 using the thinning algorithm.
Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function. (b) Result of skeleton extraction in dataset 2 using the thinning algorithm.
Table 5: Comparison of running time (in seconds) for the second dataset.

Formats              KM     HC      DBScan   XM     EM      LP
Vector database      1.39   1.34    15.53    1.53   10.05   3.37
Raster database      2.41   14.78   18.34    2.17   8.23    1.96
RasterP (16 grids)   0.47   8.01    12.74    0.45   3.77    1.44
RasterP (25 grids)   0.35   6.20    10.98    0.36   2.96    1.18
the weights for the three attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)) could be varied (fine-tuned), with the constraint that the weights must sum to 1. We tested several variations searching for the best clustering results: (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes have the same weight of 33.3%; (7) weight of v is 0; (8) same weight except when g_i(v_i = 0); and (9) weights of x and y are both 0 except when g_i(v_i = 0).
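A plausible way to realize such weighted (x, y, v) cases outside XLMiner is to normalize each attribute and scale it by its weight before clustering; the data, the use of scikit-learn, and the weight choice below are assumptions for illustration, not the paper's setup.

```python
# Weighted (x, y, v) clustering: normalize each attribute to [0, 1],
# then scale by its weight so distances reflect the chosen priorities.
# Data, weights, and the scikit-learn API are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
grid = rng.uniform(0, 1, size=(200, 3))   # rows g_i = (x_i, y_i, v_i)

weights = np.array([0.3, 0.3, 0.4])       # case (2): weight of v is 40%
assert abs(weights.sum() - 1.0) < 1e-9    # weights must sum to 1

X = MinMaxScaler().fit_transform(grid) * weights

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
```

Setting the weight of v to 0 (case (7)) would reduce this to clustering on position only, as discussed for Figure 10(c).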
In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted to measure raw numeric data, while the other two options, Jaccard's coefficient and the matching coefficient, are activated only when the data is binary.
For the above nine cases, the results of cases (1) to (6) are similar within their respective methods, and the result of case (9) is the worst, accomplishing no clustering at all. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.
Regarding the distribution of clusters in the result of the KM clustering method, more than half of the data points are clamped into one oversized cluster. The result of this method is therefore not helpful for further operation. With the HC method, data are on average allocated into separate clusters. The result
Table 6: Comparison of log-likelihood for the second dataset.

Formats              KM          HC          DBScan      XM          EM
Vector database      −17.35412   −19.62367   −17.53576   −17.21513   −16.57263
Raster database      −18.15926   −20.12568   −19.70756   −18.15791   −18.48209
RasterP (16 grids)   −15.51437   −17.24736   −16.37147   −17.01283   −15.66231
RasterP (25 grids)   −14.84761   −16.63789   −15.09146   −16.67312   −16.47823
Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%: top half uses the KM clustering method and bottom half the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%: top half uses the KM clustering method and bottom half the HC method. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0: top half uses the KM clustering method and bottom half the HC method. (d) Clustering results for the first dataset with setting case (8), where all share the same weight except g_i(v_i = 0): top half uses the KM clustering method and bottom half the HC method.
Table 7: Comparison of running time (in seconds) of four different sizes of dataset.

Dataset size       KM      HC       DBScan   XM     EM       LP
100 grid cells     0.06    0.07     1.05     2.19   3.21     0.18
4600 grid cells    0.42    2.95     39.89    2.73   19.05    9.37
10000 grid cells   2.62    46.67    97.55    2.97   37.85    24.21
80000 grid cells   19.75   189.61   684      6.47   198.31   90.83
in Figure 10(c) is the best, showing clusters with distinct position attributes (x and y) only. The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; allocating critical resources per cluster may therefore result in a waste of resources. The degree of overlap is the least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.
The clustering results of the second dataset obtained by the two methods, KM and HC, are shown in Figure 11.
From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than for the first dataset. And there is no overlap phenomenon in the KM results; this is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better choice of the two clustering methods for the sake of even cluster distribution and overlap avoidance.
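The KM-versus-HC balance comparison can be reproduced in outline with scikit-learn; the synthetic data below merely illustrates how cluster-size balance would be inspected, and does not reproduce the paper's traffic grids.

```python
# Outline of the KM-versus-HC comparison on one dataset, checking how
# evenly the five clusters are sized (the "balance" criterion).
# The data are synthetic, not the paper's traffic grids.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 3))      # rows (x, y, v)

def cluster_sizes(labels, k=5):
    """Number of points in each of the k clusters."""
    return np.bincount(labels, minlength=k)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
hc = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X)

print("KM cluster sizes:", cluster_sizes(km))
print("HC cluster sizes:", cluster_sizes(hc))
```

A method that piles most points into one oversized cluster, as observed for KM on the first dataset, shows up immediately as a very skewed size vector.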
5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen at five. The
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1          KM         EM         DBScan     XM         HC         LP
Cluster 0        0.029436   0.003786   0.017902   0.075178   0.013153   0.028985
Cluster 1        0.301538   0.269602   0.208078   0.049761   0.026016   0.377034
Cluster 2        0.215277   0.001627   0.158439   0.084049   0.124360   0.080099
Cluster 3        0.046788   0.096221   0.079177   0.209390   0.001172   0.217204
Cluster 4        0.002712   0.161799   0.044197   0.043152   0.304300   0.007704
Total coverage   0.595751   0.533036   0.507793   0.461531   0.469000   0.711025
Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.
result for the first dataset is shown in Figure 12. Part (i) of Figure 12 shows the spatial clustering result, and part (ii) visualizes the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.
In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This result reveals that the technique cannot organize a dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups exhibit the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters seem similar to each other. There is also no overlap in the clustering result, but in the group result the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (d) XM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.
With the same experimental setup and operating environment, the spatial clustering experiments were performed on the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.
In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM, however, avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we remove the empty cells at the boundary to reduce the size of the dataset; the clustering result is perfect: there is no overlap, and the clusters are balanced with each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
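The paper's exact LP formulation is not reproduced here, but the core idea, equal-size groups with no overlap, can be sketched as a transportation-style LP: each grid cell is assigned to exactly one of k groups of fixed capacity, minimizing its distance to assumed group centres. The data, centres, and use of SciPy below are illustrative assumptions.

```python
# Transportation-style LP sketch of balanced, overlap-free grouping:
# every cell joins exactly one of k equally sized groups, minimizing
# squared distance to assumed group centres. Data are synthetic.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
cells = rng.uniform(0, 10, size=(20, 2))    # grid-cell coordinates
centres = rng.uniform(0, 10, size=(4, 2))   # one assumed centre per group
n, k = len(cells), len(centres)
cap = n // k                                # equal group size -> no overlap

# cost[i, j]: squared distance from cell i to centre j
cost = ((cells[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)

# Equalities: each cell assigned once; each group filled to capacity.
A_eq = np.zeros((n + k, n * k))
for i in range(n):
    A_eq[i, i * k:(i + 1) * k] = 1          # one row per cell
for j in range(k):
    A_eq[n + j, j::k] = 1                   # one row per group
b_eq = np.concatenate([np.ones(n), np.full(k, cap)])

res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
# Transportation LPs have integral vertex optima, so rounding is safe.
assign = res.x.reshape(n, k).argmax(axis=1)
```

Because the capacity constraints force each group to hold exactly n/k cells and each cell belongs to exactly one group, the resulting partition is balanced and overlap-free by construction.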
By visually comparing the clustering results of the two datasets, the clustering results seem similar, but the spatial groups are somewhat different. The occurrence of overlaps in the spatial groups is more severe in the first dataset than in the second one. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. For the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area. As seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart when compared to those of the first dataset.
Based on the results generated from the clustering and LP experiments, some statistical information on dataset 2 was collected; it is shown in Table 2. The numeric results in
Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups on dataset 1 from the results of using KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups on dataset 1 from the results of using HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups on dataset 1 from the results of using XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups on dataset 1 from the results of using DBScan.
Table 2 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the numbers of cells covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.
6. Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups from clustering, several evaluation factors are defined here: running time (in short, time), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for the six different methods. Running time is the time used to run each method to completion, with the same software on the same computer. Balance measures the sizes of the groups; if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests for goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster means the proportion of traffic volumes that are covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial groups by the LP method on dataset 2.
traffic volumes that are covered by all the clusters, minus the overlap, if any. The corresponding definitions are shown in the equations below:
Density(cluster i) = Σ Traffic Volumes(cluster i) / Grid Cell Number(cluster i),

Coverage(cluster i) = Σ Traffic Volumes(cluster i) / Σ Traffic Volumes,

Total Coverage = Σ_i Coverage(cluster i) − Overlaps,

Balance: Proportion of Cluster i Size = Grid Cell Number(cluster i) / Σ Grid Cell Number. (4)
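A minimal computation of the per-cluster density, coverage, and balance defined in (4), on made-up volumes and cluster labels:

```python
# Per-cluster density, coverage, and balance as defined in (4),
# computed on made-up volumes and cluster labels.
import numpy as np

volumes = np.array([10.0, 5.0, 0.0, 20.0, 15.0, 50.0])  # volume per grid cell
labels = np.array([0, 0, 1, 1, 2, 2])                   # cluster of each cell

total_volume = volumes.sum()
total_cells = len(volumes)

stats = {}
for c in np.unique(labels):
    mask = labels == c
    stats[int(c)] = {
        "density": volumes[mask].sum() / mask.sum(),     # volume per cell
        "coverage": volumes[mask].sum() / total_volume,  # share of all volume
        "balance": mask.sum() / total_cells,             # share of all cells
    }

# With no overlap, the cluster coverages add up to the total coverage.
total_coverage = sum(s["coverage"] for s in stats.values())
```

When clusters overlap, as for EM and DBScan above, the shared volume would have to be subtracted once from this sum, which is exactly the Overlaps term in (4).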
6.2. Comparison Experimental Result. After conducting a number of experiment runs, we selected four different formats of the first dataset on which to perform the clustering algorithms. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighborhood cells over a grid are merged into a single unit; and RasterP (25 grids) means every five neighborhood cells over a grid are merged as one. In the latter two formats, the data information is laid straightforwardly on a grid, and noises such as outlier values are eliminated from the grid; we selected grids of sizes 16 and 25 for these two formats. The original datasets are then encoded by the four different data formatting types. The four formatted datasets are subjected to the five clustering methods and the LP method, and we measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively.
According to Table 3, KM spent the least running time for the four different kinds of data, and the RasterP (25 grids) dataset was the fastest to process. Conversely, clustering the vector dataset using the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the different datasets and DBScan the longest.
In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main evaluation metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using DBScan is the worst.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarged the dataset to larger sizes by expanding the data map via duplication. The resulting running time trends are shown in Table 7, and the corresponding trend lines in Figure 14.
According to Table 5, KM spent the shortest running time for the four different formats of data, and the RasterP (25 grids) dataset was the fastest, which is expected because it abstracts every 25 cells into one. On
Figure 14: Comparison of running time (in seconds) for different sizes of dataset.
the other hand, clustering the Raster dataset using the DBScan method took the most running time. Among the six methods, KM generally spent the shortest time across the different datasets and DBScan the longest.
In Table 6, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering of the Raster dataset using the HC method is the best, while clustering of RasterP (25 grids) using KM is the worst.
In Table 7, we can see that the slowest is DBScan and the quickest is the KM method. In terms of time trends, DBScan's time consumption increases in larger magnitude than the other methods', whereas the trend lines of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: when the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
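The exponential trend lines of Figure 14 and the HC/EM crossover can be approximated by regressing log(time) on dataset size; the timing points below are Table 7's HC and EM entries read with two implied decimal places, which is an assumption about the extraction.

```python
# Fit t = a * exp(b * n) to the HC and EM timings and locate the
# crossover of the fitted curves (cf. Figure 14). Timing points are
# Table 7's entries read with two implied decimal places (assumed).
import numpy as np

sizes = np.array([100.0, 4600.0, 10000.0, 80000.0])  # grid cells
hc_time = np.array([0.07, 2.95, 46.67, 189.61])      # HC seconds
em_time = np.array([3.21, 19.05, 37.85, 198.31])     # EM seconds

def exp_fit(n, t):
    """Least-squares fit of log(t) = log(a) + b * n."""
    b, log_a = np.polyfit(n, np.log(t), 1)
    return np.exp(log_a), b

a_hc, b_hc = exp_fit(sizes, hc_time)
a_em, b_em = exp_fit(sizes, em_time)

# Curves intersect where a_hc * e^(b_hc * n) = a_em * e^(b_em * n).
n_cross = np.log(a_em / a_hc) / (b_hc - b_em)
```

Beyond n_cross the fitted HC curve grows faster than the fitted EM curve, matching the observation that EM becomes the better choice for sufficiently large datasets.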
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.
From Figure 15 we can see that, among the results of the six methods on the first dataset, one cluster of DBScan dominates with the biggest coverage of all clusters, but for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is apparently larger than in the first dataset (Tables 8 and 9). This means that the second dataset, owing to its even data distribution, is well suited for forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values for both datasets. In summary, LP is by far an effective method to determine spatial groups with the best coverage.
International Journal of Distributed Sensor Networks 15
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1; (b) coverage of each cluster by using the six methods for dataset 2. (Bar charts of Clusters 0-4 and total coverage for KM, EM, DBScan, XM, HC, and LP.)
Figure 16: (a) Density of each cluster by using the six methods for dataset 1; (b) density of each cluster by using the six methods for dataset 2. (Bar charts of Clusters 0-4 and total density for KM, EM, DBScan, XM, HC, and LP.)
16 International Journal of Distributed Sensor Networks
Figure 17: Proportions of cluster sizes (balance) of dataset 1, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP. (LP yields five equal clusters of 20% each.)
From Figure 16(a) we can see that one cluster of EM has the biggest density among all clusters of the six methods in the first dataset, but the LP method obtains the largest total density, drawn evenly from all its clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second: DBScan has an advantage in merging scattered data into dense groups as long as the data are well scattered.
International Journal of Distributed Sensor Networks 17
Figure 18: Proportions of cluster sizes (balance) of dataset 2, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP. (Again, LP yields five equal clusters of 20% each.)
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
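The balance property of LP is not accidental: balanced, overlap-free grouping can be written directly as a linear program. Below is a minimal sketch with scipy's linprog on hypothetical cell values; it enforces equal group sizes (balance) and at most one group per cell (no overlap) while maximizing the covered value. The paper's actual LP formulation also involves group positions and shapes, which we omit here:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(42)
n, k = 30, 5                 # 30 grid cells, 5 groups
size = 4                     # each group must contain exactly 4 cells
v = rng.uniform(0, 100, n)   # hypothetical value (traffic volume) per cell

# Decision variables x[i, g] in [0, 1], flattened as index i * k + g.
c = -np.repeat(v, k)                      # maximize value => minimize -value
A_ub = np.zeros((n, n * k))               # each cell in at most 1 group
for i in range(n):
    A_ub[i, i * k:(i + 1) * k] = 1.0      # (no overlap)
A_eq = np.zeros((k, n * k))               # each group has exactly `size` cells
for g in range(k):
    A_eq[g, g::k] = 1.0                   # (perfect balance)

res = linprog(c, A_ub=A_ub, b_ub=np.ones(n),
              A_eq=A_eq, b_eq=np.full(k, size),
              bounds=(0, 1), method="highs")
x = res.x.reshape(n, k)
```

Because the constraint matrix is that of a transportation-type problem, the LP relaxation already returns an integral 0/1 assignment: the highest-value cells are selected, exactly `size` per group, with no cell in two groups.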
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure to decide whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in
18 International Journal of Distributed Sensor Networks
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference of grid cell numbers between the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if a very wide coverage is of priority and the others are of less concern, ω_c can take a relatively large value or even 1. If users consider some attributes to be more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, which is the sum of all the performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is by considering all the performance attributes.
G_l = |Likelihood / Time|,                                        (5)

G_b = Difference of Balance / Time,                               (6)

G_d = Density / Time,                                             (7)

G_c = Coverage / Time,                                            (8)

G_o = Overlap / Time,                                             (9)

G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o,          (10)

Constraint: ω_l + ω_b + ω_d + ω_c + ω_o = 1.                      (11)
From the results of spatial grouping as experimented in the previous sections, we obtain some statistical information on each group based on the second dataset, expressed as the range of indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best running time and no overlap. The XM, DBScan, and HC methods demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
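The scoring in (5)-(11) and the normalization step can be sketched as follows. The figures are a subset of Table 12 (dataset 2); encoding overlap as 0/1 and LP's unavailable log-likelihood as 0 are our own assumptions, not choices spelled out in the paper:

```python
methods = {
    #        coverage  density   time  loglik   overlap  diff_balance
    "KM":   (0.595751, 3896873,  0.41, -1735.0, 0.0,  190),
    "XM":   (0.533037, 3486653,  0.67, -1722.0, 0.0,  185),
    "EM":   (0.507794, 6819714,  1.23, -1657.0, 1.0, 1216),
    "LP":   (0.711025, 5440447,  7.76,     0.0, 0.0,    0),
}
w = dict(l=0.2, b=0.2, d=0.2, c=0.2, o=0.2)   # equal weights, summing to 1

def g_net(cov, dens, time, loglik, overlap, diff_bal, w):
    g_l = abs(loglik / time)          # (5)
    g_b = diff_bal / time             # (6)
    g_d = dens / time                 # (7)
    g_c = cov / time                  # (8)
    g_o = overlap / time              # (9)
    return w["l"]*g_l + w["b"]*g_b + w["d"]*g_d + w["c"]*g_c + w["o"]*g_o

scores = {m: g_net(*vals, w) for m, vals in methods.items()}
base = min(scores.values())           # lowest score becomes base value 1
normalized = {m: s / base for m, s in scores.items()}
```

Changing the weight dictionary w reweights the factors exactly as described in the text, subject to the constraint in (11) that the weights sum to 1.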
According to the experimental results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This is tested across different datasets, different formats, and different sizes of dataset. For density and log-likelihood, however, the results are not so consistent, as LP would be outperformed by DBScan at times. Finally, by the net result of G_net, LP is a better choice under the overall consideration of the six performance factors. The choice of weights, which imply priorities or preferences on the performance aspects, is left to the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. When viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage  Density  Time   Log-likelihood  Overlap  Diff. of balance
KM       0.595751  3896873  0.41   -1735           No       190
XM       0.533037  3486653  0.67   -1722           No       185
EM       0.507794  6819714  1.23   -1657           Yes      1216
DBScan   0.461531  8230647  15.67  -1754           Yes      2517
HC       0.677124  5981504  14.78  -2013           Yes      103
LP       0.711025  5440447  7.76   NA              No       0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32
purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, with certain sizes and positions, using clustering algorithms or their equivalent, for obtaining maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users who may each have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by using different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. To the best of the authors' knowledge, however, no study reported in the literature has used the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research, we implemented this new method (LP) to obtain spatial groups yielding maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to further enhance the algorithm, for example, by combining the LP method with existing spatial grouping algorithms to achieve new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. It would be good if the advantages of one algorithm could carry over to the others in the new fusion algorithms to be developed.
Table 4: Comparison of log-likelihood for the first dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     -1241.868  -1407.265  -1328.599  -1195.33   -1249.562
Raster database     -1342.238  -1502.863  -1378.889  -1296.32   -1339.769
RasterP (16 grids)  -1262.264  -1402.266  -1248.583  -1239.419  -1244.993
RasterP (25 grids)  -1241.868  -1319.417  -1122.207  -1248.201  -1162.048
Figure 8: (a) Result of skeleton extraction in dataset 1 using the Bwmorph function; (b) result of skeleton extraction in dataset 1 using the thinning algorithm.

Figure 9: (a) Result of skeleton extraction in dataset 2 using the Bwmorph function; (b) result of skeleton extraction in dataset 2 using the thinning algorithm.
Table 5: Comparison of running time (in seconds) for the second dataset.

Formats             KM    HC     DBScan  XM    EM     LP
Vector database     1.39  1.34   15.53   1.53  10.05  3.37
Raster database     2.41  14.78  18.34   2.17  8.23   1.96
RasterP (16 grids)  0.47  8.01   12.74   0.45  3.77   1.44
RasterP (25 grids)  0.35  6.20   10.98   0.36  2.96   1.18
the weights for the corresponding three attributes (x, y, v) of each grid cell (g_i = (x_i, y_i, v_i)). The weights of x, y, and v can be varied (fine-tuned), and the sum of the weights must equal 1. We tested several variations searching for the best clustering results: (1) weight of v is 20%; (2) weight of v is 40%; (3) weight of v is 50%; (4) weight of v is 60%; (5) weight of v is 80%; (6) all attributes share the same weight of 33.3%; (7) weight of v is 0; (8) same weight for all except when g_i(v_i = 0); and (9) weights of x and y both 0 except when g_i(v_i = 0).
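Such attribute weighting can be emulated by rescaling each column by the square root of its weight before clustering, so that squared Euclidean distances weight (x, y, v) proportionally. The toy k-means loop below is only a sketch of the idea on synthetic data, not Weka's implementation:

```python
import numpy as np

def weighted_kmeans(X, weights, k, iters=50, seed=0):
    """Minimal k-means with per-attribute weights applied via sqrt scaling.
    Illustrative sketch; not the Weka clusterer used in the paper."""
    rng = np.random.default_rng(seed)
    Xw = X * np.sqrt(np.asarray(weights, dtype=float))  # apply weights
    centers = Xw[rng.choice(len(Xw), k, replace=False)]
    for _ in range(iters):
        # squared distances from every point to every center
        d = ((Xw[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):                 # recompute non-empty centers
            if np.any(labels == j):
                centers[j] = Xw[labels == j].mean(0)
    return labels

# e.g., setting case (3): x and y share 50%, v gets the other 50%
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (100, 3))           # columns: x, y, v (synthetic)
labels = weighted_kmeans(X, [0.25, 0.25, 0.5], k=5)
```

Setting the weight of v to 0 reproduces case (7), clustering on position alone; setting the weights of x and y to 0 reproduces the degenerate case (9).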
In the HC method, normalization of the input data was chosen. Another available option is the similarity measure: Euclidean distance is adopted for raw numeric data, while the other two options, Jaccard's coefficient and the matching coefficient, are activated only when the data is binary.
Of the above nine cases, the results of cases (1) to (6) are similar within their respective methods, and the result of case (9) is the worst, accomplishing no clustering at all. The results of cases (2), (3), (7), and (8) are demonstrated in Figure 10.
Regarding the distribution of clusters in the result of the KM clustering method, more than half of the data points are clamped into one oversized cluster; the result of this method is therefore not helpful for further operation. With the HC method, data on average are allocated into separate clusters. The result
Table 6: Comparison of log-likelihood for the second dataset.

Formats             KM         HC         DBScan     XM         EM
Vector database     -1735.412  -1962.367  -1753.576  -1721.513  -1657.263
Raster database     -1815.926  -2012.568  -1970.756  -1815.791  -1848.209
RasterP (16 grids)  -1551.437  -1724.736  -1637.147  -1701.283  -1566.231
RasterP (25 grids)  -1484.761  -1663.789  -1509.146  -1667.312  -1647.823
Figure 10: (a) Clustering results for the first dataset with setting case (2), where the weight of v is 40%: the top half uses the KM clustering method and the bottom half uses the HC method. (b) Clustering results for the first dataset with setting case (3), where the weight of v is 50%: top half KM, bottom half HC. (c) Clustering results for the first dataset with setting case (7), where the weight of v is 0: top half KM, bottom half HC. (d) Clustering results for the first dataset with setting case (8), where all attributes share the same weight except g_i(v_i = 0): top half KM, bottom half HC.
Table 7: Comparison of running time (in seconds) for four different sizes of dataset.

Dataset size      KM     HC      DBScan  XM    EM      LP
100 grid cells    0.06   0.07    1.05    2.19  3.21    0.18
4600 grid cells   0.42   2.95    39.89   2.73  19.05   9.37
10000 grid cells  2.62   46.67   97.55   2.97  37.85   24.21
80000 grid cells  19.75  189.61  684     6.47  198.31  90.83
in Figure 10(c) is the best, being the only one with distinct position attributes (x and y). The other three results (Figures 10(a), 10(b), and 10(d)) are stained with cluster overlaps; allocating critical resources per cluster, for example, may therefore result in a waste of resources. The degree of overlap is the least in the result of Figure 10(b). If only location is being considered, the result of Figure 10(c) is the best choice; otherwise, the result in Figure 10(b) is better than the other two for the sake of cluster distribution.
The clustering results of the second dataset obtained by using the two methods, KM and HC, are shown in Figure 11.
From the cluster distributions of the second dataset obtained by both clustering methods, the size of each cluster is more or less similar, which is better than for the first dataset. There is also no overlap phenomenon in the KM results, which is a promising feature of the KM method for spatial clustering. However, there is a little overlap in the result of the HC method, as the clusters seem to take irregular shapes. Above all, for the second dataset, KM is the better of the two clustering methods in consideration of even cluster distribution and overlap avoidance.
5.3. Results of Grouping. In this part, we compare the colored maps of the Raster (x, y, v) data model for the two datasets using the five clustering methods in Weka and the LP method. The common requirement is no overlap in each of the resulting maps. The number of clusters is arbitrarily chosen as five. The
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.124360  0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.304300  0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469000  0.711025
Figure 11: (a) Clustering results for the second dataset by using the KM method; (b) clustering results for the second dataset by using the HC method.
result for the first dataset is shown in Figure 12. Part (i) of each panel of Figure 12 shows the spatial clustering result, and part (ii) visualizes the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.
In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven: more than half of the dataset falls into one cluster. This result reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure; the corresponding groups suffer from the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters seem similar to each other and there is no overlap in the clustering result, but in the grouping result the groups in (d) have far more overlaps than those in (b). Overlap means that some part of a cluster gets in the way of another one, that is, there is superposition between two or more different clusters. Again, it may cause a waste of resources and even false allocation. This situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, XM is so far the best choice of clustering algorithm, as evidenced by the colored maps.
With the same experimental setup and operating environment, the spatial clustering experiments were performed on the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.
In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups using the methods of (a) EM and (e) DBScan. The results of (b) KM and (d) XM avoid these shortcomings, though they still have slight overlaps. For (c) HC, we removed the empty cells at the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap and the clusters are balanced, but there is still overlap in the spatial groups. Thus the LP method is adopted to solve this problem while keeping the groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
Visually comparing the results of the two datasets, the clustering results seem similar but the spatial groups are somewhat different. The occurrence of overlaps in spatial groups is more severe in the first dataset than in the second. The overlaps are likely due to the data distribution and the balance in sizes between the clusters. In the first dataset, the heavy spatial values, which are traffic volumes in this case, are mainly concentrated in the center region (city center), so the locations of the computed clusters tend to cram very near a crowded spot. In contrast, the traffic volumes in the second dataset are dispersed over a large area; as seen from the visual spatial groups of the second dataset, the cluster positions are a little farther apart than those in the first dataset.
Based on the results generated from the clustering and LP experiments, some statistical information about dataset 2 is collected and shown in Table 2. The numeric results in
Figure 12: (a) (i) Spatial clustering on dataset 1 by using EM; (ii) spatial groups on dataset 1 from the results of using EM. (b) (i) Spatial clustering on dataset 1 by using KM; (ii) spatial groups from KM. (c) (i) Spatial clustering on dataset 1 by using HC; (ii) spatial groups from HC. (d) (i) Spatial clustering on dataset 1 by using XM; (ii) spatial groups from XM. (e) (i) Spatial clustering on dataset 1 by using DBScan; (ii) spatial groups from DBScan.
Table 3 support the qualitative analysis by visual inspection in the previous section. Comparing the HC and LP methods as an example, the quantitative results show that they have the greatest differences in the cell numbers covered by the clusters; also, the amount of overlap in HC is the highest of all. With the LP method, the size of each cluster is exactly the same, and the clusters are totally free from overlap.
6. Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. For the purpose of assessing the qualities of the spatial groups produced by clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion using the same software on the same computer. Balance measures the sizes of the groups: if balanced, the size of each group is the same. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, which is called log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster is the proportion of traffic volumes covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups from KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups from HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups from XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial groups from DBScan. (f) Spatial groups by the LP method on dataset 2.
14 International Journal of Distributed Sensor Networks
traffic volumes that are covered by all the clusters minus the overlap, if any. The corresponding definitions are shown in the equations below:

    Density(cluster i) = Σ Traffic Volumes(cluster i) / Grid Cell Number(cluster i),

    Coverage(cluster i) = Σ Traffic Volumes(cluster i) / Σ Traffic Volumes,

    Total Coverage = Σ Coverage(cluster i) − Overlaps,

    Proportion of Cluster(i) Size (Balance) = Grid Cell Number(cluster i) / Σ Grid Cell Number.    (4)
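As a concrete illustration, the four measures can be computed directly from per-cluster lists of grid-cell traffic volumes. The following is a minimal sketch; the toy data and function names are invented for this example, not taken from the paper's implementation:

```python
# Sketch of the evaluation metrics in equation (4).
# Each cluster is represented as a list of per-grid-cell traffic volumes.

def density(volumes):
    # average traffic volume per grid cell in the cluster
    return sum(volumes) / len(volumes)

def coverage(volumes, total_volume):
    # proportion of the dataset's traffic volume covered by the cluster
    return sum(volumes) / total_volume

def balance(volumes, total_cells):
    # proportion of all grid cells that fall into the cluster
    return len(volumes) / total_cells

clusters = [[30.0, 50.0], [20.0, 20.0, 40.0]]   # two toy clusters
total_volume = sum(sum(c) for c in clusters)    # 160.0
total_cells = sum(len(c) for c in clusters)     # 5

print(density(clusters[0]))                 # 40.0
print(coverage(clusters[0], total_volume))  # 0.5
print(balance(clusters[1], total_cells))    # 0.6
```

Total coverage then follows by summing the per-cluster coverages and subtracting any overlapping portion, as in equation (4).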
6.2. Comparison of Experimental Results. After conducting a number of experiment runs, we select four different formats of datasets on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighborhood cells over a grid are merged into a single unit; and RasterP (25 grids) means every five neighborhood cells over a grid are merged into one. In the latter two formats, the data are laid directly on a grid, and noises such as outlier values are eliminated from the grid; we selected grids of sizes 16 and 25 for these two formats. The original datasets are then encoded in the four different data formatting types. The four formatted datasets are subjected to the five clustering methods and the LP method, and we measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively.
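The RasterP coarsening can be pictured as a block merge over the raster grid. The sketch below assumes a sum-based merge of each 2 x 2 neighborhood for the 16-grid variant; the paper does not state whether volumes are summed or averaged, so the merging rule here is an assumption:

```python
def rasterp(grid, block=2):
    # Merge each block x block neighborhood of cells into a single unit by
    # summing its traffic volumes; the grid dimensions are assumed to be
    # divisible by the block size.
    rows, cols = len(grid), len(grid[0])
    return [[sum(grid[r + i][c + j]
                 for i in range(block) for j in range(block))
             for c in range(0, cols, block)]
            for r in range(0, rows, block)]

grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(rasterp(grid))  # [[14, 22], [46, 54]]
```

Using block=5 on a larger grid would give the RasterP (25 grids) analogue, where every twenty-five cells are abstracted into one.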
According to Table 3, KM spent the least running time on all four kinds of data, and the runs on the RasterP (25 grids) dataset were the fastest. Contrariwise, clustering the vector dataset with the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the different datasets and DBScan took the longest.
In Table 4 we evaluate the log-likelihood of the clusters found by each method, which is a main metric for quantitatively assessing the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset with the HC method gives the best value, whereas clustering RasterP (25 grids) with DBScan gives the worst.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we also enlarge the dataset by duplicating the data map into progressively larger sizes. The resulting running time trends are shown in Table 7, and the corresponding trend lines are plotted in Figure 14.
According to Table 5, KM spent the shortest running time on the four different formats of data, and the runs on the RasterP (25 grids) dataset were the fastest, which is expected because it abstracts every 25 cells into one. On
Figure 14: Comparison of running time (in seconds) for different sizes of dataset, with exponential trend lines for K-means, Hierarchical, DBScan, XMean, EM, and LP.
the other hand, clustering the Raster dataset with the DBScan method took the most running time. Across the six methods, KM spent the shortest time on the different datasets and DBScan generally the longest.
In Table 6, we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering the Raster dataset with the HC method gives the best value, whereas clustering RasterP (25 grids) with KM gives the worst.
In Table 7, we can see that the slowest method is DBScan and the quickest is KM. In terms of time trends, DBScan's time consumption grows at a larger magnitude than the other methods', whereas the trend lines of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: once the size of the dataset exceeds the amount at the intersection, the EM method becomes a better choice than HC.
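The crossover can also be located numerically once trend curves have been fitted. The sketch below uses two hypothetical exponential trend lines (the coefficients are invented for illustration, not the fitted values from Figure 14) and finds the dataset size at which they intersect by bisection:

```python
import math

def hc_time(n):
    # hypothetical fitted trend for HC: time = a * exp(b * n)
    return 0.5 * math.exp(6e-5 * n)

def em_time(n):
    # hypothetical fitted trend for EM
    return 2.0 * math.exp(4e-5 * n)

def crossover(f, g, lo, hi, tol=1e-3):
    # bisection on f - g; assumes exactly one sign change on [lo, hi]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(lo) - g(lo)) * (f(mid) - g(mid)) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n_star = crossover(hc_time, em_time, 1, 200_000)
# for datasets larger than n_star grid cells, EM is the cheaper method
```

With these made-up coefficients the curves cross near n = ln(4)/2e-5, i.e., at roughly 69,000 grid cells; with the real fitted coefficients the same search gives the point beyond which EM overtakes HC.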
The following charts and tables present the other technical indicators, namely the coverage, density, and balance of each cluster, for the two datasets.
From Figure 15, we can see that one cluster of DBScan has the biggest coverage among all the clusters produced by the six methods on the first dataset, whereas for the second dataset the LP method yields the biggest-coverage cluster. Generally, the individual coverage of each cluster in the second dataset is apparently larger than in the first dataset (Tables 8 and 9). This means that the second dataset, with its even data distribution, is better suited for forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a), we can see that one cluster of EM has the biggest density among all the clusters of the six methods on the first dataset, but the LP method obtains the largest total density, drawn evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups of high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset; DBScan has the advantage of merging scattered data into dense groups, as long as the data are well scattered.
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can be an individual measure to decide whether a method is good or not in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM        EM        DBScan    XM        HC        LP
Cluster 0      5.258648  0.080823  4.426289  3.431892  2.713810  1.677869
Cluster 1      1.161390  2.329182  0.994949  1.375497  3.501739  1.296230
Cluster 2      7.186556  2.545750  0.807500  1.218667  2.728017  9.703279
Cluster 3      2.572683  1.232386  1.062069  5.171040  4.265905  9.034426
Cluster 4      5.969350  1.42054   0.170455  1.510576  4.088438  1.239180
Total density  1.204343  1.400359  4.729787  1.146972  1.030703  6.087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM        XM           EM           DBScan       HC           LP
Cluster 0      1.925445  2.476642081  3.96813638   1.972394643  5.323785326  3.31318
Cluster 1      1.972395  1.763496208  1.502698729  1.972394643  2.140482869  1.66788
Cluster 2      1.408149  1.06489095   1.629795665  1.437189548  1.823821619  8.097989
Cluster 3      3.060449  6.293956697  2.015105986  1.636350955  7.9912225    2.474492
Cluster 4      1.773937  1.058346213  1.275299493  1.212317249  6.856982634  1.56958
Total density  3.896873  3.486653421  6.819713511  8.230647036  5.981503534  5.440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell number between the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if a very wide coverage is of priority and the others are of less concern, ω_c can take a relatively very large value, or even 1. If users consider that some attributes are more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, which is the sum of all factors multiplied by the corresponding performance indicators, is a net indicator signifying how good a clustering process is, considering all the performance attributes:
    G_l = |Likelihood / Time|,    (5)

    G_b = Difference of Balance / Time,    (6)

    G_d = Density / Time,    (7)

    G_c = Coverage / Time,    (8)

    G_o = Overlap / Time,    (9)

    G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o,    (10)

    Constraint: ω_l + ω_b + ω_d + ω_c + ω_o = 1.    (11)
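Indicators (5) to (11) translate directly into code. In the sketch below the metric values and the equal weights are illustrative placeholders, not measurements from the experiments:

```python
def g_net(time, log_likelihood, balance_diff, density, coverage, overlap,
          weights):
    # performance indicators (5)-(9), each normalized by running time
    g_l = abs(log_likelihood / time)
    g_b = balance_diff / time
    g_d = density / time
    g_c = coverage / time
    g_o = overlap / time
    w_l, w_b, w_d, w_c, w_o = weights
    # constraint (11): the weights must sum to one
    assert abs(w_l + w_b + w_d + w_c + w_o - 1.0) < 1e-9
    # net indicator (10): weighted sum of the five indicators
    return w_l * g_l + w_b * g_b + w_d * g_d + w_c * g_c + w_o * g_o

score = g_net(time=0.5, log_likelihood=-17.0, balance_diff=2.0,
              density=40.0, coverage=0.6, overlap=0.0,
              weights=(0.2, 0.2, 0.2, 0.2, 0.2))
print(round(score, 2))  # 23.84
```

Shifting weight onto one term (e.g., ω_c close to 1) reproduces the coverage-first preference discussed above.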
From the results of the spatial grouping experiments in the previous sections, we obtain some statistical information on each group, based on the second dataset, as the range of indicators depicted in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best running time and no overlap. XM, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between the clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net, assuming equal weights, for each method. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This has been tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the results are not so consistent, as LP is outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which imply priorities or preferences over the performance aspects, should be chosen at the user's discretion.
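The normalization step described above (the lowest G_net becomes base value 1 and the rest are scaled accordingly) can be sketched as follows; the raw scores are invented for illustration, not the computed values behind Table 13:

```python
# Illustrative raw G_net scores for the six methods (placeholder values).
raw = {"KM": 5.40, "XM": 5.75, "EM": 5.55, "DBScan": 6.15, "HC": 5.00, "LP": 6.60}

base = min(raw.values())                       # the lowest score maps to 1.00
normalised = {m: v / base for m, v in raw.items()}

print(round(normalised["HC"], 2))  # 1.0
print(round(normalised["LP"], 2))  # 1.32
```

This min-based scaling preserves the ranking of the methods while making the ratios between them easy to read off.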
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the values collected from the sensors indicate how important they are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage  Density   Time   Log-likelihood  Overlap  Diff. of balance
KM       0.595751  3.896873   0.41  −1735           No       190
XM       0.533037  3.486653   0.67  −1722           No       185
EM       0.507794  6.819714   1.23  −1657           Yes      1216
DBScan   0.461531  8.230647  15.67  −1754           Yes      2517
HC       0.677124  5.981504  14.78  −2013           Yes      103
LP       0.711025  5.440447   7.76  N/A             No       0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32
purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods that identify such optimal spatial groups, of certain sizes and positions, that obtain the maximum total coverage, using clustering algorithms or their equivalent. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient for achieving the optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, there has been no study reported in the literature, that the authors are aware of, using the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage and completely avoid overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to further enhance the algorithm, for example by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. It would be good if the advantages of one algorithm could carry over to the others in the new fusion algorithms to be developed.
References
[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.
[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.
[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.
[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.
[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.
[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.
[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.
[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.
[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.
[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.
[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.
[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.
[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.
[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.
[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.
[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.
[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.
[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.
[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.
[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.
[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale, event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.
[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.
[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.
[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.
10 International Journal of Distributed Sensor Networks
Table 6 Comparison for log-likelihood of second dataset
Formats KM HC DBScan XM EMVector database minus1735412 minus1962367 minus1753576 minus1721513 minus1657263Raster database minus1815926 minus2012568 minus1970756 minus1815791 minus1848209RasterP (16 grids) minus1551437 minus1724736 minus1637147 minus1701283 minus1566231RasterP (25 grids) minus1484761 minus1663789 minus1509146 minus1667312 minus1647823
1
2
5
4
41
3
32
5
(a)
1
2 3
4
5
4 51
3
2
(b)
5
5
4
1
4
1 3
3
2
2
5
(c)
2
2
4
3
3
1 5
5
4
1
(d)
Figure 10 (a) Clustering results for the first dataset with setting case (2) where weight of V is 40 top half uses KM clustering methodand bottom half uses HC method (b) Clustering results for the first dataset with setting case (3) where weight of V is 50 top half uses KMclustering method and bottom half uses HC method (c) Clustering results for the first dataset with setting case (7) where weight of V is 0top half uses KM clustering method and bottom half uses HC method (d) Clustering results for the first dataset with setting case (3) whereall share the same weight except 119892
119894(V119894= 0) top half uses KM clustering method and bottom half uses HC method
Table 7 Comparison of running time (in seconds) of four differentsizes of dataset
Dataset size KM HC DBScan XM EM LP100 grid cells 006 007 105 219 321 0184600 grid cells 042 295 3989 273 1905 93710000 grid cells 262 4667 9755 297 3785 242180000 grid cells 1975 18961 684 647 19831 9083
in Figure 10(c) is the best showing only the one with distinctposition attributes (119909 and 119910) The other three results (Figures10(a) 10(b) and 10(d)) are stained with cluster overlapsTherefore allocation of critical resource for example in eachcluster may result in a waste of resources The degree ofoverlap is the least in the result of Figure 10(b) If only locationis being considered the result of Figure 10(c) is the bestchoice Otherwise the result in Figure 10(b) is better than theother two for the sake of cluster distribution
The clustering results of the second dataset performanceby using the two methods KM and HC are shown inFigure 11
From the results of the cluster distribution of the seconddataset obtained by both clustering methods the size of eachcluster is more or less similar which is better than that ofthe first dataset And there is no overlap phenomenon inthe KM results This is a promising feature of KM methodfor spatial clustering However there is little overlap in theresult of HC method as the clusters seem to take irregularshapes Above all for the second dataset KM is a better choicefor consideration of even cluster distribution and overlapavoidance by using both clustering methods
53 Results of Grouping In this part we compare the coloredmap of Raster (119909 119910 V) data model in two datasets usingfive clustering methods in Weka and the LP method Thecommon requirement is no overlap for each of the resultingmaps The number of cluster is arbitrarily chosen at five The
International Journal of Distributed Sensor Networks 11
Table 8 Numeric results of coverage of each cluster by using the six methods for dataset 1
Cov-db1 KM EM DBScan XM HC LPCluster 0 0029436 0003786 0017902 0075178 0013153 0028985Cluster 1 0301538 0269602 0208078 0049761 0026016 0377034Cluster 2 0215277 0001627 0158439 0084049 012436 0080099Cluster 3 0046788 0096221 0079177 0209390 0001172 0217204Cluster 4 0002712 0161799 0044197 0043152 03043 0007704Total coverage 0595751 0533036 0507793 0461531 0469 0711025
41 5
3 2
(a)
4
13
5
2
(b)
Figure 11 (a) Clustering results for the second dataset by usingKMmethod (b) Clustering results for the second dataset by usingHCmethod
result of first dataset is shown in Figure 12 The first part (i)of Figure 12 shows the spatial clustering result the secondpart (ii) visualizes the corresponding spatial groups by using(a) EM method (b) KM method (c) HC method (d) XMmethod and (e) DBScan method The centers of the clustersare computed after clustering is done and then the groupsare visualized over the clustering results according to thecomputed centers
In Figure 12 for the results of (a) and (e) the sizes ofclusters are quite uneven more than half of dataset fall intoone cluster Thus this result reveals a fact that the techniquecannot organize a dataset into homogeneous andor well-separated groups with respect to a distance or equivalentlya similarity measureThe corresponding groups have overlapphenomenon too For the result of (c) the sizes of the clustersare uneven too For the result of (b) and (d) the sizes of clusterseem to be similar to each other There is also no overlapin the clustering result but for group result the groups in(d) have far more overlaps than those in (b) Overlap meanssome part or the cluster gets in the way of another onewhichmeans that there is superposition between two ormoredifferent clusters Again itmay cause resourcewaste and evenfalse allocation This situation occurs in important fields ofapplications such as information retrieval (several thematicfor a single document) and biological data (several metabolicfunctions for one gene) For this reason (b) is better than (d)According to the above analysis for the result of clusteringand corresponding groups (d) XM is so far the best choiceof clustering algorithm as evidenced by the colored mapsthereafter
With the same experiment setup and operating environ-ment the spatial clustering experiments are performed overthe second dataset The results of second dataset are shown
in Figure 13 where (i) represents the spatial clustering resultand (ii) represents the corresponding spatial group by using(a) EM method (b) KM method (c) HC method (d) XMmethod and (e) DBScan method
In Figures 13(a) and 13(e) it is noticed that the clustersare imbalanced and there are overlaps in the correspondingspatial groups using the method of (a) EM and (e) DBScanThe results of (b) KM and (d) XM however avoid theshortcomings of (a) and (e) though they still have slightoverlaps For (c) HC we remove the empty cells in theboundary to reduce the size of dataset the clustering resultis perfect There is no overlap and clusters are balancedbetween each other But there is still overlap in the spatialgroups Thus LP method is adopted to solve this problemand in possession of same size of groups The result of LPmethod yields perfectly balanced groups without any overlapas shown in Figure 13(f)
By visually comparing the clustering results of the twodatasets the clustering results seem to be similar but thespatial groups are somewhat differentOccurrence of overlapsin spatial groups is more severe in the first dataset than in thesecond one The overlaps are likely due to data distributionand balance in sizes between each cluster For the first datasetthe heavy spatial values which are traffic volumes in this caseare mainly concentrated in the center region (city center) solocations of the computed clusters tend to cram very near ata crowded spot In contrast the traffic volumes in the seconddataset are dispersed over a large area As seen from the visualspatial groups of the second dataset the cluster positions area little far apart when compared to those in the first dataset
Based on the results generated from the clustering andLP experiments some statistic information of dataset 2 iscollected and it is shown in Table 2 The numeric results in
12 International Journal of Distributed Sensor Networks
(i)
(ii)
(a)
(i)
(ii)
(b)
(i)
(ii)
(c)
(ii)
(i)
(d)
(i)
(ii)
(e)
Figure 12 (a) (i) Spatial clustering on dataset 1 by using EM (ii) Spatial groups on dataset 1 from the results of using EM (b) (i) spatialclustering on dataset 1 by using KM (ii) Spatial groups on dataset 1 from the results of using KM (c) (i) spatial clustering on dataset 1 byusing HC (ii) Spatial groups on dataset 1 from the results of using HC (d) (i) spatial clustering on dataset 1 by using XM (ii) Spatial groupson dataset 1 from the results of using XM (e) (i) spatial clustering on dataset 1 by using DBScan (ii) Spatial group on dataset 1 from theresults of using DBScan
Table 3 support the qualitative analysis by visual inspectionin the previous section By comparingHC and LPmethods asan example the quantitative results show that they have thegreatest differences in cell numbers covered by the clustersalso the amount of overlap in HC is the highest of all By theLP method the size of each cluster is exactly the same andthey are totally free from overlap
6 Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. To assess the quality of the spatial groups produced by clustering, several evaluation factors are defined here: running time (time for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is assumed to be five for all six methods. Running time is the time taken to run each method to completion using the same software on the same computer. Balance measures the sizes of the groups; if balanced, every group has the same size. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, called the log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that some spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster is the proportion of traffic volumes covered by the grid cells within the cluster over the whole dataset; meanwhile, total coverage is the sum of
Figure 13: (a) (i) Spatial clustering on dataset 2 by using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 by using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 by using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 by using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 by using DBScan; (ii) spatial group on dataset 2 from the results of using DBScan. (f) Spatial group by the LP method on dataset 2.
traffic volumes that are covered by all the clusters minus the overlap, if any. The corresponding definitions are shown in the equations below:
Density (cluster i) = Σ Traffic Volumes (cluster i) / Grid Cell Number (cluster i),

Coverage (cluster i) = Σ Traffic Volumes (cluster i) / Σ Traffic Volumes,

Total Coverage = Σ_i Coverage (cluster i) − Overlap,

Proportion of Cluster i Size (Balance) = Grid Cell Number (cluster i) / Σ Grid Cell Number.

(4)
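The measures in (4) can be sketched in a few lines of code. This is an illustrative reading, not the paper's implementation: each grid cell is assumed to carry a traffic volume and the set of clusters covering it, and overlap is taken as the share of volume claimed by more than one cluster. All names are made up for the sketch.

```python
# Minimal sketch of the evaluation measures in (4).
# Assumed input: each grid cell is a (volume, labels) pair, where labels is
# the set of cluster ids covering that cell (more than one id -> overlap).

def evaluate(cells):
    clusters = sorted({c for _, labels in cells for c in labels})
    total_cells = len(cells)
    total_volume = sum(v for v, _ in cells)
    # Overlap: share of total volume claimed by more than one cluster.
    overlap = sum(v for v, labels in cells if len(labels) > 1) / total_volume

    density, coverage, balance = {}, {}, {}
    for c in clusters:
        member = [(v, ls) for v, ls in cells if c in ls]
        vol = sum(v for v, _ in member)
        density[c] = vol / len(member)          # average volume per grid cell
        coverage[c] = vol / total_volume        # share of all traffic volume
        balance[c] = len(member) / total_cells  # share of all grid cells
    total_coverage = sum(coverage.values()) - overlap
    return density, coverage, balance, total_coverage

# Toy grid: 4 cells; clusters 0 and 1 share one cell, one cell is unclustered.
cells = [(10, {0}), (30, {0, 1}), (40, {1}), (20, set())]
density, coverage, balance, total = evaluate(cells)
print(coverage[1])       # 0.7  (70 of 100 volume units fall in cluster 1)
print(round(total, 2))   # 0.8  (0.4 + 0.7 summed coverage, minus 0.3 overlap)
```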
6.2. Comparison Experimental Result. After conducting a number of experiment runs, we select four different formats of datasets on which to perform the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighborhood cells over a grid are merged into a single unit; and RasterP (25 grids) means every five neighborhood cells over a grid are merged as one. In the latter two formats, the data are laid straightforwardly on a grid, and some noise, such as outlier values, is eliminated from the grid; we selected grids of sizes 16 and 25 for these two formats. The original datasets are then encoded by the four data formatting types. The four formatted datasets are subjected to the five clustering methods and the LP method, and we measure the corresponding running time and log-likelihood. The results of the two measurements are shown in Tables 3 and 4, respectively.
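The four encodings above can be illustrated with a short sketch. The exact merging scheme is not spelled out in the text, so this assumes a row-major cell sequence for the Vector format and square 2x2 (4-cell) and 5x5 blocks for the two RasterP variants; function names are illustrative.

```python
import numpy as np

# Sketch of the data encodings, assuming "RasterP (16 grids)" merges 2x2
# neighborhoods (4 cells) and "RasterP (25 grids)" merges 5x5 blocks.

def vector_to_raster(vec, width):
    """Vector (n, v) -> Raster (x, y, v) on a row-major grid of given width."""
    return [(n % width, n // width, v) for n, v in vec]

def rasterp(grid, block):
    """Merge block x block neighborhoods of a 2-D volume grid into one cell."""
    h, w = grid.shape
    h, w = h - h % block, w - w % block          # drop ragged border cells
    g = grid[:h, :w].reshape(h // block, block, w // block, block)
    return g.sum(axis=(1, 3))                    # aggregate volume per block

vec = [(0, 5), (1, 7), (2, 1), (3, 3)]
print(vector_to_raster(vec, 2))   # [(0, 0, 5), (1, 0, 7), (0, 1, 1), (1, 1, 3)]

grid = np.arange(16).reshape(4, 4).astype(float)
print(rasterp(grid, 2))           # [[10. 18.] [42. 50.]]
```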
According to Table 3, KM spent the least running time on the four different kinds of data, and the run time on the RasterP (25 grids) dataset is the fastest. Conversely, clustering the vector dataset with the DBScan method took the longest running time. Among the clustering methods, KM spent the least time across the datasets and DBScan the longest.
In Table 4, we evaluate the log-likelihood of the clusters found by each method, which is a main metric for quantitatively ensuring the quality of the clusters. From this table, we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset with the HC method is the best, while clustering RasterP (25 grids) with DBScan is the worst.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test the performance, we enlarge the dataset by expanding the data map via duplication. Running time trends are thereby produced; the results are shown in Table 7, and the corresponding trend lines are shown in Figure 14.
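The duplication step can be sketched as follows. The paper does not specify how the copies are placed, so this assumes the (x, y, volume) map is tiled horizontally, shifting each copy's x coordinates so the enlarged map remains a valid grid; the sample points are made up.

```python
import time
import numpy as np

# Sketch of the stress test: tile the data map to 2x, 4x, ... its original
# size, then time any clustering routine on each enlarged copy.

def enlarge(points, times):
    """points: (N, 3) int array of (x, y, v). Tile the map horizontally."""
    w = points[:, 0].max() + 1
    copies = []
    for k in range(times):
        shifted = points.copy()
        shifted[:, 0] += k * w        # shift x so copies do not overlap
        copies.append(shifted)
    return np.vstack(copies)

def timed(cluster_fn, data):
    """Wall-clock seconds spent by one clustering run."""
    t0 = time.perf_counter()
    cluster_fn(data)
    return time.perf_counter() - t0

base = np.array([[0, 0, 5], [1, 0, 7], [0, 1, 1]])
big = enlarge(base, 4)
print(big.shape)        # (12, 3)
print(big[:, 0].max())  # 7 -> four copies, each 2 cells wide
```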
According to Table 5, KM spent the shortest running time on the four different formats of data, and the time on the RasterP (25 grids) dataset is the fastest, which is expected because it abstracts every 25 cells into one. On
Figure 14: Comparison of running time (in seconds) for different sizes of dataset.
the other hand, clustering the Raster dataset with the DBScan method took the most running time. Across the six methods, KM spent the shortest time on the different datasets and DBScan generally the longest.
In Table 6, we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering the Raster dataset with the HC method is the best, while clustering RasterP (25 grids) with KM is the worst.
In Table 7, we can see that the slowest is DBScan and the quickest is the KM method. In terms of time trend, DBScan's time consumption increases in larger magnitude than the other methods', while the time trends of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: once the dataset size exceeds the amount at the intersection, the EM method becomes a better choice than HC.
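The exponential trend lines in Figure 14 and their crossing point can be reproduced with a standard log-linear fit: regress log(time) on size to get y = a·exp(b·x), then solve for where two fitted curves meet. The timing values below are invented for illustration only.

```python
import numpy as np

# Fit y = a * exp(b * x) by least squares on log(y), then locate where two
# methods' fitted trends cross (the HC/EM intersection discussed above).

def fit_exp(sizes, times):
    b, log_a = np.polyfit(sizes, np.log(times), 1)
    return np.exp(log_a), b             # y = a * exp(b * x)

def crossing(fit1, fit2):
    (a1, b1), (a2, b2) = fit1, fit2
    return np.log(a2 / a1) / (b1 - b2)  # x where a1*e^(b1 x) = a2*e^(b2 x)

sizes = np.array([10_000, 20_000, 40_000, 80_000])
hc = np.array([30.0, 55.0, 110.0, 260.0])   # hypothetical HC timings (s)
em = np.array([60.0, 80.0, 115.0, 210.0])   # hypothetical EM timings (s)

x = crossing(fit_exp(sizes, hc), fit_exp(sizes, em))
# beyond roughly this dataset size, the EM trend undercuts the HC trend
print(round(x))
```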
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.
From Figure 15, we can see that one cluster of DBScan has the biggest coverage among all clusters resulting from the six methods on the first dataset, but for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is noticeably larger than in the first dataset (Tables 8 and 9). This means that the second dataset, with its even data distribution, is suitable for achieving spatial groups with the six methods. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far an effective method to determine spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1       0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2       0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3       0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4       0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage  0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a), we can see that one cluster of EM has the biggest density among all clusters of the six methods on the first dataset, but the LP method obtains the largest total density, drawn evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second dataset; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely absolute balance for the spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell number between the clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the others are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is, considering all the performance attributes:
G_l = |Likelihood / Time|,   (5)

G_b = Difference of Balance / Time,   (6)

G_d = Density / Time,   (7)

G_c = Coverage / Time,   (8)

G_o = Overlap / Time,   (9)

G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o,   (10)

Constraint: ω_l + ω_b + ω_d + ω_c + ω_o = 1.   (11)
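Indicators (5) to (11), and the base-1 normalization used later for Table 13, can be sketched directly. This is an illustrative reading: it assumes equal weights of 0.2, a 0/1 encoding for the Yes/No overlap column, and made-up raw scores for the normalization step.

```python
# Sketch of indicators (5)-(11) with assumed equal weights (0.2 each).
# Overlap is encoded as 0/1; all sample numbers below are hypothetical.

def g_net(coverage, density, time, loglik, overlap, diff_balance,
          w=(0.2, 0.2, 0.2, 0.2, 0.2)):
    g_l = abs(loglik / time)            # (5)
    g_b = diff_balance / time           # (6)
    g_d = density / time                # (7)
    g_c = coverage / time               # (8)
    g_o = overlap / time                # (9)
    w_l, w_b, w_d, w_c, w_o = w         # must sum to 1, per (11)
    return w_l*g_l + w_b*g_b + w_d*g_d + w_c*g_c + w_o*g_o   # (10)

# Normalization as in Table 13: set the lowest G_net to base value 1,
# then scale the others accordingly (raw scores here are invented).
raw = {"HC": 2.5, "KM": 2.7, "LP": 3.3}
base = min(raw.values())
norm = {m: round(s / base, 2) for m, s in raw.items()}
print(norm)   # {'HC': 1.0, 'KM': 1.08, 'LP': 1.32}
```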
From the results of the spatial grouping experiments in the previous sections, we obtain some statistical information on each group based on the second dataset, expressed as the range of indicators depicted in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best run time and no overlap. The XM method, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between its clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed for each method to obtain the net performance value G_net, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This held across different datasets, different formats, and different dataset sizes. However, for density and log-likelihood the result is not so consistent, as LP is outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under overall consideration of the six performance factors. The weights, which imply priorities or preferences on the performance aspects, should be chosen at the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the data collected from the sensors indicate how important the values are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method  Coverage  Density  Time   Log-likelihood  Overlap  Diff. of balance
KM      0.595751  3896873  0.41   −1735           No       190
XM      0.533037  3486653  0.67   −1722           No       185
EM      0.507794  6819714  1.23   −1657           Yes      1216
DBScan  0.461531  8230647  15.67  −1754           Yes      2517
HC      0.677124  5981504  14.78  −2013           Yes      103
LP      0.711025  5440447  7.76   NA              No       0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32
purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods to identify such optimal spatial groups, of certain sizes and positions, using clustering algorithms or their equivalent, for obtaining the maximum total coverage. Some examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by different methods are sufficiently efficient for achieving the optimal maximum coverage. Five classic spatial clustering algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may waste resources and even cause false grouping. However, to the authors' knowledge, no study reported in the literature has used the linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research, we implemented this new method (LP) to obtain spatial groups that yield maximum coverage and completely avoid overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to further enhance the algorithm, for instance by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. Ideally, in the fusion algorithms to be developed, the advantages of one algorithm would carry over to the others.
Table 8: Numeric results of coverage of each cluster by using the six methods for dataset 1.

Cov-db1         KM        EM        DBScan    XM        HC        LP
Cluster 0       0.029436  0.003786  0.017902  0.075178  0.013153  0.028985
Cluster 1       0.301538  0.269602  0.208078  0.049761  0.026016  0.377034
Cluster 2       0.215277  0.001627  0.158439  0.084049  0.124360  0.080099
Cluster 3       0.046788  0.096221  0.079177  0.209390  0.001172  0.217204
Cluster 4       0.002712  0.161799  0.044197  0.043152  0.304300  0.007704
Total coverage  0.595751  0.533036  0.507793  0.461531  0.469000  0.711025
Figure 11: (a) Clustering results for the second dataset by using the KM method. (b) Clustering results for the second dataset by using the HC method.
The result for the first dataset is shown in Figure 12. The first part (i) of Figure 12 shows the spatial clustering result, and the second part (ii) visualizes the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method. The centers of the clusters are computed after clustering is done, and the groups are then visualized over the clustering results according to the computed centers.
In Figure 12, for the results of (a) and (e), the sizes of the clusters are quite uneven; more than half of the dataset falls into one cluster. This reveals that the technique cannot organize the dataset into homogeneous and/or well-separated groups with respect to a distance or, equivalently, a similarity measure. The corresponding groups exhibit the overlap phenomenon too. For the result of (c), the sizes of the clusters are also uneven. For the results of (b) and (d), the sizes of the clusters seem similar to each other, and there is no overlap in the clustering result; for the group result, however, the groups in (d) have far more overlaps than those in (b). Overlap means that part of one cluster gets in the way of another, that is, there is superposition between two or more different clusters. Again, it may cause resource waste and even false allocation. This situation occurs in important fields of application such as information retrieval (several themes for a single document) and biological data (several metabolic functions for one gene). For this reason, (b) is better than (d). According to the above analysis of the clustering results and corresponding groups, (d) XM is so far the best choice of clustering algorithm, as evidenced by the colored maps thereafter.
With the same experiment setup and operating environment, the spatial clustering experiments were performed on the second dataset. The results are shown in Figure 13, where (i) represents the spatial clustering result and (ii) the corresponding spatial groups, by using (a) the EM method, (b) the KM method, (c) the HC method, (d) the XM method, and (e) the DBScan method.

In Figures 13(a) and 13(e), it is noticed that the clusters are imbalanced and there are overlaps in the corresponding spatial groups produced by (a) EM and (e) DBScan. The results of (b) KM and (d) XM avoid the shortcomings of (a) and (e), though they still have slight overlaps. For (c) HC, we remove the empty cells on the boundary to reduce the size of the dataset, and the clustering result is perfect: there is no overlap, and the clusters are balanced with each other. But there is still overlap in the spatial groups. Thus, the LP method is adopted to solve this problem while keeping the groups the same size. The LP method yields perfectly balanced groups without any overlap, as shown in Figure 13(f).
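The idea behind the LP result can be sketched on a toy instance: force k equal-size groups with no overlap by giving each group center exactly n/k "slots" and assigning every cell to one slot at minimum total squared distance. Brute-force enumeration stands in for the LP solver here, so the instance must stay tiny; the cells and centers below are made up.

```python
from itertools import permutations

# Toy sketch of balanced, overlap-free grouping: each of k centers gets
# exactly len(cells) // k slots, and cells are assigned to slots at minimum
# total squared distance (brute force in place of an LP solver).

def balanced_groups(cells, centers):
    k = len(centers)
    slots = [c for c in range(k) for _ in range(len(cells) // k)]

    def cost(assign):
        return sum((x - centers[g][0]) ** 2 + (y - centers[g][1]) ** 2
                   for (x, y), g in zip(cells, assign))

    return min(permutations(slots), key=cost)

cells = [(0, 0), (0, 1), (5, 5), (5, 6)]
centers = [(0, 0), (5, 5)]
print(balanced_groups(cells, centers))   # (0, 0, 1, 1): two cells per group
```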
By visually comparing the clustering results of the twodatasets the clustering results seem to be similar but thespatial groups are somewhat differentOccurrence of overlapsin spatial groups is more severe in the first dataset than in thesecond one The overlaps are likely due to data distributionand balance in sizes between each cluster For the first datasetthe heavy spatial values which are traffic volumes in this caseare mainly concentrated in the center region (city center) solocations of the computed clusters tend to cram very near ata crowded spot In contrast the traffic volumes in the seconddataset are dispersed over a large area As seen from the visualspatial groups of the second dataset the cluster positions area little far apart when compared to those in the first dataset
Based on the results generated from the clustering andLP experiments some statistic information of dataset 2 iscollected and it is shown in Table 2 The numeric results in
12 International Journal of Distributed Sensor Networks
(i)
(ii)
(a)
(i)
(ii)
(b)
(i)
(ii)
(c)
(ii)
(i)
(d)
(i)
(ii)
(e)
Figure 12 (a) (i) Spatial clustering on dataset 1 by using EM (ii) Spatial groups on dataset 1 from the results of using EM (b) (i) spatialclustering on dataset 1 by using KM (ii) Spatial groups on dataset 1 from the results of using KM (c) (i) spatial clustering on dataset 1 byusing HC (ii) Spatial groups on dataset 1 from the results of using HC (d) (i) spatial clustering on dataset 1 by using XM (ii) Spatial groupson dataset 1 from the results of using XM (e) (i) spatial clustering on dataset 1 by using DBScan (ii) Spatial group on dataset 1 from theresults of using DBScan
Table 3 support the qualitative analysis by visual inspectionin the previous section By comparingHC and LPmethods asan example the quantitative results show that they have thegreatest differences in cell numbers covered by the clustersalso the amount of overlap in HC is the highest of all By theLP method the size of each cluster is exactly the same andthey are totally free from overlap
6 Technical Analysis of Clustering Results
6.1. Experimental Evaluation Method. To assess the quality of the spatial groups produced by clustering, several evaluation factors are defined here: running time ("time" for short), balance, log-likelihood, overlap, density, and coverage. For a fair comparison, the datasets are run in the same software environment on the same computer, and the number of groups is fixed at five for each of the six methods. Running time is the time taken to run each method to completion. Balance measures the sizes of the groups; when perfectly balanced, every group has the same size. Log-likelihood is an important measure of clustering quality: the bigger the value, the better. Weka tests goodness-of-fit by the likelihood in logarithm, which is called the log-likelihood; a large log-likelihood means that the clustering model is suitable for the data under test. Overlap means that spatial values (e.g., traffic volumes sensed by the sensors) belong to more than one cluster. Density is the average spatial value (traffic volume) per grid cell in each cluster. Coverage of a cluster is the proportion of the traffic volumes covered by the grid cells within the cluster over the whole dataset, while total coverage is the sum of the traffic volumes covered by all the clusters minus the overlap, if any. The corresponding definitions are shown in the equations below.

Figure 13: (a) (i) Spatial clustering on dataset 2 using EM; (ii) spatial groups on dataset 2 from the results of using EM. (b) (i) Spatial clustering on dataset 2 using KM; (ii) spatial groups on dataset 2 from the results of using KM. (c) (i) Spatial clustering on dataset 2 using HC; (ii) spatial groups on dataset 2 from the results of using HC. (d) (i) Spatial clustering on dataset 2 using XM; (ii) spatial groups on dataset 2 from the results of using XM. (e) (i) Spatial clustering on dataset 2 using DBScan; (ii) spatial groups on dataset 2 from the results of using DBScan. (f) Spatial groups from the LP method on dataset 2.
\[
\text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)},
\]
\[
\text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Grid Cell Number}},
\]
\[
\text{Total Coverage} = \sum \text{Traffic Volumes} - \text{Overlaps},
\]
\[
\text{Proportion of Cluster } i \text{ Size (Balance)} = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}}. \tag{4}
\]
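The definitions in (4) can be sketched directly in code. The following is a minimal illustration, assuming a hypothetical representation in which each grid cell records its traffic volume and the set of cluster labels that cover it (so overlap is allowed); the structure and names are assumptions for demonstration, not taken from the paper.

```python
# Sketch of the evaluation measures in equation (4).
# Each cell is a dict: {"volume": traffic volume, "clusters": set of cluster ids}.

def density(cells, cluster_id):
    """Average traffic volume per grid cell inside one cluster."""
    members = [c for c in cells if cluster_id in c["clusters"]]
    return sum(c["volume"] for c in members) / len(members)

def coverage(cells, cluster_id):
    """Traffic volume inside the cluster over the total number of grid cells."""
    covered = sum(c["volume"] for c in cells if cluster_id in c["clusters"])
    return covered / len(cells)

def balance(cells, cluster_id):
    """Proportion of grid cells belonging to the cluster."""
    members = sum(1 for c in cells if cluster_id in c["clusters"])
    return members / len(cells)

def total_coverage(cells):
    """Volumes covered by all clusters minus overlaps (each cell counted once)."""
    total = sum(c["volume"] * len(c["clusters"]) for c in cells)
    overlaps = sum(c["volume"] * (len(c["clusters"]) - 1)
                   for c in cells if len(c["clusters"]) > 1)
    return total - overlaps
```

Subtracting `overlaps` removes the extra counts for cells claimed by more than one cluster, matching the "minus the overlap, if any" wording above.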
6.2. Comparison Experimental Result. After conducting a number of experiment runs, we select four different formats of the first dataset on which to run the clustering algorithms. Vector (n, v) represents sequence number n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighboring cells of the grid are merged into a single unit; and RasterP (25 grids) means every five neighboring cells are merged into one. In the latter two formats the data are laid directly on a grid, and some noise, such as outlier values, is eliminated from the grid; we selected grid sizes of 16 and 25 for these two formats. The original datasets are then encoded in the four data formats, and the formatted data are subjected to the five clustering methods and the LP method. We measure the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.
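As an illustration of these formats, the sketch below converts a Vector (n, v) list into Raster (x, y, v) form and then merges neighboring cells into RasterP blocks. The grid width `W` and the 2 x 2 block (four cells merged, as in the RasterP (16 grids) variant) are assumptions for demonstration only.

```python
# Sketch: the data formats described above, assuming the vector sequence n
# enumerates grid cells row by row on a grid of assumed width W.

W = 8  # assumed grid width for the demonstration

def vector_to_raster(vector):
    """Vector (n, v) -> Raster (x, y, v)."""
    return [(n % W, n // W, v) for n, v in vector]

def raster_to_rasterp(raster, block=2):
    """Merge block x block neighborhoods into one RasterP unit, summing volumes."""
    merged = {}
    for x, y, v in raster:
        key = (x // block, y // block)
        merged[key] = merged.get(key, 0) + v
    return [(x, y, v) for (x, y), v in merged.items()]
```

Summing volumes inside each block is one plausible merge rule; averaging would be an equally valid choice depending on how the paper's grids were built.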
According to Table 3, KM spent the least running time on each of the four kinds of data, and its run on the RasterP (25 grids) dataset is the fastest. Conversely, clustering the vector dataset with the DBScan method took the longest running time. Overall, among the clustering methods, KM spent the least time across the different datasets and DBScan the longest.
In Table 4 we evaluate the log-likelihood of the clusters found by each method, a main evaluation metric for quantifying the quality of the clusters. From this table we can see that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset with the HC method gives the best value, while clustering RasterP (25 grids) with DBScan gives the worst.
In the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In order to stress-test performance, we enlarge the dataset by duplicating the data map into larger sizes. Running time trends are thereby produced; the results are shown in Table 7, and the corresponding trend lines are shown in Figure 14.
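The stress test just described (duplicating the data map and timing each method at each size) can be sketched generically. The stand-in clustering function and the coordinate shift used to tile the copies apart are hypothetical; any of the six methods could be plugged in as `cluster_fn`.

```python
import random
import time

# Sketch: stress-testing by duplicating the data map, as described above.

def duplicate_map(points, times):
    """Enlarge the dataset by tiling shifted copies of the original map."""
    out = []
    for t in range(times):
        out.extend((x + t * 1000, y, v) for x, y, v in points)  # shift each copy
    return out

def time_method(cluster_fn, points):
    """Wall-clock time for one clustering run on one dataset size."""
    start = time.perf_counter()
    cluster_fn(points)
    return time.perf_counter() - start

# Example with a trivial stand-in for a clustering method:
base = [(random.random(), random.random(), 1.0) for _ in range(100)]
trend = [time_method(lambda ps: sorted(ps), duplicate_map(base, k)) for k in (1, 2, 4)]
```

Plotting `trend` against dataset size and fitting an exponential curve per method would reproduce trend lines of the kind shown in Figure 14.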
According to Table 5, KM spent the shortest running time on each of the four data formats, and the RasterP (25 grids) dataset runs the fastest, which is expected because it abstracts every 25 cells into one. On the other hand, clustering the Raster dataset with the DBScan method took the most running time. Across the six methods, KM generally spent the shortest time on the different datasets and DBScan the longest.

Figure 14: Comparison of running times (in seconds) for different sizes of dataset: K-means, Hierarchical, DBScan, XMean, EM, and LP, each with an exponential trend line.
In Table 6 we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering the Raster dataset with the HC method is the best, while clustering RasterP (25 grids) with KM is the worst.
In Table 7 we can see that the slowest method is DBScan and the quickest is KM. In terms of time trend, DBScan's time consumption grows at a larger magnitude than the other methods, while the trend lines of LP, KM, and XM have lower gradients. In particular, there is an intersection between the trend lines of HC and EM: once the dataset size exceeds the amount at the intersection, the EM method becomes a better choice than HC.
The following charts and tables present the other technical indicators, such as coverage, density, and balance of each cluster, for the two datasets.
From Figure 15 we can see that, on the first dataset, one cluster of DBScan has the biggest coverage among all clusters produced by the six methods, while for the second dataset the LP method yields the cluster with the biggest coverage. Generally, the individual coverage of each cluster in the second dataset is noticeably larger than in the first dataset (Tables 8 and 9). This means the second dataset, with its even data distribution, is well suited to forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest value on both datasets. In summary, LP is by far an effective method for determining spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2 | KM | EM | DBScan | XM | HC | LP
Cluster 0 | 0.042721 | 0.001777 | 0.450720 | 0.022150 | 0.013153 | 0.165305
Cluster 1 | 0.094175 | 0.086211 | 0.008018 | 0.010064 | 0.026016 | 0.127705
Cluster 2 | 0.328026 | 0.032893 | 0.010517 | 0.126953 | 0.124360 | 0.095597
Cluster 3 | 0.022797 | 0.351221 | 0.000501 | 0.311761 | 0.001172 | 0.089008
Cluster 4 | 0.062281 | 0.101199 | 0.000244 | 0.112973 | 0.304300 | 0.122085
Total coverage | 0.550000 | 0.573301 | 0.470000 | 0.583900 | 0.469000 | 0.599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a) we can see that, on the first dataset, one cluster of EM has the biggest density among all clusters of the six methods, but the LP method obtains the largest total density, drawn evenly from all its clusters. Generally, the individual density of each cluster in the second dataset is much bigger than in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for forming spatial groups of high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.
Figure 18: Proportions of cluster sizes (balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators, defined in (5) to (11), are used to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid cell numbers between clusters. We assign each factor a proportional weight ω to adjust the evaluation result G_net; the ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the other aspects are of less concern, the coverage weight ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of the performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered.

Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density | KM | EM | DBScan | XM | HC | LP
Cluster 0 | 5258648 | 0080823 | 4426289 | 3431892 | 2713810 | 1677869
Cluster 1 | 1161390 | 2329182 | 0994949 | 1375497 | 3501739 | 1296230
Cluster 2 | 7186556 | 2545750 | 0807500 | 1218667 | 2728017 | 9703279
Cluster 3 | 2572683 | 1232386 | 1062069 | 5171040 | 4265905 | 9034426
Cluster 4 | 5969350 | 142054 | 0170455 | 1510576 | 4088438 | 1239180
Total density | 1204343 | 1400359 | 4729787 | 1146972 | 1030703 | 6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density | KM | XM | EM | DBScan | HC | LP
Cluster 0 | 1925445 | 2476642081 | 396813638 | 1972394643 | 5323785326 | 331318
Cluster 1 | 1972395 | 1763496208 | 1502698729 | 1972394643 | 2140482869 | 166788
Cluster 2 | 1408149 | 106489095 | 1629795665 | 1437189548 | 1823821619 | 8097989
Cluster 3 | 3060449 | 6293956697 | 2015105986 | 1636350955 | 79912225 | 2474492
Cluster 4 | 1773937 | 1058346213 | 1275299493 | 1212317249 | 6856982634 | 156958
Total density | 3896873 | 3486653421 | 6819713511 | 8230647036 | 5981503534 | 5440447
\[
G_l = \left| \frac{\text{Likelihood}}{\text{Time}} \right|, \tag{5}
\]
\[
G_b = \frac{\text{Difference of Balance}}{\text{Time}}, \tag{6}
\]
\[
G_d = \frac{\text{Density}}{\text{Time}}, \tag{7}
\]
\[
G_c = \frac{\text{Coverage}}{\text{Time}}, \tag{8}
\]
\[
G_o = \frac{\text{Overlap}}{\text{Time}}, \tag{9}
\]
\[
G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o, \tag{10}
\]
\[
\text{subject to } \omega_l + \omega_b + \omega_d + \omega_c + \omega_o = 1. \tag{11}
\]
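The combination of indicators (5) to (10) under constraint (11) can be sketched compactly. The dictionary field names and the example numbers below are illustrative assumptions, not values from the paper.

```python
# Sketch of the G_net computation from equations (5)-(11).

def g_net(perf, weights):
    """perf: one method's raw measurements; weights: the user-chosen omegas."""
    t = perf["time"]
    g = {
        "l": abs(perf["likelihood"] / t),   # (5) log-likelihood indicator
        "b": perf["balance_diff"] / t,      # (6) balance-difference indicator
        "d": perf["density"] / t,           # (7) density indicator
        "c": perf["coverage"] / t,          # (8) coverage indicator
        "o": perf["overlap"] / t,           # (9) overlap indicator
    }
    # Constraint (11): the weights must sum to 1.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    # Equation (10): weighted sum of the indicators.
    return sum(weights[k] * g[k] for k in g)
```

A user who prioritizes coverage would raise `weights["c"]` (up to 1) and shrink the others accordingly, exactly as the text describes.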
From the results of the spatial grouping experiments in the previous sections, we obtain statistics on each group for the second dataset, expressed as the indicators defined in (5) to (11). They are shown in Table 12, which allows us to compare the various methods and performance aspects easily.
In Table 12, the KM method has the best running time and no overlap, while XM, DBScan, and HC demonstrate their respective advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance among its clusters. To further verify this analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as the base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
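The normalization step just described (lowest G_net becomes the base value 1) can be sketched as follows; the scores are made-up illustrations, not the paper's measurements.

```python
# Sketch: normalise raw G_net scores so the lowest becomes 1 (Table 13 style).

def normalize_gnet(scores):
    """scores: method name -> raw G_net value; returns base-1 scaled values."""
    base = min(scores.values())
    return {method: round(value / base, 2) for method, value in scores.items()}
```

Applied to Table 13, the method with the lowest raw score (HC) maps to 1.00, and every other method's score is expressed as a multiple of it.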
According to the experimental results conducted so far, LP appears to be the best candidate in almost all aspects, such as coverage and balance, as tested across different datasets, different formats, and different dataset sizes. However, for density and log-likelihood the results are less consistent, as LP is outperformed by DBScan at times. Finally, by the net result G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which express priorities or preferences over the performance aspects, should be chosen at the user's discretion.
Table 12: Performance indicators of the six methods based on dataset 2.

Method | Coverage | Density | Time | Log-likelihood | Overlap | Diff. of balance
KM | 0.595751 | 3896873 | 0.41 | −1735 | No | 190
XM | 0.533037 | 3486653 | 0.67 | −1722 | No | 185
EM | 0.507794 | 6819714 | 1.23 | −1657 | Yes | 1216
DBScan | 0.461531 | 8230647 | 15.67 | −1754 | Yes | 2517
HC | 0.677124 | 5981504 | 14.78 | −2013 | Yes | 103
LP | 0.711025 | 5440447 | 7.76 | N/A | No | 0

Table 13: Comparison of the different clustering methods and the LP method by the G_net indicator.

Methods | KM | XM | EM | DBScan | HC | LP
G_net | 1.08 | 1.15 | 1.11 | 1.23 | 1.00 | 1.32

7. Conclusion and Future Works

Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the values collected from the sensors indicate how important the data are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for purposes such as resource allocation, distribution evaluations, or summarizing geographical data into groups. The focus of this study was to design efficient methods for identifying such optimal spatial groups, of certain sizes and positions, using clustering algorithms or their equivalent, so as to obtain maximum total coverage. Examples include, but are not limited to, setting up mobile phone base stations among an even distribution of mobile phone users, each with a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial clustering algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups, with their different values of data resources, were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory); the outcome under the chosen factors and weights may vary, as the factors can be chosen arbitrarily by users.
The spatial groups obtained by classic clustering algorithms have limitations, such as overlaps, which may waste resources and even cause false grouping. To the authors' knowledge, however, no study in the literature has used the linear programming (LP) method to discover spatial groups and overcome this limit of overlapping. Thus, in this research we implemented this new LP method to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future extended study, we want to further enhance the algorithm, for example by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. It would be good if the advantages of one algorithm could carry over to the others in the new fusion algorithms to be developed.
61 Experimental Evaluation Method For the purpose ofassessing how the qualities of spatial groups from clusteringare several evaluation factors are defined here running time(short for time) balance log-likelihood overlap densityand coverage For a fair comparison the datasets are run inthe same software environment of the same computer And
assume the number of groups to be five with six differentmethods Running time is the time we used to run eachmethod using the same software in the same computerto completion Balance is used to measure the sizes ofgroups if balanced the size of each group is the same Log-likelihood is an important measure for clustering qualitythe bigger the value the better Weka tests for goodness-of-fit by the likelihood in logarithm which is called log-likelihood A large log-likelihood means that the clusteringmodel is suitable for the data under test Overlap means thatthe spatial values (eg traffic volumes sensed by the sensors)do belong to more than one cluster Density is the averagespatial values (traffic volumes) per grid cell in each clusterCoverage of a cluster means the proportion of traffic volumesthat are covered by the grid cells within the cluster overthe whole dataset meanwhile total coverage is the sum of
International Journal of Distributed Sensor Networks 13
(i)
(ii)
(a)
(i)
(ii)
(b)
(i)
(ii)
(c)
(i)
(ii)
(d)
(i)
(ii)
(e)
(f)
Figure 13 (a) (i) spatial clustering on dataset 2 by using EM (ii) Spatial groups on dataset 2 from the results of using EM (b) (i) spatialclustering on dataset 2 by using KM (ii) Spatial groups on dataset 2 from the results of using KM (c) (i) spatial clustering on dataset 2 byusing HC (ii) Spatial groups on dataset 2 from the results of using HC (d) (i) spatial clustering on dataset 2 by using XM (ii) Spatial groupson dataset 2 from the results of using XM (e) (i) spatial clustering on dataset 2 by using DBScan (ii) Spatial group on dataset 2 from theresults of using DBScan (f) Spatial group in LP method on dataset 2
14 International Journal of Distributed Sensor Networks
traffic volumes that are covered by all the clusters minus theoverlap if anyThe corresponding definitions are shown in theequations below
Density (cluster 119894) =sumTraffic Volumes (cluster 119894)
Grid Cell Number (cluster 119894)
Coverage (cluster 119894) =sumTraffic Volumes (cluster 119894)
sumGrid Cell Number
Total Coverage = sumTraffic Volumes minusOverlaps
Proportion of Cluster (119894) Size (Balance)
=Grid Cell Number (cluster 119894)
sumGrid Cell Number
(4)
62 Comparison Experimental Result After conducting anumber of experiment runs we select four different formatsof datasets to perform the clustering algorithm for the firstdataset Vector (119899 V) represents sequence 119899 and traffic volumeV Raster (119909 119910 V) represents coordinates (119909 119910) and trafficvolume V RasterP (16 grids) means every four neighborhoodcells over a grid merged into a single unit and RasterP(25 grids) means every five neighborhood cells over a gridmerged as one In the other two types of formats the datainformation is straightforwardly laid on a grid and somenoises such as outlier values are eliminated from the gridWe selected grids of sizes 16 and 25 for the two formatsThe original datasets are then encoded by the four differentdata formatting types The four formatted data are subject tothe five clustering methods and LP method We measure thecorresponding running time and log-likelihood The resultsof the two measurements are shown in Tables 3 and 4respectively
According to Table 3 we can see that KM spent the leastrunning time for the four different kinds of data but the run-time of RasterP (25 grids) dataset is the fastest Contrariwiseclustering of vector dataset using DBScan method spent thelongest running time Among the clustering methods KMspent the least time for different datasets and DBScan tookthe longest
In Table 4 we evaluate the log-likelihood of the clustersfound by each cluster which is a main evaluation metric forensuring quantitatively the quality of the clusters From thistable we can see that the value of log-likelihood of the fivemethods is quite similar Among them clustering of Rasterdataset using HC method is the best one but clustering ofRasterP (25 grids) using DBScan is the worst one
In the same experimental environment the running timeand log-likelihood are shown in Tables 5 and 6 for the seconddataset And in order to stressfully test the performance weelongate the dataset to larger sizes by expanding the datamap via duplication Running time trends are thereforeproduced the result is shown in Table 7 and correspondingtrend line is shown in Figure 14
According to Table 5 we can see that KM spent theshortest running time for the four different formats of databut the time of RasterP (25 grids) dataset is the fastest whichis expected because it abstracts every 25 cells into one On
0
100
200
300
400
500
600
700
800
900
0 20000 40000 60000 80000 100000
K-meansHierarchicalDBScanXMeanEM
Exp (Exp (Hierarchical)Exp (DBScan)Exp (XMean)Exp (EM)Exp (LP)LP
K-means)
Figure 14 Comparison of running time (in seconds) of differentsizes of dataset
the other hand clustering of Raster dataset using DBScanmethod spent the most running time For the different sixmethods KM spent the shortest time for different datasetsand DBScan spent the longest time generally
In Table 6 we can see that the values of log-likelihoodof different six methods are quite similar Among themclustering of Raster dataset using HCmethod is the best onebut clustering of RasterP (25 grids) usingKM is theworst one
In Table 7 we can see that the slowest is DBScan andthe quickest is KM method In terms of time trend DBScanincreases in larger magnitude of time consumption thanother methods but time trends of LP KM and XM are oflower gradients In particular there is an intersection betweenthe trend lines of HC and EM It means that when the size ofdataset exceeds that amount at the intersection EM methodbecomes a better choice than HC
The following charts and tables present the other techni-cal indicators such as coverage density and balance of eachcluster for the two datasets
From Figure 15 we can see that one cluster of DBScandominates the biggest coverage in all clusters as results fromthe sixmethods in the first dataset But for the second datasetLP method yields the biggest coverage cluster Generally theindividual coverage of each cluster in the second dataset isapparently larger than those resulted from the first dataset(Tables 8 and 9)Thismeans that the second dataset is suitablefor achieving spatial groups with the six methods due to itseven data distribution In terms of total coverage LP achievesthe highest values in both cases of datasets In summary LPis by far an effective method to determine spatial groups withthe best coverage
International Journal of Distributed Sensor Networks 15
Table 9 Numeric results of coverage of each cluster by using the six methods for dataset 2
Cov-db2 KM EM DBScan XM HC LPCluster 0 0042721 0001777 0450720 0022150 0013153 0165305Cluster 1 0094175 0086211 0008018 0010064 0026016 0127705Cluster 2 0328026 0032893 0010517 0126953 0124360 0095597Cluster 3 0022797 0351221 0000501 0311761 0001172 0089008Cluster 4 0062281 0101199 0000244 0112973 0304300 0122085Total coverage 0550000 0573301 0470000 0583900 0469000 0599700
Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.
Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a) we can see that, among all clusters of the six methods on the first dataset, one cluster of EM occupies the biggest density, but the LP method obtains the largest total density, drawn evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much bigger than that of the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best results on the second dataset; DBScan has an advantage in merging scattered data into dense groups, as long as the data are well scattered.
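Per-cluster density (Tables 10 and 11) is, per the article's definition (4), a cluster's total traffic volume divided by the number of grid cells assigned to it. A minimal sketch with hypothetical names and values; noise cells, such as those left unassigned by DBScan, are excluded:

```python
def density_per_cluster(cells):
    """Density(i) = total traffic volume of cluster i / number of grid cells
    in cluster i. `cells` maps a cell id to a (cluster_label, traffic_volume)
    pair; label None marks a noise cell that joins no cluster."""
    volume, count = {}, {}
    for label, v in cells.values():
        if label is None:
            continue
        volume[label] = volume.get(label, 0) + v
        count[label] = count.get(label, 0) + 1
    return {label: volume[label] / count[label] for label in volume}

# Hypothetical toy grid: cluster A spans 2 cells, cluster B spans 1.
demo = {0: ("A", 80), 1: ("A", 40), 2: ("B", 60), 3: (None, 20)}
print(density_per_cluster(demo))  # {'A': 60.0, 'B': 60.0}
```

Unlike coverage, density rewards compact clusters: a small cluster over a high-volume region scores higher than a sprawling one over the same volume.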
Figure 18: Proportions of cluster sizes (balance) of dataset 2, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
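The balance pies of Figures 17 and 18 are simply cluster-size proportions, and LP's "absolute balance" means every slice is equal. A sketch with hypothetical labels (the helper names are ours):

```python
from collections import Counter

def balance_proportions(labels):
    """Proportion of cluster i = grid cells in cluster i / all clustered
    cells, i.e. the pie-chart slices of Figures 17 and 18."""
    counts = Counter(l for l in labels if l is not None)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def is_absolutely_balanced(labels):
    """LP-style absolute balance: every cluster holds the same share."""
    return len(set(balance_proportions(labels).values())) == 1

# Hypothetical LP-like assignment: five clusters of 20 cells each.
lp_like = ["A"] * 20 + ["B"] * 20 + ["C"] * 20 + ["D"] * 20 + ["E"] * 20
print(balance_proportions(lp_like)["A"])   # 0.2
print(is_absolutely_balanced(lp_like))     # True
```

For the clustering algorithms, the per-method "difference of balance" used later in Table 12 can be derived from how far these proportions deviate from one another.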
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure to decide whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density        KM       EM       DBScan   XM       HC       LP
Cluster 0      5258648  0080823  4426289  3431892  2713810  1677869
Cluster 1      1161390  2329182  0994949  1375497  3501739  1296230
Cluster 2      7186556  2545750  0807500  1218667  2728017  9703279
Cluster 3      2572683  1232386  1062069  5171040  4265905  9034426
Cluster 4      5969350  142054   0170455  1510576  4088438  1239180
Total density  1204343  1400359  4729787  1146972  1030703  6087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density        KM       XM          EM          DBScan      HC          LP
Cluster 0      1925445  2476642081  396813638   1972394643  5323785326  331318
Cluster 1      1972395  1763496208  1502698729  1972394643  2140482869  166788
Cluster 2      1408149  106489095   1629795665  1437189548  1823821619  8097989
Cluster 3      3060449  6293956697  2015105986  1636350955  79912225    2474492
Cluster 4      1773937  1058346213  1275299493  1212317249  6856982634  156958
Total density  3896873  3486653421  6819713511  8230647036  5981503534  5440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference of the grid-cell number in each cluster. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if a very wide coverage is of priority and the others are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered together.
$$G_l = \left|\frac{\text{Likelihood}}{\text{Time}}\right| \quad (5)$$

$$G_b = \frac{\text{Difference of Balance}}{\text{Time}} \quad (6)$$

$$G_d = \frac{\text{Density}}{\text{Time}} \quad (7)$$

$$G_c = \frac{\text{Coverage}}{\text{Time}} \quad (8)$$

$$G_o = \frac{\text{Overlap}}{\text{Time}} \quad (9)$$

$$G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o \quad (10)$$

$$\text{Constraint: } \omega_l + \omega_b + \omega_d + \omega_c + \omega_o = 1 \quad (11)$$
From the results of the spatial grouping experiments in the previous sections, we obtain statistics on each group based on the second dataset, expressed through the indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM and XM methods have the best run times and no overlap, while DBScan and HC demonstrate their advantage in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. In order to further verify the correctness of the above analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net of each method, assuming equal weights. For the sake of easy comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
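The equal-weight aggregation of (5) to (11) and the base-1 normalization behind Table 13 can be sketched as follows. Two details are our assumptions, not the article's: overlap enters as a 0/1 flag, and the likelihood term is dropped (contributes zero) where it is unavailable, as for LP; all input numbers in the usage are hypothetical:

```python
def g_net(coverage, density, time, likelihood, overlap, diff_balance,
          weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Combine indicators (5)-(9) into G_net per (10); equal weights by
    default, as in the article's Table 13 comparison."""
    w_l, w_b, w_d, w_c, w_o = weights
    g_l = abs(likelihood / time) if likelihood is not None else 0.0  # (5)
    g_b = diff_balance / time                                        # (6)
    g_d = density / time                                             # (7)
    g_c = coverage / time                                            # (8)
    g_o = (1.0 if overlap else 0.0) / time                           # (9)
    return w_l * g_l + w_b * g_b + w_d * g_d + w_c * g_c + w_o * g_o

def normalize(scores):
    """Scale so the lowest G_net becomes base value 1, as in Table 13."""
    base = min(scores.values())
    return {method: s / base for method, s in scores.items()}

# Hypothetical indicator values for two imaginary methods.
scores = {"m1": g_net(0.5, 10.0, 2.0, -20.0, True, 5.0),
          "m2": g_net(0.7, 8.0, 1.0, -15.0, False, 0.0)}
print(normalize(scores))
```

Per (11), any alternative weight vector must still sum to 1; raising ω_c toward 1, for instance, makes the ranking coverage-dominated.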
According to the experiment results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This has been tested across different datasets, different formats, and different sizes of dataset. However, for density and log-likelihood the results are not so consistent, as LP is outperformed by DBScan at times. Finally, by the net result of G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which imply priorities or preferences among the performance aspects, should be chosen at the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the values collected from the sensors indicate how important they are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage  Density   Time   Log-likelihood  Overlap  Diff of balance
KM       0.595751  3896873   0.41   -1735           No       190
XM       0.533037  3486653   0.67   -1722           No       185
EM       0.507794  6819714   1.23   -1657           Yes      1216
DBScan   0.461531  8230647   15.67  -1754           Yes      2517
HC       0.677124  5981504   14.78  -2013           Yes      103
LP       0.711025  5440447   7.76   N/A             No       0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods  KM    XM    EM    DBScan  HC    LP
G_net    1.08  1.15  1.11  1.23    1.00  1.32
purposes such as resource allocation, distribution evaluations, or summing up the geographical data into groups. The focus of this study was to design efficient methods for identifying such optimal spatial groups, with certain sizes and positions, using clustering algorithms or their equivalent, so as to obtain maximum total coverage. Some examples include, but are not limited to, setting up mobile-phone base stations among an even distribution of mobile-phone users, each of whom may have a different usage demand; distributed sensors that monitor the traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by the different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data mining software programs. The identified spatial groups with different values of data resources were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory); the performance under the chosen factors and weights may vary, as the factors can be arbitrarily chosen by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted or even false grouping. However, to the authors' knowledge, no study reported in the literature uses a linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
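The LP formulation itself is not reproduced in this chunk. As a toy illustration of the no-overlap requirement it enforces, the brute-force sketch below (a hypothetical helper on a tiny hypothetical grid, not the paper's solver, which scales far better) picks the k best mutually non-overlapping square groups by exhaustive search:

```python
import itertools

def best_nonoverlapping_groups(grid, k=2, w=2):
    """Exhaustively place k non-overlapping w-by-w square groups on a tiny
    traffic-volume grid so that the covered volume is maximized."""
    rows, cols = len(grid), len(grid[0])
    spots = [(r, c) for r in range(rows - w + 1) for c in range(cols - w + 1)]

    def overlaps(a, b):
        # Two w-by-w squares overlap iff both axis offsets are below w.
        return abs(a[0] - b[0]) < w and abs(a[1] - b[1]) < w

    def value(spot):
        r, c = spot
        return sum(grid[i][j] for i in range(r, r + w) for j in range(c, c + w))

    best, best_val = None, -1
    for combo in itertools.combinations(spots, k):
        if any(overlaps(a, b) for a, b in itertools.combinations(combo, 2)):
            continue
        total = sum(value(s) for s in combo)
        if total > best_val:
            best, best_val = combo, total
    return best, best_val

# Hypothetical 4x4 traffic-volume grid.
volumes = [[0, 1, 2, 3],
           [4, 5, 6, 7],
           [8, 9, 10, 11],
           [12, 13, 14, 15]]
print(best_nonoverlapping_groups(volumes))  # (((2, 0), (2, 2)), 92)
```

Exhaustive search is exponential in k, which is precisely why a linear-programming formulation of the same non-overlap constraint is attractive at realistic grid sizes.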
For future extended study, we want to further enhance the algorithm, for example by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding the optimal groupings without any overlap. Ideally, in the new fusion algorithms to be developed, the advantages of each constituent algorithm would carry over to the others.
For future extended study we want to further enhancethe algorithm such as combining LP method with existingspatial group algorithms to achieve new hybrid algorithmSome clustering algorithms (eg KM) are known to convergequickly and LP though not the quickest it is efficient infinding the optimal groupings without any overlap It will be
good if the advantages from one algorithm to ride over theothers in the new fusion algorithms are to be developed
References
[1] G J Pottie and W J Kaiser ldquoWireless integrated network sen-sorsrdquo Communications of the ACM vol 43 no 5 pp 51ndash582000
[2] K H Eom M C Kim S J Lee and C W Lee ldquoThe vegetablefreshness monitoring system using RFID with oxygen andcarbon dioxide sensorrdquo International Journal of DistributedSensor Networks vol 2012 Article ID 472986 6 pages 2012
[3] G Manes G Collodi R Fusco L Gelpi and A Manes ldquoAwireless sensor network for precise volatile organic compoundmonitoringrdquo International Journal of Distributed Sensor Net-works vol 2012 Article ID 820716 13 pages 2012
[4] Y-G Ha H Kim and Y-C Byun ldquoEnergy-efficient fire mon-itoring over cluster-based wireless sensor networksrdquo Interna-tional Journal of Distributed Sensor Networks vol 2012 ArticleID 460754 11 pages 2012
[5] A Wahid and D Kim ldquoAn energy efficient localization-freerouting protocol for underwater wireless sensor networksrdquoInternational Journal of Distributed Sensor Networks vol 2012Article ID 307246 11 pages 2012
[6] T N Tran R Wehrens and L M C Buydens ldquoSpaRef a clus-tering algorithm for multispectral imagesrdquo Analytica Chimi-ca Acta vol 490 no 1-2 pp 303ndash312 2003
[7] G Ayala I Epifanio A Simo and V Zapater ldquoClusteringof spatial point patternsrdquo Computational Statistics and DataAnalysis vol 50 no 4 pp 1016ndash1032 2006
[8] J Domingo G Ayala and M E Dıaz ldquoMorphometric analysisof human corneal endothelium by means of spatial point pat-ternsrdquo International Journal of Pattern Recognition and ArtificialIntelligence vol 16 no 2 pp 127ndash143 2002
[9] E Demir C Aykanat and B Barla Cambazoglu ldquoClusteringspatial networks for aggregate query processing a hypergraphapproachrdquo Information Systems vol 33 no 1 pp 1ndash17 2008
[10] T Hu and S Y Sung ldquoA hybrid EM approach to spatial clus-teringrdquo Computational Statistics and Data Analysis vol 50 no5 pp 1188ndash1205 2006
[11] G Lin ldquoComparing spatial clustering tests based on rare tocommon spatial eventsrdquo Computers Environment and UrbanSystems vol 28 no 6 pp 691ndash699 2004
[12] M Ester and H-P Kriegel ldquoClustering for mining in largespatial databases [Special Issue on Data Mining]rdquo KI-Journalvol 1 pp 332ndash338 1998
[13] J Han M Kamber and A K H Tung ldquoSpatial clusteringmethods in data mining a surveyrdquo Tech Rep ComputerScience Simon Fraster University 2000
20 International Journal of Distributed Sensor Networks
[14] H-D Yang and F-Q Deng ldquoThe study on immune spatialclustering model based on obstaclerdquo in Proceedings of theInternational Conference on Machine Learning and Cyberneticsvol 2 pp 1214ndash1219 August 2004
[15] T-S Chen T-H Tsai Y-T Chen et al ldquoA combined K-meansand hierarchical clusteringmethod for improving the clusteringefficiency of microarrayrdquo in Proceedings of the InternationalSymposium on Intelligent Signal Processing and CommunicationSystems (ISPACS rsquo05) pp 405ndash408 HongKong China Decem-ber 2005
[16] M Srinivas and C K Mohan ldquoEfficient clustering approachusing incremental and hierarchical clustering methodsrdquo inProceedings of the International Joint Conference on NeuralNetworks (IJCNN rsquo10) pp 1ndash7 July 2010
[17] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998
[18] A Hinneburg and D A Keim ldquoAn efficient approach to clus-tering in large multimedia databases with noiserdquo in Proceedingsof the International Conference Knowledge Discovery and DataMining pp 58ndash65 1998
[19] K Elangovan GIS Fundamentals Applications and Implemen-tations 2006
[20] S Chawla and S Shekhar ldquoModeling spatial dependencies formining geospatial data an introductionrdquo Geographic DataMining and Knowledge Discovery vol 75 no 6 pp 112ndash1201999
[21] M-Y Cheng and G-L Chang ldquoAutomating utility route designand planning throughGISrdquoAutomation in Construction vol 10no 4 pp 507ndash516 2001
[22] Q Cao B Bouqata P D Mackenzie D Messier and J J SalvoldquoA grid-based clusteringmethod formining frequent trips fromlarge-scale event-based telematics datasetsrdquo in Proceedingsof the IEEE International Conference on Systems Man andCybernetics (SMC rsquo09) pp 2996ndash3001 San Antonio Tex USAOctober 2009
[23] K Krishna and M N Murty ldquoGenetic K-means algorithmrdquoIEEE Transactions on Systems Man and Cybernetics B vol 29no 3 pp 433ndash439 1999
[24] D Pelleg and A W Moore ldquoX-means extending KM withefficient estimation of the number of clustersrdquo in Proceedingsof the 70th International Conference on Machine Learning pp727ndash734 2000
[25] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining pp 226ndash231 1996
[26] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998
[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963
[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006
International Journal of
AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Active and Passive Electronic Components
Control Scienceand Engineering
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
RotatingMachinery
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
Journal ofEngineeringVolume 2014
Submit your manuscripts athttpwwwhindawicom
VLSI Design
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Shock and Vibration
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Civil EngineeringAdvances in
Acoustics and VibrationAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Advances inOptoElectronics
Hindawi Publishing Corporation httpwwwhindawicom
Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
SensorsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Navigation and Observation
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
DistributedSensor Networks
International Journal of
traffic volumes that are covered by all the clusters, minus the overlap, if any. The corresponding definitions are given in the equations below:
$$\text{Density}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\text{Grid Cell Number}(\text{cluster } i)}$$

$$\text{Coverage}(\text{cluster } i) = \frac{\sum \text{Traffic Volumes}(\text{cluster } i)}{\sum \text{Grid Cell Number}}$$

$$\text{Total Coverage} = \sum \text{Traffic Volumes} - \text{Overlaps}$$

$$\text{Proportion of Cluster } i \text{ Size (Balance)} = \frac{\text{Grid Cell Number}(\text{cluster } i)}{\sum \text{Grid Cell Number}} \tag{4}$$
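The four metrics in (4) can be sketched in a few lines of Python. This is an illustrative implementation under the assumption that each cluster is represented as a list of (grid_cell, traffic_volume) pairs; the function and variable names are ours, not the paper's.

```python
# Sketch of the metrics in (4). A cluster is assumed to be a list of
# (grid_cell, traffic_volume) pairs; names here are illustrative.

def density(cluster):
    """Sum of the cluster's traffic volumes divided by its cell count."""
    return sum(v for _, v in cluster) / len(cluster)

def coverage(cluster, total_cells):
    """Cluster's traffic volume relative to the whole grid's cell count."""
    return sum(v for _, v in cluster) / total_cells

def total_coverage(clusters):
    """Total traffic captured by all clusters, counting shared cells once
    (i.e., subtracting the overlaps, as in (4))."""
    seen, total = set(), 0.0
    for cluster in clusters:
        for cell, v in cluster:
            if cell not in seen:
                seen.add(cell)
                total += v
    return total

def balance(cluster, total_cells):
    """Proportion of grid cells belonging to this cluster."""
    return len(cluster) / total_cells
```

With two clusters sharing one cell, `total_coverage` counts the shared cell's volume only once, matching the "minus overlaps" term.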
6.2. Comparison Experimental Result. After conducting a number of experiment runs, we selected four different formats of datasets on which to run the clustering algorithms for the first dataset. Vector (n, v) represents sequence n and traffic volume v; Raster (x, y, v) represents coordinates (x, y) and traffic volume v; RasterP (16 grids) means every four neighborhood cells over a grid are merged into a single unit; and RasterP (25 grids) means every five neighborhood cells over a grid are merged into one. In the latter two formats, the data are laid directly onto a grid, and noise such as outlier values is eliminated from the grid; we selected grids of sizes 16 and 25 for these two formats. The original datasets were then encoded in the four data formats, and the formatted data were subjected to the five clustering methods and the LP method. We measured the corresponding running time and log-likelihood; the results of the two measurements are shown in Tables 3 and 4, respectively.
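The RasterP preprocessing described above can be sketched as a block-merge over the grid. The exact merge pattern is not fully specified in the text, so this sketch assumes square k-by-k blocks whose volumes are summed; the function name `rasterp` is our own.

```python
# Illustrative sketch of RasterP preprocessing: merge each k x k block of
# neighboring grid cells into a single unit whose volume is the sum of its
# cells. The square-block assumption and the name `rasterp` are ours.

def rasterp(grid, k):
    """Merge each k x k block of `grid` (a list of rows) into one cell."""
    rows, cols = len(grid), len(grid[0])
    merged = []
    for r in range(0, rows, k):
        row = []
        for c in range(0, cols, k):
            block = [grid[i][j]
                     for i in range(r, min(r + k, rows))
                     for j in range(c, min(c + k, cols))]
            row.append(sum(block))  # merged unit carries the block's volume
        merged.append(row)
    return merged
```

A 4x4 grid of ones merged with k = 2 becomes a 2x2 grid of fours, so the total volume is preserved while the number of units shrinks.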
According to Table 3, KM spent the least running time on all four kinds of data, and the RasterP (25 grids) dataset was processed the fastest. Conversely, clustering the vector dataset with the DBScan method took the longest running time. Across the datasets, KM was consistently the fastest of the clustering methods and DBScan the slowest.
In Table 4 we evaluate the log-likelihood of the clusters found by each method, a principal metric for quantifying cluster quality. The table shows that the log-likelihood values of the five methods are quite similar. Among them, clustering the Raster dataset with the HC method gives the best value, while clustering RasterP (25 grids) with DBScan gives the worst.
Under the same experimental environment, the running time and log-likelihood for the second dataset are shown in Tables 5 and 6. In addition, to stress-test performance, we enlarged the dataset to larger sizes by duplicating the data map. The resulting running-time trends are reported in Table 7, and the corresponding trend lines are shown in Figure 14.
According to Table 5, KM spent the shortest running time on all four data formats, and the RasterP (25 grids) dataset was the fastest, which is expected because it abstracts every 25 cells into one. On the other hand, clustering the Raster dataset with the DBScan method took the most running time. Across the six methods, KM generally spent the shortest time on the different datasets and DBScan the longest.

[Figure 14: Comparison of running time (in seconds) for different sizes of dataset, with exponential trend lines fitted for K-means, XMeans, EM, Hierarchical, DBScan, and LP.]
In Table 6 we can see that the log-likelihood values of the six methods are quite similar. Among them, clustering the Raster dataset with the HC method is the best, while clustering RasterP (25 grids) with KM is the worst.
In Table 7 we can see that DBScan is the slowest and KM the quickest. In terms of time trends, DBScan's time consumption grows in far larger magnitude than the other methods', whereas the trend lines of LP, KM, and XM have lower gradients. In particular, the trend lines of HC and EM intersect: once the dataset size exceeds the amount at the intersection, the EM method becomes a better choice than HC.
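The exponential trend lines of Figure 14, and the crossover point between two of them, can be reproduced with a standard log-linear least-squares fit. This is a sketch of the general technique, not the paper's exact fitting procedure, and the sample values in the test are synthetic.

```python
import math

# Fit an exponential trend y = a * exp(b * x) to (size, runtime) samples by
# ordinary least squares on log(y), then locate where two trends cross.

def fit_exp(xs, ys):
    """Least-squares fit of log(y) = log(a) + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    mly = sum(math.log(y) for y in ys) / n
    b = (sum((x - mx) * (math.log(y) - mly) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(mly - b * mx)
    return a, b

def crossover(fit1, fit2):
    """Dataset size where trend 1 meets trend 2 (slopes must differ):
    a1*exp(b1*x) = a2*exp(b2*x)  =>  x = (ln a2 - ln a1) / (b1 - b2)."""
    (a1, b1), (a2, b2) = fit1, fit2
    return (math.log(a2) - math.log(a1)) / (b1 - b2)
```

A method with a steeper exponent (like DBScan here) eventually overtakes one with a larger constant but flatter exponent, which is exactly the HC/EM intersection observed in Table 7.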
The following charts and tables present the other technical indicators, such as the coverage, density, and balance of each cluster, for the two datasets.
From Figure 15 we can see that, for the first dataset, one cluster produced by DBScan has the largest coverage among all the clusters from the six methods, whereas for the second dataset the LP method yields the cluster with the largest coverage. In general, the individual coverage of each cluster in the second dataset is noticeably larger than in the first dataset (Tables 8 and 9); this means that the second dataset, with its even data distribution, is well suited to forming spatial groups with the six methods. In terms of total coverage, LP achieves the highest values on both datasets. In summary, LP is by far the most effective method for determining spatial groups with the best coverage.
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2          KM        EM        DBScan    XM        HC        LP
Cluster 0        0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1        0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2        0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3        0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4        0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage   0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
[Figure 15: (a) Coverage of each cluster by using the six methods for dataset 1. (b) Coverage of each cluster by using the six methods for dataset 2.]
[Figure 16: (a) Density of each cluster by using the six methods for dataset 1. (b) Density of each cluster by using the six methods for dataset 2.]
[Figure 17: Proportions of cluster sizes (balance) of dataset 1, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.]
From Figure 16(a) we can see that, in the first dataset, one cluster produced by EM has the largest density among all the clusters from the six methods, but the LP method obtains the largest total density, drawn evenly from all of its clusters. In general, the individual density of each cluster in the second dataset is much higher than in the first dataset (Tables 10 and 11). Again, this means that the second dataset's even data distribution is suitable for achieving spatial groups of high density. In terms of total density, EM is the best performer on the first dataset, but DBScan achieves the best result on the second; DBScan has the advantage of merging scattered data into dense groups, as long as the data are well scattered.
[Figure 18: Proportions of cluster sizes (balance) of dataset 2, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.]
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density         KM        EM        DBScan    XM        HC        LP
Cluster 0       5258648   0080823   4426289   3431892   2713810   1677869
Cluster 1       1161390   2329182   0994949   1375497   3501739   1296230
Cluster 2       7186556   2545750   0807500   1218667   2728017   9703279
Cluster 3       2572683   1232386   1062069   5171040   4265905   9034426
Cluster 4       5969350   142054    0170455   1510576   4088438   1239180
Total density   1204343   1400359   4729787   1146972   1030703   6087049

Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density         KM        XM           EM           DBScan       HC           LP
Cluster 0       1925445   2476642081   396813638    1972394643   5323785326   331318
Cluster 1       1972395   1763496208   1502698729   1972394643   2140482869   166788
Cluster 2       1408149   106489095    1629795665   1437189548   1823821619   8097989
Cluster 3       3060449   6293956697   2015105986   1636350955   79912225     2474492
Cluster 4       1773937   1058346213   1275299493   1212317249   6856982634   156958
Total density   3896873   3486653421   6819713511   8230647036   5981503534   5440447

6.3. Discussion of G_net. Each of the six evaluation factors can serve on its own as a measure of whether a method is good in a certain aspect. In general, the indicators in (5) to (11) below have been defined to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference of balance is contributed by the difference in grid cell number between the clusters. Each factor is assigned a proportional weight ω that adjusts the evaluation result G_net; the ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the other aspects are of little concern, the weight ω_c can take a relatively large value, or even 1. If users consider certain attributes more important, the corresponding weights ω of those factors can be made larger than the others. Overall, G_net, the sum of all performance indicators multiplied by their corresponding weights, is a net indicator signifying how good a clustering process is when all the performance attributes are considered.
$$G_l = \left|\frac{\text{Likelihood}}{\text{Time}}\right| \tag{5}$$

$$G_b = \frac{\text{Difference of Balance}}{\text{Time}} \tag{6}$$

$$G_d = \frac{\text{Density}}{\text{Time}} \tag{7}$$

$$G_c = \frac{\text{Coverage}}{\text{Time}} \tag{8}$$

$$G_o = \frac{\text{Overlap}}{\text{Time}} \tag{9}$$

$$G_{\text{net}} = \omega_l G_l + \omega_b G_b + \omega_d G_d + \omega_c G_c + \omega_o G_o \tag{10}$$

$$\text{Constraint: } \omega_l + \omega_b + \omega_d + \omega_c + \omega_o = 1. \tag{11}$$
From the results of the spatial grouping experiments in the previous sections, we obtain statistics for each group on the second dataset in the form of the indicators defined in (5) to (11). They are shown in Table 12, which allows us to compare the various methods and performance aspects easily.
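The aggregation in (10) and the base-1 normalization used for Table 13 can be sketched as follows. The indicator values in the test are placeholders, not the paper's measured numbers; only the weighted-sum and normalization logic follows the definitions above.

```python
# Sketch of G_net from (10)-(11) with the base-1 normalization used for
# Table 13. Indicator values fed in are assumed to already be the
# time-normalized quantities G_l, G_b, G_d, G_c, G_o from (5)-(9).

KEYS = ("l", "b", "d", "c", "o")

def g_net(indicators, weights=None):
    """Weighted sum of the five performance indicators; weights sum to 1."""
    if weights is None:                      # equal weights by default
        weights = {k: 1 / len(KEYS) for k in KEYS}
    assert abs(sum(weights.values()) - 1) < 1e-9   # constraint (11)
    return sum(weights[k] * indicators[k] for k in KEYS)

def normalize(scores):
    """Scale method scores so the lowest G_net becomes the base value 1."""
    base = min(scores.values())
    return {method: s / base for method, s in scores.items()}
```

Setting ω_c close to 1 makes G_net track coverage almost exclusively, which is the "coverage is the priority" case discussed above.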
In Table 12, the KM method has the best run time and no overlap, while the XM method, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance between clusters. To further verify the correctness of this analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o were computed to obtain the net performance value G_net of each method, assuming equal weights. For ease of comparison, G_net is normalized by first setting the lowest G_net among the six methods to the base value 1 and then scaling up the G_net of the other methods accordingly. The comparison result is shown in Table 13.
According to the experimental results conducted so far, LP appears to be the best candidate in almost all aspects, such as coverage and balance. This was tested across different datasets, different formats, and different dataset sizes. For density and log-likelihood, however, the results are less consistent, as LP is at times outperformed by DBScan. Finally, by the net result G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which express priorities or preferences over the performance aspects, should be chosen at the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the locations of the data form densities spatially distributed over a terrain, and the values collected from the sensors indicate how important the data are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density   Time (s)   Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3896873   0.41       −1735            No        190
XM       0.533037   3486653   0.67       −1722            No        185
EM       0.507794   6819714   1.23       −1657            Yes       1216
DBScan   0.461531   8230647   15.67      −1754            Yes       2517
HC       0.677124   5981504   14.78      −2013            Yes       103
LP       0.711025   5440447   7.76       N/A              No        0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32
purposes such as resource allocation, distribution evaluations, or summarizing the geographical data into groups. The focus of this study was to design efficient methods for identifying such optimal spatial groups, of certain sizes and positions, using clustering algorithms or their equivalent, so as to obtain the maximum total coverage. Examples include, but are not limited to, siting mobile phone base stations among an even distribution of mobile phone users whose usage demands differ; placing distributed sensors that monitor the traffic volumes over a city; and planning security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by the different methods are sufficiently efficient to achieve optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study using data mining software programs. The identified spatial groups, with their different values of data resources, were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory); the overall performance under particular factors and weights may vary, as the factors can be chosen arbitrarily by users.
The spatial groups obtained by classic clustering algorithms have some limitations, such as overlaps, which may waste resources and even cause false grouping. To the authors' knowledge, however, no study reported in the literature has used a linear programming (LP) method to discover spatial groups and thereby overcome this limitation of overlapping. Thus, in this research, we implemented this new LP method to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation considering multiple attributes was used to assess the grouping results.
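The overlap-free objective that the LP method optimizes can be illustrated in miniature by an exhaustive search that places k non-overlapping fixed-size square groups on a small volume grid to maximize the covered volume. This brute-force toy only mimics the objective; it is not the paper's actual linear program, and all names here are ours.

```python
from itertools import combinations

# Toy stand-in for overlap-free spatial grouping: choose k fixed-size
# square windows on a small grid so that no two windows share a cell and
# the covered traffic volume is maximal. Brute force is only feasible on
# tiny grids; the paper's LP formulation scales, this sketch does not.

def best_groups_volume(grid, k, size):
    """Return the maximum volume covered by k disjoint size x size windows."""
    rows, cols = len(grid), len(grid[0])
    windows = []                      # (cell set, volume) per candidate spot
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            cells = {(r + i, c + j) for i in range(size) for j in range(size)}
            windows.append((cells, sum(grid[i][j] for i, j in cells)))
    best_vol = -1.0
    for combo in combinations(windows, k):
        covered = set().union(*(cells for cells, _ in combo))
        if len(covered) != k * size * size:   # some windows overlap; skip
            continue
        best_vol = max(best_vol, sum(v for _, v in combo))
    return best_vol
```

Note how the greedy pick (the single densest window) can be suboptimal: two disjoint medium-volume windows may cover more than the best window plus any window disjoint from it, which is why a global optimization such as LP is used.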
For future extended study, we want to enhance the algorithm further, for example, by combining the LP method with existing spatial grouping algorithms to obtain new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient at finding the optimal groupings without any overlap. It would be desirable if the new fusion algorithms to be developed could carry the advantages of one algorithm over to the others.
International Journal of Distributed Sensor Networks 15
Table 9: Numeric results of coverage of each cluster by using the six methods for dataset 2.

Cov-db2          KM        EM        DBScan    XM        HC        LP
Cluster 0        0.042721  0.001777  0.450720  0.022150  0.013153  0.165305
Cluster 1        0.094175  0.086211  0.008018  0.010064  0.026016  0.127705
Cluster 2        0.328026  0.032893  0.010517  0.126953  0.124360  0.095597
Cluster 3        0.022797  0.351221  0.000501  0.311761  0.001172  0.089008
Cluster 4        0.062281  0.101199  0.000244  0.112973  0.304300  0.122085
Total coverage   0.550000  0.573301  0.470000  0.583900  0.469000  0.599700
Figure 15: Coverage of each cluster (y-axis: coverage) by using the six methods for (a) dataset 1 and (b) dataset 2.
Figure 16: Density of each cluster (y-axis: density) by using the six methods for (a) dataset 1 and (b) dataset 2.
Figure 17: Proportions of cluster sizes (balance) of dataset 1, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
From Figure 16(a) we can see that one cluster produced by EM has the highest density among all clusters of the six methods in the first dataset, but the LP method obtains the largest total density, drawn evenly from all the clusters. Generally, the individual density of each cluster in the second dataset is much higher than that in the first dataset (Tables 10 and 11). Again, this means that the second dataset has an even data distribution that is suitable for achieving spatial groups with high density. In terms of total density, EM is the best performer in the first dataset, but DBScan achieves the best result in the second dataset: DBScan has an advantage in merging scattered data into dense groups as long as the data are well scattered.
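DBScan's behaviour of merging scattered readings into dense groups while labeling isolated readings as noise can be illustrated with a small, self-contained sketch. This is a minimal pure-Python DBSCAN; the sensor coordinates and the `eps`/`min_samples` values are illustrative only, not the paper's settings:

```python
import math

def dbscan(points, eps=1.5, min_samples=3):
    """Minimal DBSCAN: grows clusters from core points; -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1              # provisionally noise (may become a border point)
            continue
        cluster += 1                    # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # claim a border point, but do not expand it
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_samples:  # j is itself a core point: keep expanding
                queue.extend(more)
    return labels

# Two dense patches of sensors and one isolated reading.
sensors = [(0, 0), (0, 1), (1, 0), (1, 1),
           (8, 8), (8, 9), (9, 8), (9, 9),
           (20, 20)]
print(dbscan(sensors))  # -> [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two patches are merged into two dense groups, and the isolated reading at (20, 20) is left out as noise, which mirrors why DBScan's total density benefits from well-scattered data.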
Figure 18: Proportions of cluster sizes (balance) of dataset 2, in %, by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, and (f) LP.
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves absolute balance across the spatial groups.
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method is good in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density         KM        EM        DBScan    XM        HC        LP
Cluster 0       5258648   0080823   4426289   3431892   2713810   1677869
Cluster 1       1161390   2329182   0994949   1375497   3501739   1296230
Cluster 2       7186556   2545750   0807500   1218667   2728017   9703279
Cluster 3       2572683   1232386   1062069   5171040   4265905   9034426
Cluster 4       5969350   142054    0170455   1510576   4088438   1239180
Total density   1204343   1400359   4729787   1146972   1030703   6087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density         KM        XM           EM           DBScan       HC           LP
Cluster 0       1925445   2476642081   396813638    1972394643   5323785326   331318
Cluster 1       1972395   1763496208   1502698729   1972394643   2140482869   166788
Cluster 2       1408149   106489095    1629795665   1437189548   1823821619   8097989
Cluster 3       3060449   6293956697   2015105986   1636350955   79912225     2474492
Cluster 4       1773937   1058346213   1275299493   1212317249   6856982634   156958
Total density   3896873   3486653421   6819713511   8230647036   5981503534   5440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in grid-cell count between clusters. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if very wide coverage is the priority and the other factors are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of all factors multiplied by the corresponding performance indicators, is a net indicator signifying how good a clustering process is when all the performance attributes are considered:
G_l = |Likelihood / Time|,  (5)
G_b = (Difference of Balance) / Time,  (6)
G_d = Density / Time,  (7)
G_c = Coverage / Time,  (8)
G_o = Overlap / Time,  (9)
G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o,  (10)
subject to the constraint ω_l + ω_b + ω_d + ω_c + ω_o = 1.  (11)
From the results of the spatial grouping experiments in the previous sections, we obtain statistics on each group based on the second dataset, expressed as the indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods across performance aspects.
In Table 12, the KM method has the best run time and no overlap; XM, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance among clusters. To further verify this analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For ease of comparison, G_net is normalized by setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
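The weighting and base-1 normalization step can be sketched in a few lines. The per-factor indicator values below are made up purely for illustration; only the equal-weight combination of (10)–(11) and the "lowest G_net becomes 1" scaling follow the procedure described here:

```python
# Illustrative per-method indicators; each entry stands for a G_x ratio
# as in Eqs. (5)-(9).  These numbers are NOT the paper's measurements.
indicators = {
    "KM": {"l": 0.9, "b": 0.2, "d": 0.5, "c": 0.6, "o": 1.0},
    "HC": {"l": 0.4, "b": 0.3, "d": 0.7, "c": 0.5, "o": 0.6},
    "LP": {"l": 0.5, "b": 1.0, "d": 0.6, "c": 0.9, "o": 1.0},
}
weights = {"l": 0.2, "b": 0.2, "d": 0.2, "c": 0.2, "o": 0.2}  # equal, sum to 1 (Eq. 11)

# Weighted sum per method, Eq. (10).
g_net = {m: sum(weights[k] * vals[k] for k in weights)
         for m, vals in indicators.items()}

# Normalize: the lowest G_net is set to 1, others scale up accordingly.
base = min(g_net.values())
scaled = {m: round(g / base, 2) for m, g in g_net.items()}
print(scaled)  # -> {'KM': 1.28, 'HC': 1.0, 'LP': 1.6}
```

With the paper's measured indicators in place of these illustrative ones, the same computation yields the normalized G_net column of Table 13.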
According to the experimental results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This was tested across datasets of different formats and sizes. However, for density and log-likelihood the results are less consistent, as LP is outperformed by DBScan at times. Finally, by the net result G_net, LP is the better choice under an overall consideration of the six performance factors. The choice of weights, which express priorities or preferences over the performance aspects, is left to the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the locations of the data form densities spatially distributed over a terrain, and the values collected from the sensors indicate how important they are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density   Time    Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3896873   0.41    -17.35           No        1.90
XM       0.533037   3486653   0.67    -17.22           No        1.85
EM       0.507794   6819714   1.23    -16.57           Yes       12.16
DBScan   0.461531   8230647   15.67   -17.54           Yes       25.17
HC       0.677124   5981504   14.78   -20.13           Yes       1.03
LP       0.711025   5440447   7.76    N/A              No        0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32
purposes such as resource allocation, distribution evaluations, or summarizing the geographical data into groups. The focus of this study was to design efficient methods to identify optimal spatial groups of certain sizes and positions, using clustering algorithms or their equivalents, so as to obtain maximum total coverage. Examples include, but are not limited to, setting up mobile-phone base stations among an even distribution of mobile-phone users, each of whom may have a different usage demand; distributing sensors that monitor the traffic volumes over a city; and planning security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared in this study by using data-mining software programs. The identified spatial groups, with different values of data resources, were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be chosen at the users' discretion.
The spatial groups obtained by classic clustering algorithms have some limitations, such as overlaps, which may cause resources to be wasted and may even lead to false grouping. To the authors' knowledge, however, no study reported in the literature has used a linear programming (LP) method to discover spatial groups and to overcome this limitation of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
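The no-overlap, maximum-coverage objective can be made concrete on a toy instance. The sketch below brute-forces the optimum for a tiny one-dimensional grid: pick k pairwise-disjoint candidate groups of cells so that the total covered value is maximized. All numbers and group shapes are illustrative; in practice a selection like this would be posed to an LP/ILP solver rather than enumerated:

```python
from itertools import combinations

# Toy grid: a value (e.g., aggregated sensor reading) per cell, and
# candidate 2-cell groups given as frozensets of cell indices.
values = [5, 1, 8, 3, 7, 2]
groups = [frozenset(g) for g in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]]

def best_disjoint(groups, values, k=2):
    """Exhaustively pick k pairwise-disjoint groups with maximum total value."""
    best, best_score = None, -1
    for combo in combinations(groups, k):
        cells = set().union(*combo)
        if len(cells) < sum(len(g) for g in combo):
            continue                      # groups overlap -> infeasible selection
        score = sum(values[c] for c in cells)
        if score > best_score:
            best, best_score = combo, score
    return best, best_score

chosen, covered = best_disjoint(groups, values, k=2)
print(covered)  # -> 20, achieved by the disjoint groups {2,3} and {4,5}
```

The disjointness test plays the role of the LP formulation's no-overlap constraint, and the score plays the role of its coverage objective; brute force is only viable because the instance is tiny.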
For future work, we want to further enhance the algorithm, for example by combining the LP method with existing spatial grouping algorithms to create new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. It would be good if, in the fusion algorithms to be developed, the advantages of one algorithm could carry over to the others.
[24] D Pelleg and A W Moore ldquoX-means extending KM withefficient estimation of the number of clustersrdquo in Proceedingsof the 70th International Conference on Machine Learning pp727ndash734 2000
[25] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining pp 226ndash231 1996
[26] P Bajcsy and N Ahuja ldquoLocation- and density-based hierar-chical clustering using similarity analysisrdquo IEEETransactions onPatternAnalysis andMachine Intelligence vol 20 no 9 pp 1011ndash1015 1998
[27] J H Ward Jr ldquoHierarchical grouping to optimize an objectivefunctionrdquo Journal of the American Statistical Association vol 58pp 236ndash244 1963
[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006
International Journal of
AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Active and Passive Electronic Components
Control Scienceand Engineering
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
RotatingMachinery
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
Journal ofEngineeringVolume 2014
Submit your manuscripts athttpwwwhindawicom
VLSI Design
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Shock and Vibration
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Civil EngineeringAdvances in
Acoustics and VibrationAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Advances inOptoElectronics
Hindawi Publishing Corporation httpwwwhindawicom
Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
SensorsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Navigation and Observation
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
DistributedSensor Networks
International Journal of
Figure 18: Proportions of Cluster Sizes (Balance) of dataset 2 by using (a) KM, (b) XM, (c) EM, (d) DBScan, (e) HC, (f) LP. (In the LP panel all five clusters hold 20 grid cells each, while the clustering panels show unequal per-cluster counts.)
The last evaluation factor is balance; the results are shown in Figures 17 and 18. For both datasets, only the LP method achieves completely balanced spatial groups.
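The paper does not give an explicit formula for the balance test. One plausible reading, assuming "balance" refers to the spread of per-cluster grid-cell counts (the quantity plotted in Figure 18), can be sketched as follows; the function name and the sample size vectors are illustrative, not taken from the paper:

```python
def balance_difference(cluster_sizes):
    """Spread between the largest and the smallest cluster, in grid cells.
    Zero means the grouping is perfectly balanced."""
    return max(cluster_sizes) - min(cluster_sizes)

lp_sizes = [20, 20, 20, 20, 20]   # LP: five equal spatial groups
km_sizes = [17, 18, 17, 24, 24]   # an unequal clustering-style grouping

print(balance_difference(lp_sizes))  # 0 -> absolutely balanced
print(balance_difference(km_sizes))  # 7
```

Under this reading, LP's equal-sized groups score 0 by construction, which is consistent with the zero "difference of balance" reported for LP in Table 12.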
6.3. Discussion of G_net. Each of the six evaluation factors can serve as an individual measure of whether a method performs well in a certain aspect. In general, the following indicators ((5) to (11)) have been defined in
Table 10: Numeric results of density of each cluster by using the six methods for dataset 1.

Density         KM        EM        DBScan    XM        HC        LP
Cluster 0       5258648   0.080823  4426289   3431892   2713810   1677869
Cluster 1       1161390   2329182   0.994949  1375497   3501739   1296230
Cluster 2       7186556   2545750   0.807500  1218667   2728017   9703279
Cluster 3       2572683   1232386   1062069   5171040   4265905   9034426
Cluster 4       5969350   142054    0.170455  1510576   4088438   1239180
Total density   1204343   1400359   4729787   1146972   1030703   6087049
Table 11: Numeric results of density of each cluster by using the six methods for dataset 2.

Density         KM        XM           EM           DBScan       HC           LP
Cluster 0       1925445   2476642081   396813638    1972394643   5323785326   331318
Cluster 1       1972395   1763496208   1502698729   1972394643   2140482869   166788
Cluster 2       1408149   106489095    1629795665   1437189548   1823821619   8097989
Cluster 3       3060449   6293956697   2015105986   1636350955   79912225     2474492
Cluster 4       1773937   1058346213   1275299493   1212317249   6856982634   156958
Total density   3896873   3486653421   6819713511   8230647036   5981503534   5440447
order to evaluate which method is an appropriate choice for different datasets and different users' requirements. Among them, the difference in balance is contributed by the difference in the number of grid cells in each cluster. Meanwhile, we assign each factor a proportional weight ω to adjust the evaluation result G_net. The ω values are to be tuned by the users depending on their interests. For example, if a very wide coverage is the priority and the other factors are of less concern, ω_c can take a relatively large value, or even 1. If users consider some attributes more important, the corresponding weights ω for those factors can be larger than the others. Overall, G_net, the sum of all weights multiplied by the corresponding performance indicators, is a net indicator signifying how good a clustering process is when all the performance attributes are considered.
G_l = |Likelihood / Time|  (5)

G_b = Difference of Balance / Time  (6)

G_d = Density / Time  (7)

G_c = Coverage / Time  (8)

G_o = Overlap / Time  (9)

G_net = ω_l G_l + ω_b G_b + ω_d G_d + ω_c G_c + ω_o G_o  (10)

Constraint: ω_l + ω_b + ω_d + ω_c + ω_o = 1  (11)
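Equations (5) to (11) translate directly into a small scoring routine. The sketch below is illustrative only: the function names are invented, "overlap" is treated as a numeric 0/1 quantity, and the sample values are made-up numbers of plausible magnitude, not the paper's measurements:

```python
def indicators(likelihood, balance_diff, density, coverage, overlap, time):
    """Per-method indicators (5)-(9): each factor is divided by run time,
    so faster methods score higher on every axis."""
    return {
        "G_l": abs(likelihood / time),   # (5) magnitude of log-likelihood per unit time
        "G_b": balance_diff / time,      # (6)
        "G_d": density / time,           # (7)
        "G_c": coverage / time,          # (8)
        "G_o": overlap / time,           # (9)
    }

def g_net(ind, weights):
    """Weighted sum (10); the weights must satisfy constraint (11)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    pairs = [("w_l", "G_l"), ("w_b", "G_b"), ("w_d", "G_d"),
             ("w_c", "G_c"), ("w_o", "G_o")]
    return sum(weights[w] * ind[g] for w, g in pairs)

# Equal weights, as assumed in the experiment:
w = {name: 0.2 for name in ("w_l", "w_b", "w_d", "w_c", "w_o")}
ind = indicators(likelihood=-17.35, balance_diff=1.90, density=3.90,
                 coverage=0.595751, overlap=0.0, time=0.41)
score = g_net(ind, w)
```

A user who prioritizes coverage would simply raise `w_c` at the expense of the other weights, keeping the sum at 1 per constraint (11).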
From the results of the spatial grouping experiments in the previous sections, we obtain statistical information on each group, based on the second dataset, as the range of indicators defined in (5) to (11). They are shown in Table 12, which allows us to easily compare the various methods and performance aspects.
In Table 12, the KM method has the best run time and no overlap. The XM method, DBScan, and HC demonstrate their advantages in density and log-likelihood. Nevertheless, the LP method is superior in three aspects: coverage, no overlap, and zero difference of balance across clusters. To further verify this analysis, the performance indicators G_l, G_b, G_d, G_c, and G_o are computed to obtain the net performance value G_net for each method, assuming equal weights. For ease of comparison, G_net is normalized by first setting the lowest G_net among the six methods as base value 1; the G_net of the other methods is then scaled up accordingly. The comparison result is shown in Table 13.
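The normalization just described (lowest G_net mapped to base value 1) is a division by the minimum score. A small sketch, using hypothetical raw G_net values chosen only for illustration:

```python
def normalize_gnet(raw_scores):
    """Scale the scores so that the lowest G_net maps to exactly 1.0,
    as done for Table 13; the other methods scale up proportionally."""
    base = min(raw_scores.values())
    return {method: score / base for method, score in raw_scores.items()}

# Hypothetical raw G_net values (not the paper's); only the ratios matter.
raw = {"KM": 5.40, "XM": 5.75, "EM": 5.55, "DBScan": 6.15, "HC": 5.00, "LP": 6.60}
norm = normalize_gnet(raw)
# The lowest method (here HC) maps to 1.0; LP maps to 6.60 / 5.00 = 1.32
```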
According to the experimental results conducted so far, LP seems to be the best candidate in almost all aspects, such as coverage and balance. This holds across different datasets, formats, and dataset sizes. However, for density and log-likelihood the results are less consistent, as LP is outperformed by DBScan at times. Finally, by the net result G_net, LP is the better choice under an overall consideration of the six performance factors. The weights, which express priorities or preferences over the performance aspects, should be chosen at the user's discretion.
7. Conclusion and Future Works
Ubiquitous sensor networks generate data that inherently carry spatial information. Viewed from afar, the localizations of the data form densities spatially distributed over a terrain, and the values collected from the sensors indicate how important they are in their local proximity. Given this information, the users of the sensor network may subsequently want to form spatial clusters for
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density   Time    Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3896873   0.41    −1735            No        190
XM       0.533037   3486653   0.67    −1722            No        185
EM       0.507794   6819714   1.23    −1657            Yes       1216
DBScan   0.461531   8230647   15.67   −1754            Yes       2517
HC       0.677124   5981504   14.78   −2013            Yes       103
LP       0.711025   5440447   7.76    N/A              No        0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Methods   KM     XM     EM     DBScan   HC     LP
G_net     1.08   1.15   1.11   1.23     1.00   1.32
purposes such as resource allocation, distribution evaluations, or summarizing geographical data into groups. The focus of this study was to design efficient methods that identify such optimal spatial groups, of certain sizes and positions, using clustering algorithms or their equivalents, for obtaining maximum total coverage. Examples include, but are not limited to: setting up mobile-phone base stations among an even distribution of users, each of whom may have a different usage demand; distributed sensors that monitor traffic volumes over a city; and security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether spatial groups identified by different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared by using data mining software programs. The identified spatial groups, with different values of data resources, were then assessed via six performance factors, with weights formulated as factor coefficients. The factors adopted were shown to play a significant role in MAUT (multiattribute utility theory). The performance under particular factors and weights may vary, as the factors can be chosen arbitrarily by users.
The spatial groups obtained by classic clustering algorithms have some limits, such as overlaps, which may cause resources to be wasted and even false grouping. However, to the authors' knowledge, no study reported in the literature uses a linear programming (LP) method to discover spatial groups and to overcome this limit of overlapping. Thus, in this research we implemented this new method (LP) to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation was used to assess the grouping results by considering multiple attributes.
For future work, we want to further enhance the algorithm, for example by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, and LP, though not the quickest, is efficient in finding optimal groupings without any overlap. It would be good if the advantages of one algorithm could carry over to the others in the fusion algorithms to be developed.
International Journal of
AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
RoboticsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Active and Passive Electronic Components
Control Scienceand Engineering
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
RotatingMachinery
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
Journal ofEngineeringVolume 2014
Submit your manuscripts athttpwwwhindawicom
VLSI Design
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Shock and Vibration
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Civil EngineeringAdvances in
Acoustics and VibrationAdvances in
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Electrical and Computer Engineering
Journal of
Advances inOptoElectronics
Hindawi Publishing Corporation httpwwwhindawicom
Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
SensorsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Navigation and Observation
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
DistributedSensor Networks
International Journal of
18 International Journal of Distributed Sensor Networks
Table 10 Numeric results of density of each cluster by using the six methods for dataset 1
Density KM EM DBScan XM HC LPCluster 0 5258648 0080823 4426289 3431892 2713810 1677869Cluster 1 1161390 2329182 0994949 1375497 3501739 1296230Cluster 2 7186556 2545750 0807500 1218667 2728017 9703279Cluster 3 2572683 1232386 1062069 5171040 4265905 9034426Cluster 4 5969350 142054 0170455 1510576 4088438 1239180Total density 1204343 1400359 4729787 1146972 1030703 6087049
Table 11 Numeric results of density of each cluster by using the six methods for dataset 2
Density KM XM EM DBScan HC LPCluster 0 1925445 2476642081 396813638 1972394643 5323785326 331318Cluster 1 1972395 1763496208 1502698729 1972394643 2140482869 166788Cluster 2 1408149 106489095 1629795665 1437189548 1823821619 8097989Cluster 3 3060449 6293956697 2015105986 1636350955 79912225 2474492Cluster 4 1773937 1058346213 1275299493 1212317249 6856982634 156958Total density 3896873 3486653421 6819713511 8230647036 5981503534 5440447
order to evaluate which method is an appropriate choicewhen it comes to different datasets and different usersrsquorequirements Among them the difference in balance iscontributed by the difference of grid cell number in eachcluster Meanwhile we assign each of them a proportionalweight 120596 to adjust the evaluation result 119866net The 120596 value isto be tuned by the users depending on their interests Forexample if a verywide coverage is of priority and others are ofless concern 119866
119888can take a relatively very large value or even
1 If users consider that some attributes are more importantthe corresponding weights 120596 for some factors can be largerthan the others Overall 119866net which is the sum of all factorsmultiplied by the corresponding performance indicators is anet indicator signifying how good a clustering process is byconsidering all the performance attributes
119866119897=
10038161003816100381610038161003816100381610038161003816
LikelihoodTime
10038161003816100381610038161003816100381610038161003816
(5)
119866119887=Difference of Balance
Time (6)
119866119889=DensityTime
(7)
119866119888=CoverageTime
(8)
119866119900=OverlapTime
(9)
119866net = 120596119897119866119897+ 120596119889119866119887+ 120596119889lowast 119866119889+ 120596119888119866119888+ 120596119900119866119900 (10)
Constraint 120596119897+ 120596119889+ 120596119887+ 120596119888+ 120596119900= 1 (11)
From the results of spatial grouping as experimented inthe previous sections we obtain some statistic informationon each group based on the second dataset as a range ofindicators depicted from (5) to (11) They are shown in
Table 12 which allows us to easily compare various methodsand performance aspects
In Table 12 KM method has the best run time and nooverlap For XMmethod DBScan and HC demonstrate theiradvantage in density and log-likelihood Nevertheless LPmethod is superior in three aspects coverage no overlapand zero difference of balance with other clusters In orderto further verify the correctness of the above analysis theperformance indicators 119866
119897 119866119887 119866119889 119866119888 and 119866
119900are computed
for obtaining the net performance values119866net assuming equalweights for each method For the sake of easy comparison119866net is normalized by first setting the lowest 119866net amongthe six methods as base value 1 then the 119866net for the othermethods is scaled up accordingly The comparison result isshown in Table 13
According to the experiment results conducted so farLP seems to be the best candidate in almost all the aspectssuch as coverage and balance This is tested across differentdatasets different formats and different sizes of datasetHowever for density and log-likelihood the result is not soconsistent as LP would be outperformed byDBScan at timesFinally by the net result of 119866net LP is a better choice underthe overall consideration of the six performance factorsThe choice of weights which imply priorities or preferenceson the performance aspects should be chosen by the userrsquosdiscretion
7 Conclusion and Future Works
Ubiquitous sensor network generated data that inherentlyhave spatial information When they are viewed afar thelocalizations of the data form some densities spatially dis-tributed over a terrain and the collected data from thesensors indicate how important the values are in their localproximity Given this information the users of the sensornetwork may subsequently want to form spatial clusters for
International Journal of Distributed Sensor Networks 19
Table 12 Performance indicators of the six methods based on dataset 2
Method Coverage Density Time Log-likelihood Overlap Diff of balanceKM 0595751 3896873 041 minus1735 No 190XM 0533037 3486653 067 minus1722 No 185EM 0507794 6819714 123 minus1657 Yes 1216DBScan 0461531 8230647 1567 minus1754 Yes 2517HC 0677124 5981504 1478 minus2013 Yes 103LP 0711025 5440447 776 NA No 0
Table 13 Comparison of different clustering and LP methods by119866net indicator
Methods KM XM EM DBScan HC LP119866net 108 115 111 123 100 132
purposes such as resource allocation distribution evalua-tions or summing up the geographical data into groups Thefocus of this study was to design efficient methods to identifysuch optimal spatial groups that have certain sizes andpositions using clustering algorithms or the equivalent forobtaining maximum total coverage in total Some examplesinclude but are not limited to setting up mobile phonebase stations among an even distribution of mobile phoneusers each may have different demand in usage distributedsensors that monitor the traffic volumes over a city andsecurity patrols in an exhibition where the asset values tobe protected vary and are distributed over a large area Thestudy also investigated whether spatial groups identified byusing different methods are sufficiently efficient for achievingoptimal maximum coverage Five classic spatial groupingalgorithms are discussed and compared in this study by usingdata mining software programsThe identified spatial groupswith different values of data resources were then assessedvia six performance factors Weights were also formulated asfactor coefficients The factors adopted were shown to playa significant role in MAUT (multiattribute utilities theory)The performance under proper factors and weights may varyas the factors could be arbitrarily chosen by users
The spatial groups obtained by classic clustering algo-rithms have some limits such as overlaps It may causeresource being wasted and even false grouping Howeverthere has been no study reported in the literature that theauthors are aware of using linear programming (LP) methodto discover spatial groups and to overcome this limit ofoverlappingThus in this research we implemented this newmethod (LP) to obtain spatial groups for yielding maximumcoverage and completely avoiding overlap A rigorous evalu-ation was used to assess the grouping results by consideringmultiple attributes
For future extended study we want to further enhancethe algorithm such as combining LP method with existingspatial group algorithms to achieve new hybrid algorithmSome clustering algorithms (eg KM) are known to convergequickly and LP though not the quickest it is efficient infinding the optimal groupings without any overlap It will be
good if the advantages from one algorithm to ride over theothers in the new fusion algorithms are to be developed
References
[1] G J Pottie and W J Kaiser ldquoWireless integrated network sen-sorsrdquo Communications of the ACM vol 43 no 5 pp 51ndash582000
[2] K H Eom M C Kim S J Lee and C W Lee ldquoThe vegetablefreshness monitoring system using RFID with oxygen andcarbon dioxide sensorrdquo International Journal of DistributedSensor Networks vol 2012 Article ID 472986 6 pages 2012
[3] G Manes G Collodi R Fusco L Gelpi and A Manes ldquoAwireless sensor network for precise volatile organic compoundmonitoringrdquo International Journal of Distributed Sensor Net-works vol 2012 Article ID 820716 13 pages 2012
[4] Y-G Ha H Kim and Y-C Byun ldquoEnergy-efficient fire mon-itoring over cluster-based wireless sensor networksrdquo Interna-tional Journal of Distributed Sensor Networks vol 2012 ArticleID 460754 11 pages 2012
[5] A Wahid and D Kim ldquoAn energy efficient localization-freerouting protocol for underwater wireless sensor networksrdquoInternational Journal of Distributed Sensor Networks vol 2012Article ID 307246 11 pages 2012
[6] T N Tran R Wehrens and L M C Buydens ldquoSpaRef a clus-tering algorithm for multispectral imagesrdquo Analytica Chimi-ca Acta vol 490 no 1-2 pp 303ndash312 2003
[7] G Ayala I Epifanio A Simo and V Zapater ldquoClusteringof spatial point patternsrdquo Computational Statistics and DataAnalysis vol 50 no 4 pp 1016ndash1032 2006
[8] J Domingo G Ayala and M E Dıaz ldquoMorphometric analysisof human corneal endothelium by means of spatial point pat-ternsrdquo International Journal of Pattern Recognition and ArtificialIntelligence vol 16 no 2 pp 127ndash143 2002
[28] J ErmanM Arlitt and AMahanti ldquoTraffic classification usingclustering algorithmsrdquo in Proceedings of the ACM Conferenceon Applications Technologies Architectures and Protocols forComputer Communication (SIGCOMM rsquo06) pp 281ndash286 PisaItaly September 2006
Table 12: Performance indicators of the six methods based on dataset 2.

Method   Coverage   Density    Time    Log-likelihood   Overlap   Diff. of balance
KM       0.595751   3.896873   0.41    −1735            No        190
XM       0.533037   3.486653   0.67    −1722            No        185
EM       0.507794   6.819714   1.23    −1657            Yes       1216
DBScan   0.461531   8.230647   15.67   −1754            Yes       2517
HC       0.677124   5.981504   14.78   −2013            Yes       103
LP       0.711025   5.440447   7.76    N/A              No        0
Table 13: Comparison of the different clustering and LP methods by the G_net indicator.

Method   KM     XM     EM     DBScan   HC     LP
G_net    1.08   1.15   1.11   1.23     1.00   1.32
purposes such as resource allocation, distribution evaluations, or summarizing geographical data into groups. The focus of this study was to design efficient methods that identify optimal spatial groups of certain sizes and positions, using clustering algorithms or their equivalents, so as to obtain maximum total coverage. Examples include, but are not limited to, siting mobile phone base stations among an even distribution of mobile phone users whose usage demands differ; deploying distributed sensors that monitor traffic volumes over a city; and planning security patrols in an exhibition where the asset values to be protected vary and are distributed over a large area. The study also investigated whether the spatial groups identified by different methods are sufficiently efficient for achieving optimal maximum coverage. Five classic spatial grouping algorithms were discussed and compared using data mining software. The identified spatial groups, with different values of data resources, were then assessed via six performance factors, with weights formulated as factor coefficients. The adopted factors were shown to play a significant role in MAUT (multiattribute utility theory); the resulting performance may vary with the chosen factors and weights, as these can be set arbitrarily by users.
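The weighted multi-attribute assessment described above can be sketched as follows. This is a minimal illustration, not the paper's exact scoring: the factor weights and the additive utility form are assumptions, and only three of the six factors (coverage, overlap, and difference of balance, taken from Table 12) are used.

```python
# Hedged sketch: ranking methods by a weighted additive MAUT score.
# Weights are illustrative assumptions, not the paper's coefficients.
methods = {
    "KM":     {"coverage": 0.595751, "overlap": 0, "balance": 190},
    "XM":     {"coverage": 0.533037, "overlap": 0, "balance": 185},
    "EM":     {"coverage": 0.507794, "overlap": 1, "balance": 1216},
    "DBScan": {"coverage": 0.461531, "overlap": 1, "balance": 2517},
    "HC":     {"coverage": 0.677124, "overlap": 1, "balance": 103},
    "LP":     {"coverage": 0.711025, "overlap": 0, "balance": 0},
}

# Higher coverage is rewarded; overlap and imbalance are penalised.
weights = {"coverage": 1.0, "overlap": -0.2, "balance": -0.0001}

def utility(attrs):
    """Weighted additive utility over the chosen factors."""
    return sum(weights[k] * attrs[k] for k in weights)

ranked = sorted(methods, key=lambda m: utility(methods[m]), reverse=True)
print(ranked)  # LP first: highest coverage, no overlap, no imbalance
```

Under these (assumed) weights LP ranks first, which is consistent with the G_net comparison in Table 13; changing the weights can reorder the middle of the ranking, which is exactly the user sensitivity noted above.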
The spatial groups obtained by classic clustering algorithms have limitations, such as overlaps, which can waste resources and even produce false groupings. To the best of the authors' knowledge, no study in the literature has used a linear programming (LP) method to discover spatial groups and overcome this overlapping problem. In this research we therefore implemented this new LP method to obtain spatial groups that yield maximum coverage while completely avoiding overlap. A rigorous evaluation considering multiple attributes was used to assess the grouping results.
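The no-overlap maximum-coverage selection can be sketched as a small 0/1 programme. The sketch below is an assumption-laden illustration, not the paper's LP model: groups are taken to be circles, the candidate centres, radii, and points are made up, and the programme is solved by exhaustive search rather than an LP solver for clarity.

```python
import itertools
import math

# Illustrative data (assumed): sensor points and candidate circular groups.
points = [(1, 1), (2, 1), (8, 8), (9, 8), (9, 9), (5, 5)]
candidates = [((1.5, 1.0), 1.0), ((8.7, 8.3), 1.2),
              ((5.0, 5.0), 1.0), ((8.0, 8.0), 2.5)]  # (centre, radius)

def covered(circle):
    """Points lying inside a circular group."""
    (cx, cy), r = circle
    return {p for p in points if math.dist(p, (cx, cy)) <= r}

def disjoint(c1, c2):
    """No overlap: centres further apart than the sum of radii."""
    (a, r1), (b, r2) = c1, c2
    return math.dist(a, b) > r1 + r2

# Choose k disjoint groups maximising the number of covered points.
k = 2
best, best_cov = None, -1
for combo in itertools.combinations(candidates, k):
    if all(disjoint(c1, c2) for c1, c2 in itertools.combinations(combo, 2)):
        n = len(set().union(*(covered(c) for c in combo)))
        if n > best_cov:
            best, best_cov = combo, n
print(best_cov)  # -> 5 points covered by the best disjoint pair
```

A production version would express the same objective and pairwise disjointness constraints as a 0/1 linear programme and hand it to a solver, since exhaustive search grows combinatorially with the number of candidates.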
For future work we plan to enhance the algorithm further, for example by combining the LP method with existing spatial grouping algorithms into new hybrid algorithms. Some clustering algorithms (e.g., KM) are known to converge quickly, while LP, though not the quickest, is efficient in finding optimal groupings without any overlap. Ideally, the advantages of each algorithm would carry over to the others in the fusion algorithms to be developed.
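One plausible shape for such a hybrid, sketched under assumptions the paper does not specify: use K-means for its fast convergence to propose group centres, then post-process the radii so that no two groups overlap, in the spirit of the LP method's no-overlap guarantee. The radius rule below (half the distance to the nearest other centre) is an illustrative choice, not the proposed algorithm.

```python
import math
import random

random.seed(0)
# Assumed synthetic data: two well-separated blobs of sensor locations.
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(30)] + \
         [(random.gauss(6, 1), random.gauss(6, 1)) for _ in range(30)]

def kmeans(pts, k, iters=20):
    """Plain Lloyd's algorithm: assign to nearest centre, recompute means."""
    centres = random.sample(pts, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            i = min(range(k), key=lambda i: math.dist(p, centres[i]))
            groups[i].append(p)
        centres = [(sum(x for x, _ in g) / len(g),
                    sum(y for _, y in g) / len(g)) if g else centres[i]
                   for i, g in enumerate(groups)]
    return centres

centres = kmeans(points, 2)
# Cap each group's radius at half the nearest inter-centre distance,
# which makes the resulting discs pairwise non-overlapping by construction.
radii = [min(math.dist(c, d) for d in centres if d != c) / 2 for c in centres]
print(centres, radii)
```

The appeal of this direction is that the expensive no-overlap reasoning is reduced to a cheap post-processing step, while the LP formulation could still be applied afterwards to fine-tune sizes and positions for maximum coverage.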
References
[1] G. J. Pottie and W. J. Kaiser, "Wireless integrated network sensors," Communications of the ACM, vol. 43, no. 5, pp. 51–58, 2000.
[2] K. H. Eom, M. C. Kim, S. J. Lee, and C. W. Lee, "The vegetable freshness monitoring system using RFID with oxygen and carbon dioxide sensor," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 472986, 6 pages, 2012.
[3] G. Manes, G. Collodi, R. Fusco, L. Gelpi, and A. Manes, "A wireless sensor network for precise volatile organic compound monitoring," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 820716, 13 pages, 2012.
[4] Y.-G. Ha, H. Kim, and Y.-C. Byun, "Energy-efficient fire monitoring over cluster-based wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 460754, 11 pages, 2012.
[5] A. Wahid and D. Kim, "An energy efficient localization-free routing protocol for underwater wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 2012, Article ID 307246, 11 pages, 2012.
[6] T. N. Tran, R. Wehrens, and L. M. C. Buydens, "SpaRef: a clustering algorithm for multispectral images," Analytica Chimica Acta, vol. 490, no. 1-2, pp. 303–312, 2003.
[7] G. Ayala, I. Epifanio, A. Simo, and V. Zapater, "Clustering of spatial point patterns," Computational Statistics and Data Analysis, vol. 50, no. 4, pp. 1016–1032, 2006.
[8] J. Domingo, G. Ayala, and M. E. Díaz, "Morphometric analysis of human corneal endothelium by means of spatial point patterns," International Journal of Pattern Recognition and Artificial Intelligence, vol. 16, no. 2, pp. 127–143, 2002.
[9] E. Demir, C. Aykanat, and B. Barla Cambazoglu, "Clustering spatial networks for aggregate query processing: a hypergraph approach," Information Systems, vol. 33, no. 1, pp. 1–17, 2008.
[10] T. Hu and S. Y. Sung, "A hybrid EM approach to spatial clustering," Computational Statistics and Data Analysis, vol. 50, no. 5, pp. 1188–1205, 2006.
[11] G. Lin, "Comparing spatial clustering tests based on rare to common spatial events," Computers, Environment and Urban Systems, vol. 28, no. 6, pp. 691–699, 2004.
[12] M. Ester and H.-P. Kriegel, "Clustering for mining in large spatial databases [Special Issue on Data Mining]," KI-Journal, vol. 1, pp. 332–338, 1998.
[13] J. Han, M. Kamber, and A. K. H. Tung, "Spatial clustering methods in data mining: a survey," Tech. Rep., Computer Science, Simon Fraser University, 2000.
[14] H.-D. Yang and F.-Q. Deng, "The study on immune spatial clustering model based on obstacle," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 1214–1219, August 2004.
[15] T.-S. Chen, T.-H. Tsai, Y.-T. Chen et al., "A combined K-means and hierarchical clustering method for improving the clustering efficiency of microarray," in Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS '05), pp. 405–408, Hong Kong, China, December 2005.
[16] M. Srinivas and C. K. Mohan, "Efficient clustering approach using incremental and hierarchical clustering methods," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '10), pp. 1–7, July 2010.
[17] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[18] A. Hinneburg and D. A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65, 1998.
[19] K. Elangovan, GIS: Fundamentals, Applications and Implementations, 2006.
[20] S. Chawla and S. Shekhar, "Modeling spatial dependencies for mining geospatial data: an introduction," Geographic Data Mining and Knowledge Discovery, vol. 75, no. 6, pp. 112–120, 1999.
[21] M.-Y. Cheng and G.-L. Chang, "Automating utility route design and planning through GIS," Automation in Construction, vol. 10, no. 4, pp. 507–516, 2001.
[22] Q. Cao, B. Bouqata, P. D. Mackenzie, D. Messier, and J. J. Salvo, "A grid-based clustering method for mining frequent trips from large-scale event-based telematics datasets," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '09), pp. 2996–3001, San Antonio, Tex, USA, October 2009.
[23] K. Krishna and M. N. Murty, "Genetic K-means algorithm," IEEE Transactions on Systems, Man, and Cybernetics B, vol. 29, no. 3, pp. 433–439, 1999.
[24] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning, pp. 727–734, 2000.
[25] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[26] P. Bajcsy and N. Ahuja, "Location- and density-based hierarchical clustering using similarity analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 9, pp. 1011–1015, 1998.
[27] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.
[28] J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '06), pp. 281–286, Pisa, Italy, September 2006.