a system for social network extraction of web complex structures



(IJCSIS) International Journal of Computer Science and Information Security,

Vol. 9, No. 8, 2011 

A system for social network extraction of web complex structures

Amir Ansari, Mehrdad Jalali

Department of Software Engineering

Mashhad Branch, Islamic Azad University, Mashhad, Iran

Abstract—A social network is a social structure composed of nodes, generally individuals or organizations, that are connected through one or more common traits. Social networks are now applied across wide parts of the web. Their popularity is due to the discovery of connections hidden in the web and their representation in a visual manner. Social networks are used at many levels, from households to nations. They play an important role in finding solutions to problems, in the management of organizations, and in the rate at which individuals succeed in meeting their goals. The social structure of the web is difficult to define, and the resulting structures are often very complex, hard to perceive, and unusable. Existing approaches are usually limited to a specific usage, and the steps for producing a network from each structure are not provided comprehensively and homogeneously. In the present study we present a comprehensive system to discover social networks from any structure. We use web usage mining techniques to discover hidden data in the web server log file and use these data to address the challenges of social network extraction. The system provides a novel architecture for user clustering. We use the site structure to achieve better results and to delete users' review pages. The results of the analysis indicate that this kind of discovery can be used for most applications and structures, and that the system discovers as many of the available connections in a web site as possible.

Keywords- Social Network, Social Network Extraction, Social Network Analysis, Web Usage Mining

I.  INTRODUCTION

With the growing volume of information and the development of the web, methods and techniques are needed that enable the discovery of useful information from available data. Web mining is a field of study that discovers information from web documents and services using data mining techniques. Web mining is divided into three classes: web content mining, web structure mining, and web usage mining [1]. Web content mining discovers useful information such as text, pictures, and audio from the contents of web documents. Web structure mining discovers structural information from the web and determines the correlation graph pattern of the site. Web usage mining discovers useful patterns from the data produced by the interactions between users and web servers. Applications of these kinds of discovery usually include web personalization, the creation of adaptive web sites, user modeling, and so on. The increasing use of this technique is due to its data, in which all the movements of a user are gathered in a homogeneous manner by the web server without the user's direct intervention. In addition, every site has this data set, and the user is not required to create a profile within the system.

A social network is a social structure composed of nodes, generally individuals or organizations, that are connected to each other through some traits. Nowadays, social networks are among the most popular web services, and the world recognizes internet social networks not only as a place for getting to know friends but also as a network for collective communication. Among the challenges in this area are the specialization of each network, its complexity, and the lack of any comprehensive system for mining a social network from any kind of site. A comprehensive system is therefore required that can be implemented for any kind of structure and application. In the present study, we present a system for social network extraction that, in addition to being implementable on any site, discovers useful networks from the data available in the web server log file using web usage mining techniques. The system is not limited to any site or structure, and it is able to extract all aspects of social networks, such as commercial, cultural, personal, political, and military aspects. In addition, the proposed system extracts information and maps the social network at an acceptable speed, and it resolves problems related to new users.

Section 2 describes related work. Section 3 defines the proposed system and its components. Section 4 describes the system evaluation, and Section 5 analyzes the extracted social network. Finally, Section 6 presents the conclusions.

II.  RELATED WORKS

In recent years, social networks have grown considerably. Social networks are generally very attractive places, programmed using the newest technologies. Every part has been programmed and designed purposefully to achieve all the objectives defined for the website. Internet social networks are now not only places for getting to know friends but also networks for collective communication. Several studies have been performed on this and related subjects, some of which are reviewed below.

In [2], web usage mining is used to discover relationships between web pages and their occurrence in user sessions through correlation rules. These relations are usually used for personalization. In addition, relationships between users are obtained through their item sets. The K-means algorithm is used to improve this method by clustering user interactions. A cluster of interactions represents users with similar

67 http://sites.google.com/site/ijcsis/

ISSN 1947-5500


behaviors. This method is not suitable for large-scale data.

In [3], a system called WebPUM is introduced. It performs online prediction using web usage mining and proposes a new technique for classifying user navigation patterns, which it uses to predict users' future behavior. In this technique, a new graph partitioning algorithm is used to model user navigation patterns, and the longest common subsequence algorithm is used to classify user activities.

Another clustering-based method is provided in web-CANVAS. In this method, users with similar navigation patterns are placed in the same cluster [4]. Because the clusters are predefined and static, this technique creates some limitations for improving the web site in the future [3].

Various methods have been developed to recognize web communities; they are divided into methods based on hyperlink analysis and methods based on graph theory. Among the methods based on hyperlink analysis are the studies in [5] and [6]. The method in [5] receives an initial set of pages as input and discovers their communities. It uses the related page algorithm (RPA) to identify similar pages. Using this algorithm, similar pages are divided into groups and web communities are obtained. The technique in [6] is one of the most important methods for recognizing web communities. In this method, sets of hub and authority pages are introduced as web communities. An authority page contains valuable information on a certain subject. A hub page contains links to authority pages. The technique recognizes hub and authority pages using the HITS algorithm. Because of the rapid growth of web pages and the dependence of this algorithm on web site structure, there are some problems in using it in the real world.

In [8], a system called Referral Web is developed to extract a social network from the web. This system focuses on the names of individuals within web pages. A search engine provides the relationship between these names. The strength of the relationship between two individuals x and y is obtained by querying the search engine for "x and y". Two persons are more strongly related if they appear together more often in home pages, scientific articles, and organizational charts. The obtained similarity provides a path from one person to another.

In [9,10,11,12], methods based on graph theory are used. However, because of the rapid growth of the web and the large scale of the data, algorithms based on graph theory are too time-consuming to be practical. Web communities are defined as dense parts of the graphs obtained from the relationships between users and the way they move through web pages. In [9], web communities are obtained from complete bipartite graphs using the trawling technique. In [10], web communities are obtained by discovering all complete bipartite $K_{3,3}$ graphs and merging them. In [11,12], using the maximum flow algorithm, sets of nodes that have more links within the set than links out of the set are considered web communities.

In [13], a system called Flink is described for the extraction, compression, and online presentation of social networks of virtual web communities. These social networks are built using web page analysis, e-mail messages, magazines, and self-created profiles (FOAF files). The web mining component used in Flink is consistent with [8]. To discover related names, this method counts how often two names occur together, with the help of a descriptive query sent to a search engine.

In [14,15], extensive studies have been conducted on the virtual web. The PANKOW system provides pattern-based annotation through knowledge on the web. Placing the name of an entity in several linguistic patterns yields candidate concepts for it. The relationship between instances and their concepts is obtained by sending queries to the Google API. The patterns with the greatest support indicate the concept of the named entity on the web. The main idea in PANKOW is self-annotation: with access to the general data and structure of the web, it attempts to annotate local resources, which in turn helps the semantic web set itself up.

In [16], a web is extracted from the log file, which reflects how users move through the site. For this purpose, statistical and data mining techniques are used: association rule mining, frequent closed patterns, and sequential association rules. Using these techniques, high-frequency items in the collection are discovered, and the resulting information is sent to the site manager to make advertisements more targeted. Among the benefits of this technique is the graphical representation of how users move through the site. However, using algorithms such as frequent closed patterns and association rule mining to find frequent items in a large log file is time-consuming. In addition, the technique does not handle dynamic pages.

In summary, most work on mining social networks has used auxiliary systems, user profiles, or log files. Such work has generally been designed for a specific purpose, such as commercial or educational use. The techniques used to mine social networks are usually graph-based methods, which are not usable for large-scale data because they are time-consuming and require a huge amount of memory. There is therefore a need for a system that mines social networks from log files, because of the richness and homogeneity of the user information in these files. Such a system should mine all the relationships and aspects required for social networks.

III.  PROPOSED SYSTEM 

The system proposed in the present study mines a social network from the log file available on the web server. Figure 1 shows the architecture of this system.

Figure 1. Architecture of the proposed system


The system uses the log file. The web server records every user access in the log file. Log files come in several different structures, depending on the web server configuration. Log files with different structures usually share fields such as the client IP address, the request time, the requested page, and the HTTP status code. The file is usually large and contains superfluous information. To mine patterns from the contents of the log file, a step called data preparation must first be applied to the data so that consistent data become available. This step includes data collection, data cleaning, session identification, user identification, and data summarization.

1) Data collection: In this step we collect the log files from the several web servers related to the site.

2) Data cleaning: This step cleans the raw web data. It reviews the available data and removes superfluous items. When a user requests a page, the request is recorded in the log file, and the page's embedded contents, such as pictures, audio, and video, are recorded as new requests from the user without his or her direct intervention. Requests recorded by robots and web crawlers, CGI scripts, and information relating to a web page, such as pictures, audio, and video files, are considered superfluous [17]. Correctly recognizing and removing superfluous information improves the quality of the output.
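The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the entry fields (`path`, `agent`), the suffix list `MEDIA_SUFFIXES`, and the substrings in `ROBOT_MARKERS` are all assumptions chosen for the example, and a real site would tune them.

```python
# Hypothetical filters; the paper does not list concrete suffixes or robot names.
MEDIA_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js", ".mp3", ".mp4", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_superfluous(entry):
    """Return True for a parsed log entry that the cleaning step would drop."""
    path = entry["path"].lower().split("?", 1)[0]   # ignore the query string
    agent = entry.get("agent", "").lower()
    if path.endswith(MEDIA_SUFFIXES):                # embedded pictures, audio, video
        return True
    if path.startswith("/cgi-bin/"):                 # CGI scripts
        return True
    if path == "/robots.txt" or any(m in agent for m in ROBOT_MARKERS):
        return True                                  # robots and crawlers
    return False


def clean(entries):
    """Keep only the requests that reflect a deliberate page view."""
    return [e for e in entries if not is_superfluous(e)]
```

Here each entry is assumed to be a dictionary already parsed from one log line; parsing itself depends on the log structure of the web server.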

3) Session identification: A user session reflects the user's behavior, so correct identification of sessions is very important. A user session is the set of pages visited by the user during a single visit.

$$S = \langle p_1, p_2, \ldots, p_n \rangle$$

Various heuristics are used to identify user sessions [20]. They are classified into time-based and subject-based methods. In time-based methods, a set of pages visited by a user is considered one session if the pages are visited within a defined time limit. This limit is the session timeout; its value varies, and 25.5 minutes has been proposed. A shortcoming of this technique is its elasticity, which may mislead the system when identifying the end of a session. Subject-based methods treat a series of visited pages as a session when the pages are related or when they complete a conceptual unit of work for the user. The shortcoming of these methods is the difficulty of finding the relationship between pages or of defining a conceptual unit of work. In the present study, we use the time-based method.

4) User identification: Identifying users from the log file is the most important step in data preparation. There are various methods for identifying users. Several systems identify their users through login. This technique is not acceptable, however, because users avoid logging in, because the accuracy of the information entered by users is uncertain, and because not all sites are able to use it. The simplest way is to treat every distinct IP address as a user, but the precision of this method is low because of proxy servers [18]. Cookies are also useful for identifying the visitors of a site, but they are not usually used because they are considered a threat to security and privacy and because users connect to the server from different machines. Because of these problems, heuristic techniques are used. In [17], the web site structure is combined with the log file to discover users: a new user is presumed to have accessed the site if the IP address of a page request is the same as the IP address of another page request and there is no direct link between the two pages. In [19], user identification is performed by combining information available in the log file, such as the IP address, the operating system, and the browser software. In the present study we identify users using the technique developed in [19].
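User identification by IP address, operating system, and browser, followed by time-based session identification, might look like the sketch below. The field names (`ip`, `os`, `browser`, `time`) are hypothetical, and the timeout is interpreted here as the maximum gap between two consecutive requests, which is one common reading of a time-based method and an assumption of this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# 25.5 minutes is the value mentioned in the text; treat it as a tunable default.
SESSION_TIMEOUT = timedelta(minutes=25.5)


def identify_users(entries):
    """Group requests by the (IP, operating system, browser) triple."""
    users = defaultdict(list)
    for e in entries:
        users[(e["ip"], e["os"], e["browser"])].append(e)
    return users


def split_sessions(requests, timeout=SESSION_TIMEOUT):
    """Start a new session whenever the gap to the previous request exceeds the timeout."""
    sessions = []
    for e in sorted(requests, key=lambda r: r["time"]):
        if sessions and e["time"] - sessions[-1][-1]["time"] <= timeout:
            sessions[-1].append(e)       # still within the current visit
        else:
            sessions.append([e])         # first request, or gap too long
    return sessions
```

Two requests from the same IP address but with a different operating system or browser end up as different users, which is the point of combining the three fields.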

5) Data summarization: The amount of time a user stays on a page indicates the value of that page from the user's point of view, and this parameter is of great importance. If the time a user spends on page P is low, the page has low importance from the user's point of view. A value filter deletes the pages that the user requested but that have low value. In general, however, it is not sound to delete all pages with a low visiting time, since the time a user spends on a page does not depend only on his or her interest in it. The visiting time also depends on the properties and content of the page. Statistical tests may identify some of these properties. For example, users spend less time on search pages than on content pages. Web pages are divided into three categories. First, review pages, which have little content and generally consist of a list of links to content pages; these pages are used only for reviewing the site. Second, content pages, which contain the data of interest to the user. Third, combination pages, which have properties of both previous groups. The boundary between these categories is not well defined, because it depends strongly on the user's point of view and behavior: a page may be considered a search page by one user and a content page by another.

The information extracted in the final steps of preparation includes all the pages available in the site, the site users, the pages viewed by every user, the time spent on each page by every user, and the sessions of all users. We use this information and statistical techniques to derive the average time spent on each page by every user and the number of accesses to each page across all sessions, as well as the number of users, pages, and sessions. The extracted information is stored in encoded form to increase system speed. Pages with the same name but different parents have different codes. This collection is placed in the User Data relational database.
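As a rough illustration of the summarization described above, the sketch below derives per-page access counts and average times and assigns integer codes to pages. The record layout `(user, page, seconds)` and the function names are assumptions made for this example; the paper does not specify the database schema.

```python
from collections import defaultdict


def summarize(visits):
    """visits: iterable of (user, page, seconds_on_page) records.

    Returns {page: (number_of_accesses, average_seconds_per_access)}.
    """
    total_time = defaultdict(float)
    accesses = defaultdict(int)
    for _user, page, seconds in visits:
        total_time[page] += seconds
        accesses[page] += 1
    return {page: (accesses[page], total_time[page] / accesses[page]) for page in accesses}


def encode_pages(pages):
    """pages: iterable of (parent, name) pairs.

    The same name under a different parent receives a different code.
    """
    codes = {}
    for key in pages:
        codes.setdefault(key, len(codes))
    return codes
```

Keying the code on the `(parent, name)` pair is one simple way to keep same-named pages under different parents apart.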

A.  Novel multi-level clustering architecture (NMLC-SN)

The proposed architecture consists of the following steps: site structure mining, outlier deletion, user interest discovery, user clustering, and compression. The steps are performed sequentially and use the data sets available in the User Data database. Each step uses the data sets prepared by the previous step. The steps are described below.

1) Site structure mining: This step mines the site structure to support a better representation of the social network as well as its compression. All the pages of the site present in the User Data database are used for site structure mining. For this purpose, we use the BFS algorithm as follows:

a) Find the root of the structure (the root is located at level 0 and has no parent).


b) Discover the nodes at level i.

c) Record each level-i node in the database with its node name, node code, and the number and level of its parent.

d) Add one to i and go to step b) until no new nodes are found.

The resulting structure is an undirected graph without cycles.
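The BFS steps above can be sketched as follows. The input `links` (a mapping from each page to the pages reachable from it) is an assumption of this example; the paper does not state how parent and child pages are derived from the User Data database.

```python
from collections import deque


def mine_site_structure(links, root):
    """Breadth-first traversal of the site starting at `root`.

    links: dict mapping a page to the list of pages it leads to.
    Returns {page: (level, parent)}; each page keeps the first parent that reaches it,
    so the recorded structure contains no cycles.
    """
    structure = {root: (0, None)}          # step a: the root is at level 0 and has no parent
    queue = deque([root])
    while queue:                           # steps b to d: expand the site level by level
        page = queue.popleft()
        level = structure[page][0]
        for child in links.get(page, []):
            if child not in structure:     # already recorded pages are skipped
                structure[child] = (level + 1, page)
                queue.append(child)
    return structure
```

For example, with `links = {"/": ["/a", "/b"], "/a": ["/c", "/b"], "/b": ["/"]}` and root `"/"`, page `/c` is recorded at level 2 with parent `/a`, and the back link from `/b` to the root is ignored.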

2) Outlier deletion: This step deletes unhelpful pages to improve the quality of the mined social network. In this step, pages with a low number of accesses and a low average time spent over all accesses are called outlier pages. Because of their low effect on the system, these pages produce scattered clusters, which in turn yield a social network with a large amount of useless information. Pages with low access are pages in which users show the least interest. Pages with a low average time spent are search pages or pages in which users show little interest. The parameters α and β are used as a filter to delete outliers: pages with fewer than α accesses and an average time per access lower than β seconds are deleted. These parameters must be set so that they cause the least data loss while covering all outliers. It is worth mentioning that the values of these parameters vary from site to site; that is, the content and the average time spent on all site pages affect the values of α and β. For example, if the lowest average time spent over all site pages is 30, a predefined value of 15 is not suitable for β. The value may be set by an expert or by the site manager.
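A minimal sketch of the α/β filter described above, assuming per-page statistics of the form `(access_count, average_seconds)`; the function name and the data layout are hypothetical.

```python
def delete_outliers(page_stats, alpha, beta):
    """Drop outlier pages.

    page_stats: {page: (access_count, average_seconds_per_access)}.
    A page is an outlier when it has fewer than `alpha` accesses AND
    an average time per access below `beta` seconds.
    """
    return {
        page: (count, avg)
        for page, (count, avg) in page_stats.items()
        if not (count < alpha and avg < beta)
    }
```

Note that both conditions must hold for a page to be deleted, so a rarely visited page with a long average stay is kept.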

3) User interest discovery: This step discovers the pages of interest to each user. The number of visits by every user to each page is calculated; the number of visits to a page indicates the user's interest in that page. A threshold µ is set for the number of visits to each page, and every page with a number of visits lower than or equal to µ is deleted. The pages in which users are not interested are those they visited less than the other pages of the site. If these pages were used in clustering, users would be placed in the wrong clusters, and some clusters with a small number of users would be created, which are noise from the end user's point of view. The value of µ is set according to the needs of the person who will use the social network. If that person needs a network with all details, µ can be set to a low value; if a network with fewer details is needed, the value can be increased.

4) Users clustering: This step clusters users based on their behavior or on page values. For this purpose, the surveyed pages and the number of visits to each page by every user are passed to this step. Table I shows a sample of the information collected in the previous step, which is used in this step to cluster users.

TABLE I. AN EXAMPLE OF USERS AND THEIR SURVEYED INFORMATION.

         A    B    C    D    E    F
User0    0    0    2    5    0    0
User1    4    0    4    3    0    0
User2    0    2    4    8    0    0
User3    3    3    0    0    0    3
User4    2    3    0    0    0    3
User5    4    0    0    0    0    4
User6    0   14    0    0    0    4
User7    0    0    5    0    5    0
User8    8    0    0    3    7    0
User9    0    0    5    2    6    0
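The interest threshold µ of step 3, which produces part of the zeros in a table like Table I, can be sketched as below. The nested dictionary layout and the function name are assumptions for illustration only.

```python
def apply_interest_threshold(visit_counts, mu):
    """Zero out pages a user visited `mu` times or fewer.

    visit_counts: {user: {page: number_of_visits}}.
    """
    return {
        user: {page: (n if n > mu else 0) for page, n in pages.items()}
        for user, pages in visit_counts.items()
    }
```

With `mu=1`, a user who visited page A once and page D five times keeps only the count for D; the count for A becomes 0.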

The pages in the data passed to this step come from the site structure obtained in step 1 and were not deleted by outlier deletion in step 2. The figures in the table indicate the number of visits by each user to pages A, B, C, D, E, and F. A zero in the table has one of two causes: either the user did not visit the page, or the user's number of visits was at or below the threshold µ of step 3. In this step, the data are first transformed into records for every user; in other words, it is determined how many times a user visited each page. The difference from the data collected in the previous step is that the data are changed from a collection into components with three properties. Table II shows the data of Table I transformed into components with three properties.

TABLE II. INFORMATION OF TABLE I TRANSFORMED TO COMPONENTS WITH 3 PROPERTIES.

Users   Pages   Number of user's visits
1       A       4
3       A       3
4       A       2
5       A       4
8       A       8
2       B       2
3       B       3
4       B       3
6       B       14
0       C       2
1       C       4
2       C       4
7       C       5
9       C       5
0       D       5
1       D       3
2       D       8
8       D       3
9       D       2
7       E       5
8       E       7
9       E       6
4       F       3
5       F       4
6       F       4
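The transformation from a visit matrix to components with three properties can be sketched as follows. The example uses only the rows of User0 to User2 from Table I; the function name and the matrix layout are assumptions of this sketch.

```python
def to_triples(matrix, pages):
    """Turn a user-by-page visit matrix into (user, page, visits) components.

    matrix: {user_id: [visits for each page, in the order of `pages`]}.
    Zero entries are skipped; output is ordered by page, then by user.
    """
    return [
        (user, page, row[k])
        for k, page in enumerate(pages)
        for user, row in sorted(matrix.items())
        if row[k] > 0
    ]


# Rows of User0, User1, and User2 taken from Table I.
sample = {0: [0, 0, 2, 5, 0, 0], 1: [4, 0, 4, 3, 0, 0], 2: [0, 2, 4, 8, 0, 0]}
triples = to_triples(sample, "ABCDEF")
```

For these three users the result starts with `(1, 'A', 4)` and `(2, 'B', 2)`, in the same page-major order as Table II.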


For user clustering, the users who visited a certain page are placed in the same cluster because they have common interests. Moreover, because the parameters applied in the previous steps deleted outlier pages, intermediate pages, and pages in which users are not really interested, the remaining pages are truly of interest to the users visiting them. Table III shows the clustering of the information in Table II.

TABLE III. TABLE II INFORMATION'S CLUSTERING.

Cluster   Users
0         1,3,4,5,8
1         2,3,4,6
2         0,1,2,7,9
3         0,1,2,8,9
4         7,8,9
5         4,5,6
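Grouping users by the page they visited, as described above, can be sketched from the components of Table II. The assumption that cluster numbers 0 to 5 follow the page order A to F is ours; the paper does not state the numbering rule explicitly.

```python
from collections import defaultdict

# (user, page, visits) components copied from Table II.
TABLE_II = [
    (1, "A", 4), (3, "A", 3), (4, "A", 2), (5, "A", 4), (8, "A", 8),
    (2, "B", 2), (3, "B", 3), (4, "B", 3), (6, "B", 14),
    (0, "C", 2), (1, "C", 4), (2, "C", 4), (7, "C", 5), (9, "C", 5),
    (0, "D", 5), (1, "D", 3), (2, "D", 8), (8, "D", 3), (9, "D", 2),
    (7, "E", 5), (8, "E", 7), (9, "E", 6),
    (4, "F", 3), (5, "F", 4), (6, "F", 4),
]


def cluster_by_page(triples):
    """Place all users who visited the same page in one cluster.

    Clusters are numbered in alphabetical page order (an assumption of this sketch).
    """
    by_page = defaultdict(list)
    for user, page, _visits in triples:
        by_page[page].append(user)
    return {cid: sorted(users) for cid, (_page, users) in enumerate(sorted(by_page.items()))}


clusters = cluster_by_page(TABLE_II)
```

Run on the Table II components, this yields the six user lists of Table III. A user can appear in several clusters, one for each page of interest.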

After clustering the users, we assign a weight to each cluster to show its importance. Clusters are ordered by importance, and clusters with low or no importance are deleted. The weighting of clusters is chosen according to the needs of the person who will use the social network. The weight of each cluster can be defined in two ways:

a) The number of users in the cluster.

b) The number of visits to the page by all users, that is, how many times the page on which the clustering is based was visited by all users of the cluster.

The number of users in each cluster represents the navigation path and the focus of users on certain pages of the site. The number of visits to a page by all users represents the valuable pages of the site. The weight of each cluster is determined relative to the weights of the other clusters and is given by the following relation:

$$W_i = \frac{w_i}{\sum_{j} w_j} \qquad (1)$$

In this equation, $W_i$ is the weight assigned to cluster $i$, which lies in [0, 1]; $w_i$ is the cluster weight according to the method used to determine the weight; and the denominator is the total weight of all clusters. Table IV shows the weights of the clusters of Table III obtained by both described methods.

TABLE IV. CLUSTER WEIGHT OBTAINED IN TABLE III

Cluster   Weight based on number      Weight based on number
          of users in the cluster     of page views by users
0         0.2                         0.193
1         0.16                        0.202
2         0.2                         0.138
3         0.2                         0.174
4         0.12                        0.165
5         0.12                        0.128
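The normalization of equation (1) can be sketched as below for weighting method (a), using the cluster sizes of Table III. The function name is hypothetical, and the sketch assumes at least one cluster with a non-zero raw weight.

```python
def normalize_weights(raw):
    """Apply W_i = w_i / sum_j w_j to a {cluster: w_i} mapping."""
    total = sum(raw.values())
    return {cluster: w / total for cluster, w in raw.items()}


# Method (a): w_i is the number of users in each cluster of Table III.
users_per_cluster = {0: 5, 1: 4, 2: 5, 3: 5, 4: 3, 5: 3}
weights = normalize_weights(users_per_cluster)
print({c: round(w, 2) for c, w in weights.items()})
# prints {0: 0.2, 1: 0.16, 2: 0.2, 3: 0.2, 4: 0.12, 5: 0.12}
```

These values agree with the first weight column of Table IV. The same function can be applied to method (b) by passing the total number of page views per cluster as the raw weights.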

After the cluster weights are defined, we apply a threshold that deletes clusters of low importance or clusters that the user of the social network does not need. The threshold parameter is a lower limit on cluster weight: clusters whose weight is below the threshold are deleted, which removes superfluous clusters from the network. The value of this parameter is set according to the needs of the social network's user. It removes a number of edges from the network; the number of edges reflects the level of detail in the extracted social network.
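As a minimal sketch of this thresholding step, the weights from the first column of Table IV can be filtered and ranked as follows. The function name and the threshold value 0.15 are assumptions chosen for illustration; the paper does not report the value it used here.

```python
def apply_threshold(weights, threshold):
    """Keep only the clusters whose weight is not lower than the threshold."""
    return {cluster: w for cluster, w in weights.items() if w >= threshold}


# Weights based on the number of users, copied from Table IV.
weights = {0: 0.2, 1: 0.16, 2: 0.2, 3: 0.2, 4: 0.12, 5: 0.12}

# 0.15 is an illustrative threshold, not a value taken from the paper.
kept = apply_threshold(weights, 0.15)

# Order the surviving clusters from most to least important.
ranked = sorted(kept, key=kept.get, reverse=True)
print(ranked)
```

With this threshold, clusters 4 and 5 are dropped. Setting the threshold to zero keeps every cluster.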

5)  Compression:  This step aims to improve the representation of the social network and to reduce its complexity. It applies only to networks extracted by weighting clusters by their number of users. Compression is performed by discovering similar clusters. Two clusters are similar when the set difference between the users of each cluster and the intersection of the users of both clusters is empty, that is, when both clusters contain exactly the same users. Merging similar clusters reduces the number of clusters, compresses the social network and makes it easier to display. For example, assume that user 1 belongs to clusters A, B and C, user 2 also belongs to clusters A, B and C, and user 3 belongs to clusters A and B. Then A and B can be merged into a single cluster AB containing users 1, 2 and 3, while users 1 and 2 remain in cluster C.

This example is shown in Figure 2.

Before compression:

Users    A    B    C
1        *    *    *
2        *    *    *
3        *    *

After compression:

Users    A,B    C
1        *      *
2        *      *
3        *

Figure 2. An example of compression
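The compression step can be sketched by grouping clusters on their user sets, so that clusters with identical users collapse into one. This is an illustrative reading of the definition above, not the authors' code; the function name `compress` and the tuple-of-names result format are assumptions.

```python
from collections import defaultdict


def compress(clusters):
    """Merge clusters whose user sets are identical.

    `clusters` maps a cluster name to a set of user ids. The result maps a
    tuple of the merged cluster names to the user set they share.
    """
    names_by_users = defaultdict(list)
    for name, users in clusters.items():
        names_by_users[frozenset(users)].append(name)
    return {
        tuple(sorted(names)): set(users)
        for users, names in names_by_users.items()
    }


# The example of Figure 2: users 1 and 2 are in A, B and C; user 3 is in A and B.
clusters = {"A": {1, 2, 3}, "B": {1, 2, 3}, "C": {1, 2}}
compressed = compress(clusters)
print(compressed)
```

Keying on a `frozenset` makes the comparison independent of the order in which users are listed, and a single pass over the clusters is enough to find every group of identical clusters.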

IV.  SYSTEM EVALUATION 

In this section, the evaluation parameters used in this study are introduced.

 A.  Evaluation parameters

Information retrieval metrics such as Recall, Precision and

F-Measure have been used to test accuracy and efficiency

[21,22].

1)   Recall: A common metric for evaluating the usefulness of the proposed algorithm, defined as in equation 2:

$$\text{Recall} = \frac{|T_h \cap T_r|}{|T_h|} \qquad (2)$$

where T_h is the set of all members of the cluster obtained by the expert and T_r is the set of all members of the cluster obtained by the system.

2)  Precision: A common metric for evaluating the usefulness of the proposed algorithm, defined as in equation 3:

$$\text{Precision} = \frac{|T_h \cap T_r|}{|T_r|} \qquad (3)$$


3)  F-measure:  Another evaluation metric, obtained from precision (P) and recall (R) as in equation 4:

$$F = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R} \qquad (4)$$

This metric expresses the trade-off between recall and precision as a single value.
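The three metrics can be sketched with set operations, using the usual information-retrieval definitions of recall and precision as overlaps between the expert's cluster and the system's cluster. The two example clusters below are hypothetical and are not taken from the CTI dataset; both sets are assumed to be non-empty.

```python
def recall(expert, system):
    """Share of the expert's cluster members that the system also found."""
    return len(expert & system) / len(expert)


def precision(expert, system):
    """Share of the system's cluster members that the expert confirms."""
    return len(expert & system) / len(system)


def f_measure(p, r):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical clusters, for illustration only.
expert_cluster = {1, 2, 3, 4, 5}
system_cluster = {2, 3, 4, 5, 6, 7}

r = recall(expert_cluster, system_cluster)
p = precision(expert_cluster, system_cluster)
f = f_measure(p, r)
print(round(r, 4), round(p, 4), round(f, 4))
```

Here the two clusters share four members, so recall is 4/5, precision is 4/6, and the F-measure is 8/11.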

V.  ANALYSIS OF MINED SOCIAL NETWORK

In the present study, the CTI dataset was used to create and analyze the social network. The system used in the tests had an Intel Pentium T4400 processor at 2.2 GHz, 3.0 GB of RAM, a 320 GB hard disk and the Windows XP SP3 operating system. The proposed system was implemented in the VB.Net programming language from Microsoft Visual Studio .NET 2008, together with Microsoft SQL Server 2005 and the Pajek software [23]. A social network is a representation of the relationships, or arcs, between actors or persons. It is a graph G = (V, E), where V is a set of vertices, each representing a person, and E is a set of edges, each representing a relationship between persons. Figures 3 and 4 show the site structure obtained in step 1 of the NMLC-SN architecture.
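The graph definition G = (V, E) given above can be illustrated with a small sketch that derives vertices and edges from user clusters. The edge rule used here, connecting two users whenever they share at least one cluster, is an assumption made for illustration; this section does not spell out how the system forms edges. The function name is likewise hypothetical.

```python
from itertools import combinations


def build_graph(clusters):
    """Return (V, E) where users sharing a cluster are joined by an edge.

    Edges are undirected and stored as (smaller_id, larger_id) pairs.
    """
    vertices = set()
    edges = set()
    for users in clusters.values():
        vertices.update(users)
        for u, v in combinations(sorted(users), 2):
            edges.add((u, v))
    return vertices, edges


# Clusters 4 and 5 of Table III.
clusters = {4: [7, 8, 9], 5: [4, 5, 6]}
V, E = build_graph(clusters)
print(sorted(V))
print(sorted(E))
```

Storing each edge with the smaller user id first avoids counting the same undirected relationship twice when two users share more than one cluster.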

Figure 3. Site structure without representation of leaf pages

Figure 4. Representation of a user’s movement path from root to leaf

At the end of step 1, the height of the tree obtained from the site structure is 8. In Figure 3, because of the large number of leaves, the site structure is shown without its leaf pages. Figure 4 shows one user's movement path together with information about the pages on it. We set α to 3 and β to 5. Figure 5 shows the influence of the α and β parameters: the lower the values of these parameters, the fewer pages are deleted. We set µ to 1. Figure 6 shows the effect of the µ parameter: a low value of this parameter deletes fewer records and users from the dataset, but pages in which the user is not interested may then not be deleted correctly.


Figure 5. Effects of α and β parameters

Figure 6. Effects of µ parameter

By running the system on the introduced dataset with the designated value of each parameter, we obtained 2417 clusters. Users were clustered using both kinds of cluster weighting. Figure 7 shows 13 of the extracted clusters together with the users located in them. The numbers shown in Figure 7 are the codes of the users of a given cluster.

Figure 7. A part of extracted social network 

Figure 8 shows the effect of the threshold parameter on user clustering when clusters are weighted by the number of cluster users. Figure 9 shows the effect of the same parameter when clusters are weighted by the total number of visits by cluster users.

Figure 8. Effect of the threshold parameter on user clustering when clusters are weighted by the number of cluster users [vertical axis: number of clusters, 0 to 2500]

[Figure 5 plot: number of pages deleted, 0 to 2000, for α = 1, 2, 3 and β = 5, 6, 7, 8]

[Figure 6 plot: number of users deleted and number of records deleted, 0 to 6000, for µ = 1, 2, 3, 4, 5]


Figure 9. Effect of the threshold parameter on user clustering when clusters are weighted by the total number of visits by cluster users

By running the compression step of the system on the defined dataset, with the designated value of each parameter and the threshold parameter set to zero, we obtained 1491 clusters.

To evaluate the system with the metrics defined in the previous section, 150 users were selected at random with equal probability. The relationships between these users were extracted by an expert. In the first test, the recall obtained was 0.8236, the precision was 0.9375 and the F-measure was 0.8769. In the second test, the group size was set to 6, which is the minimum size of the extracted groups. The results of this test are shown in Figure 10.
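As a consistency check on the first test, substituting the reported precision and recall into the standard F-measure formula (the harmonic mean of precision and recall) reproduces the reported value:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.9375 \times 0.8236}{0.9375 + 0.8236}
  = \frac{1.54425}{1.7611}
  \approx 0.8769
```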

Figure 10. Results of second test’s recall and precision

In the third test, the group size was set to the maximum size of the extracted groups. The results of this test are shown in Figure 11.

Figure 11. Results of third test’s recall and precision

Compared with other methods, the benefits of the proposed system include acceptable speed and memory consumption when building the network, better efficiency because it does not rely on frequent closed pattern mining or association rule mining, extraction of different social networks according to the user's needs, and the ability to mine social networks from any site.

VI.  CONCLUSIONS AND FUTURE WORKS 

In the present study, we introduced the current challenges in social network mining and developed a system to address them. In this system, the importance of each page is derived from the average time every user spends on it. In this way, we attempted to remove false positive associations and traits. We offered a comprehensive system for social network mining that addresses several of the problems found in social networks. The system has the capacity to mine a social network from any site. Even when web pages record false information, the system can construct a user's virtual community from the web server's log file, because it relies on Web Usage Mining techniques. It can provide different social networks for political and military applications, to discover information about given sections with the required activity. In e-learning applications, it tries to identify the most active teachers and the courses that interest students. In electronic commerce sites, it can be used to make advertisements and product recommendation systems better targeted. When it uses information from a web portal's server log file, it can discover sections with high activity and help the organization's manager improve them by offering recommendations. Based on the location and geographical distribution of users, the system could also suggest a server location to improve server efficiency. The proposed system extracts information and draws social networks at an acceptable speed. To improve the system, semantic web techniques may be used to extract the relationships between pages and their relations to users.

REFERENCES

[1]  R. Cooley, B. Mobasher and J. Srivastava, “Web Mining Information and Pattern Discovery on the World Wide Web”, Information Gathering from Heterogeneous Distributed Environments, December 2001.

[2]  Mobasher, B., Cooley, R., & Srivastava, J. “Creating adaptive Web sites through usage-based clustering of URLs”. Paper presented at the knowledge and data engineering exchange, Chicago, IL, USA (pp. 19–25), 1999.

[3]  Mehrdad Jalali, Norwati Mustapha, Md. Nasir Sulaiman, Ali Mamat, “WebPUM: A Web-based recommendation system to predict user future movements”, Expert Systems with Applications 37 (2010) 6201–6212.

[4]  Cadez, I., Heckerman, D., Meek, C., Smyth, P., & White, S. “Visualization of navigation patterns on a Web site using model-based clustering”. Paper presented at the proceedings of the sixth ACM SIGKDD international conference on data mining and knowledge discovery, Boston, Massachusetts, United States, 2000.

[5]  Toyoda, M., Kitsuregawa, M., “Creating a Web Community Chart for Navigating Related Communities”, In Proc. of Hypertext 2001, pp. 103-112, 2001.

[6]  Gibson, D., Kleinberg, J. M., Raghavan, P., “Inferring Web Communities from Link Topology”, In Proc. of the 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, pp. 225-234, 1998.

[7]  Kleinberg, J., “Authoritative Sources in a Hyper-linked Environment”, Proc. of ACM-SIAM Symposium on Discrete Algorithms, 1998. Also appears as IBM Research Report RJ 10076(91892), May 1997.


[8]  H. Kautz, B. Selman, and M. Shah. “The hidden Web”. AI Magazine, 18(2):27–35, 1997.

[9]  Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A., “Trawling the Web for Emerging Cyber-Communities”, Proc. of the 8th WWW Conference, 1999.

[10]  Imafuji, N., Kitsuregawa, M., “Effects of Maximum Flow Algorithm on Identifying Web Community”, Proc. of 4th international workshop on web information and data management. ACM Press, NY, pp. 43-48, 2002.

[11]  Flake, G., Lawrence, S., Giles, C.L., “Efficient Identification of Web Communities”, 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston, MA, pp. 150-160, 2000.

[12]  Flake, G. W., Lawrence, S., Giles, C. L., Coetzee, F. M., “Self-Organization & Identification of Web Communities”, IEEE Computer, Vol. 35, No. 3, pp. 66-71, 2002.

[13]  P. Mika. “Flink: Semantic web technology for the extraction and analysis of social networks”. Journal of Web Semantics, 3(2), 2005.

[14]  P. Cimiano, S. Handschuh, and S. Staab. “Towards the self-annotating web”. In Proc. WWW 2004, pp. 462–471, 2004.

[15]  P. Cimiano, G. Ladwig, and S. Staab. “Gimme’ the context: Context-driven automatic semantic annotation with C-PANKOW”. In Proc. WWW 2005, 2005.

[16]  Muhaimenul Adnan, Mohamad Nagi, Keivan Kianmehr, Radwan Tahboub, Mick Ridley, Jon Rokne, “Promoting where, when and what? An analysis of web logs by integrating data mining and social network techniques to guide ecommerce business promotions”, Springer-Verlag 2010 SOCNE DOI 10.1007/s13278-010-0015-3.

[17]  R. Cooley, B. Mobasher and J. Srivastava, “Data Preparation for Mining World Wide Web Browsing Patterns”, Knowledge and Information Systems, 1:1, 5-32, 1999.

[18]  D. Pierrakos, G. Paliouras, C. Papatheodorou and C. D. Spyropoulos, “Web Usage Mining as a Tool for Personalization: A Survey”, User Modeling and User-Adapted Interaction, 13: 311-372, 2003.

[19]  Spiliopoulou, M., Mobasher, B., Berendt, B., & Nakagawa, M. “A Framework for the evaluation of session reconstruction heuristics in Web-usage analysis”. INFORMS Journal on Computing, 15(2), 171–190, 2003.

[20]  M. Spiliopoulou, L. C. Faulstich and K. Wilker, “A Data Miner Analyzing the Navigational Behavior of Users”, Proceedings of the Workshop on Machine Learning in User Modeling of the ACAI99, Chania, Greece, 54-64, 1999.

[21]  Symeonidis, P., Nanopoulos, A., Manolopoulos, Y., “A Unified Framework for Providing Recommendations in Social Tagging Systems Based on Ternary Semantic Analysis”, IEEE Transactions on Knowledge and Data Engineering, Volume: 22 Issue: 2, 179–192, 2010.

[22]  Ian H. Witten, Eibe Frank, “Data Mining Practical Machine Learning Tools and Techniques, Second Edition”, 2005.

[23]  Batagelj V, Mrvar A (1998) Pajek - program for large network analysis. http://vlado.fmf.uni-lj.si/pub/networks/pajek/
