[ieee 2011 7th international conference on emerging technologies (icet) - islamabad, pakistan...


Activity-based Correlation of Personal Documents and their Visualization Using Association Rule Mining

Zafar Saeed Department of Computer Science

Quaid-i-Azam University Islamabad, Pakistan


Abida Sadaf Institute of Information Technology

Quaid-i-Azam University Islamabad, Pakistan


Siraj Muhammad Department of Computer Science

Shaheed Benazir Bhutto University Sheringal, Pakistan


Abstract—Personal information is increasingly difficult to manage: users often forget the material they have copied to their personal systems, so finding the relevant information later in a huge repository becomes very difficult. We introduce a method in which the user's document-reading activities are captured from the running process list and stored in a dataset along with access times. Frequent itemsets and association weights between each document and the others are then calculated using the Apriori algorithm and the confidence measure in conjunction with combined access time. When the user searches for a document, the result list is produced by any conventional retrieval model; we have used primary metadata, including title, author, and type, for document searching. In addition, a visual interface displays the documents correlated with a selected document on the basis of the user's activities, which may help users to identify documents according to their past activities.

Keywords-Personal Information Management; Association Rule Mining; Correlation of documents

I. INTRODUCTION

The phrase “Personal Information Management” came into use in the 1980s in the context of managing digital information on computer systems, where information tends to accumulate faster than humans are able to absorb it [11]. Personal information management describes the user's activity in terms of creating, organizing, searching, and managing personal data [9]. The idea of personal information management is not new; in the past, people managed information manually on paper. As the amount of information increased over time, people designed their own methods for storing and archiving it, including vertical file cabinets, file tags, and sorted file archives. When information grows rapidly, its management and retrieval become a challenging task [11].

Today's world is developing rapidly, so every person has information needs that must be met in a time-effective manner. People want to learn and obtain the appropriate information efficiently to save time. People normally collect information, but when the opportunity to reuse it arises, it becomes very hard for them to select the exact information from a huge repository [3].

Tools and technologies help us minimize the time needed for information management tasks such as filing or e-mail management. With the help of these tools we may have more time to make creative and intelligent use of the information in order to get things done. Personal information management helps people in different ways. For example, a patient with a disease such as cancer may need to go through different phases of medical treatment. The doctor must therefore maintain the patient's treatment history, so that less time is required to identify the ongoing treatment for that patient.

So far, little work has been done on inter-document task-based correlation compared to conventional methods of task management. Storage and organization of personal information are important, but the user is more interested in identifying and locating relevant documents whenever needed. When a user searches for documents on a personal machine, he/she should be provided with correlated documents not only on the basis of informational content but also on the basis of tasks performed earlier. The pattern of document access highlights the user's interests in a specific time interval. If frequent patterns are identified together with time intervals, they may help users to locate correlated documents according to their interests. In this paper we have captured the daily activities of a user, which show how the user accesses documents in order to perform his/her tasks. We introduce a technique for finding correlation among documents using association rule mining.

Association rule mining helps in identifying the common patterns within an item list; in our case the item list consists of the documents opened by the user. These frequent patterns within the user's document access log show the correlation among the documents. We have calculated association weights using the confidence measure in conjunction with the total access time of each document with the others.

This paper is organized as follows: Section 2 discusses the related work, Section 3 describes association rule mining, Section 4 describes the experimental setup, Section 5 describes the results and analysis, and Section 6 discusses the conclusion and future work.

978-1-4577-0768-1/11/$26.00 ©2011 IEEE


II. RELATED WORK

In 1983 Tom Malone [12] conducted a survey study of how people usually manage their office documents and identified two main strategies: neat and messy. Using the neat strategy, the user categorizes his information and places each document according to its category. In the messy strategy, the documents are placed in a less structured way. William Jones et al. [10] conducted a survey which describes how people arrange their personal projects. The authors concluded that people already create planning documents, sometimes a simple “to do list” and sometimes an elaborate outline. On the basis of the survey the authors designed a tool called Universal Labeler, whose project planner module supports this work. Similar work was done by Bergman et al. [2], who performed a survey on personal information management. The purpose of the survey was to test users' working habits empirically in order to overcome the fragmentation problem. On the basis of the survey they suggested a method called the single hierarchy solution, in which all information related to a project is stored in a single folder regardless of the format of the related files/documents. The authors designed a tool for the single hierarchy, named Project Folders. In their system all project-related items, including Word documents, Excel documents, HTML documents, and e-mail documents, are stored together regardless of their formats, but are separated by tabs when displayed.

In 2006 Xinlong Bao et al. [1] presented a report on the tool FolderPredictor, which helps reduce the cost of locating a file in hierarchical folders. FolderPredictor applies a cost-sensitive prediction algorithm to the user's previous file access information to predict the next folder that will be accessed. Experimental results show that FolderPredictor reduces the cost of locating a file by 50% on average. In 2006 Edward Cutrell et al. [5] described the design and implementation of a tool called Phlat that optimizes search in personal information and provides an interface that merges searching and browsing activities. The tool also supports a labeling scheme for organizing personal content in the storage system.

A survey was conducted by Tristan Blanc-Brude et al. [3]. The survey concentrated on the features that users rely on to retrieve their documents. Its main aim was to improve the tools that help users retrieve related documents. The authors suggested that the attributes that are most often and/or most precisely recalled, namely location, file type or document format, time of last usage, keywords, associated events, and visual elements, should be given priority in retrieval tools.

In 2007 David Elsweiler et al. [7] examined what types of tasks require information to be re-retrieved. On the basis of these tasks the authors proposed a task-based evaluation methodology and examined the feasibility of the approach.

Sara Cohen et al. [4] addressed the desktop search problem and considered ranking based on the cos and sin distances between the tf.idf vectors of the query and of the content, path, and name of the file, respectively. In this technique the query is a set of words, and documents are ranked when at least one query word appears in the file name, content, or path of the document. The authors used two other methods, SVM and selectiveness of features, which show better results than the basic ranking methods.

In 2005 Bruno Possas et al. [13] suggested a new weighting scheme for correlating index terms in the vector space model. In this weighting scheme, the authors used association rule mining to calculate the weights of terms instead of the inverse document frequency weighting scheme. They concluded that the new weighting scheme works efficiently and improves the quality of results.

Despite the popularity of search engines, the retrieval process is based on simple query-document matching and is carried out without the context of the user's interests and preferences. Daoud et al. [6] proposed an approach to personalization based on user session information. They believe that user session information can provide very useful data about the topics in which the user may be interested. They claim that their approach can distinguish between a user's short-term and long-term interests, which makes personalization more effective. Their proposed session measures rely on the topical closeness between the query concept representation and the user context, and personalization is achieved by re-ranking the search results using the relevant queries the user has entered and the context of the user.

Shen et al. [14] emphasize the use of personal search history, because it is a very important type of personal information from which we can learn about a user's interests and information needs, thereby improving both the search service and the search accuracy. They developed an intelligent client-side web search engine that automatically extracts the user's personal search history and stores it on the local disk to use it for personalization. As future work they proposed using other personal information stored on the user's PC. Through an initial study they also found that a group of users, such as peers in a research group, often share similar information needs, so using this data in the future could help to improve the retrieval accuracy of search results.

Sun et al. [15] proposed a personalization technique based on log file analysis. They provided personalized ranking of search results, which can be described as reordering documents by the similarity score between documents and user preference vectors. The evaluation was done on a dataset extracted from later access logs and compared with other non-personalized ranking methods. The experiments show that their proposed method significantly improves the ranking accuracy on redirecting user actions. A limitation of their research is that the personalized model was tested on one particular website and its generalization was not tested. A generalized study would therefore be very useful to see the difference with other websites.

TABLE I. TRANSACTION DATA CONTAINING VARIOUS ITEMS

Transaction ID    Items in Transaction
T1                I1, I2, I5
T2                I2, I5
T3                I2, I3
T4                I1, I2, I3
T5                I1, I3
T6                I2, I3
T7                I1, I3
T8                I1, I2, I3, I5
T9                I1, I2, I3

The related work shows that previous researchers have used different mechanisms and data to provide personalization of user information [16]. Previous research does not incorporate time as a factor or basis for personal information management. We propose that time can be a factor in identifying the related documents that a user may be accessing, on the basis of their relevancy and the user's activity. For example, if a user has opened two files at the same time and has not closed either of them, they are likely to be related; if the user has closed one of them, they are likely to be less relevant to each other. We have used the association rule mining technique to uncover related patterns among the documents that the user has accessed at the same time.

III. ASSOCIATION RULE MINING

Data mining is very useful in uncovering hidden patterns from data. Different data mining techniques have been used in the past to uncover such useful information which can be used for future predictions also. Association Rule Mining (ARM) is one such technique [8].

Data mining techniques are especially helpful when a huge amount of data is involved. Association rule mining has been used in different fields, e.g. to improve the decision-making process for business-related data and to enhance organizational productivity in telecommunication networks (by finding customers and their associated services and enhancing those services). This data mining technique has also been applied to software engineering data to discover various patterns. The ARM approach is data driven, and association rules may be found when the data is either transactional or relational. ARM was traditionally used to discover interesting association relationships among business transactions that consequently help in the business decision-making process. Market Basket Analysis is an example of association rule mining [8]. Association rule mining finds relationships among items that are used together. As an example, consider a transactional database that contains different items, as shown in Table 1. In our research we have used metadata about how the user performs his/her tasks. Association rule mining may therefore be helpful in uncovering useful rules and associations among the documents that are related or relevant to the task the user is going to perform.
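As a rough illustration of what "items used together" means for the data in Table 1, the following minimal sketch counts how often each pair of items appears in the same transaction and keeps the pairs that reach a minimum count. The function name `frequent_itemsets` and the threshold of 2 are illustrative assumptions, and the counting is done by brute force rather than with Apriori's candidate pruning.

```python
from collections import Counter
from itertools import combinations

# Transactions copied from Table 1.
TRANSACTIONS = {
    "T1": {"I1", "I2", "I5"},
    "T2": {"I2", "I5"},
    "T3": {"I2", "I3"},
    "T4": {"I1", "I2", "I3"},
    "T5": {"I1", "I3"},
    "T6": {"I2", "I3"},
    "T7": {"I1", "I3"},
    "T8": {"I1", "I2", "I3", "I5"},
    "T9": {"I1", "I2", "I3"},
}


def frequent_itemsets(transactions, size, min_count):
    """Count every itemset of the given size and keep those seen at least min_count times."""
    counts = Counter()
    for items in transactions.values():
        # sorted() gives each itemset one canonical tuple form.
        for combo in combinations(sorted(items), size):
            counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_count}


# Illustrative threshold (an assumption for this sketch).
pairs = frequent_itemsets(TRANSACTIONS, size=2, min_count=2)
for pair, count in sorted(pairs.items()):
    print(pair, count)
```

With this threshold, the pair (I3, I5) is dropped because it occurs in only one transaction (T8), while the other five pairs are retained.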

ARM is briefly explained below to show how we used it to find associations among items. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. The attribute values form an itemset. Let $D$ be a database of transactions, where each transaction $T$ in $D$ is a set of items such that $T \subseteq I$. An association rule is an expression of the form $A \Rightarrow B$, where $A$ and $B$ are sets of items that belong to $I$. To further illustrate the concept of association rule mining, consider the example of a telecom company. Each set of services (items) used by a customer is called a “transaction”. The transactions made by customers are shown in Table 2. The manager is interested in knowing which sets of services are used together, in order to better manage the services and customers. Considering the transactions in Table 2, association rule mining leads to the rule sms-service $\Rightarrow$ call-service, which states that customers who use sms-service also use call-service. There may be other rules, but not all of them may be of interest to the users. For example, the rule sms-service $\Rightarrow$ GPRS-service is also formed, but the association between sms-service and GPRS-service is not as interesting as that between sms-service and call-service. Rules that do not meet a minimum threshold are considered to be uninteresting [8].

TABLE II. EXAMPLE TO ILLUSTRATE ASSOCIATION RULE MINING

Transactions    Items
T1              sms-service, call-service
T2              sms-service
T3              GPRS-service, call-service, sms-service
T4              Other, GPRS-service, call-service, sms-service

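Two basic measures for association rules are “support” and “confidence”, which reflect the usefulness and certainty of the discovered rules, respectively [8]. These two measures are defined as:

$$\text{support}(A \Rightarrow B) = \frac{\text{No. of transactions containing } A \text{ and } B}{\text{Total no. of transactions}} \qquad \text{(a)}$$

$$\text{confidence}(A \Rightarrow B) = \frac{\text{No. of transactions containing } A \text{ and } B}{\text{No. of transactions containing } A} \qquad \text{(b)}$$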

In Table 2, the rule sms-service $\Rightarrow$ call-service has a support of $3/4 = 75\%$ and a confidence of $3/4 = 75\%$, and the rule call-service $\Rightarrow$ GPRS-service has a support of $2/4 = 50\%$ and a confidence of $2/3 = 66.7\%$. The calculated values show that the rule sms-service $\Rightarrow$ call-service is stronger (i.e. more interesting) than call-service $\Rightarrow$ GPRS-service.
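The support and confidence values quoted for Table 2 can be checked with a few lines of code. The sketch below is only an illustration of measures (a) and (b); the function names `support` and `confidence` are chosen here for readability and are not taken from the prototype.

```python
# Transactions copied from Table 2.
TRANSACTIONS = [
    {"sms-service", "call-service"},                           # T1
    {"sms-service"},                                           # T2
    {"GPRS-service", "call-service", "sms-service"},           # T3
    {"Other", "GPRS-service", "call-service", "sms-service"},  # T4
]


def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain both sides of the rule."""
    both = antecedent | consequent
    return sum(1 for t in transactions if both <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = antecedent | consequent
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_antecedent if n_antecedent else 0.0


sms, call, gprs = {"sms-service"}, {"call-service"}, {"GPRS-service"}
print(support(sms, call, TRANSACTIONS), confidence(sms, call, TRANSACTIONS))
print(support(call, gprs, TRANSACTIONS), round(confidence(call, gprs, TRANSACTIONS), 3))
```

The first rule yields 0.75 for both measures, and the second yields a support of 0.5 and a confidence of about 0.667, matching the percentages given in the text.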


Association rule mining may uncover thousands of rules that are uninteresting to the user, so constraint-based association mining may be performed to restrict the rules. The constraints may be knowledge type constraints, data constraints, dimension constraints, interestingness constraints (i.e. thresholds on the support and confidence measures), or rule constraints [8].

Rule constraints specify the form of the rules to be mined and are expressed as metarules. For example, we may want to associate customers' different characteristics with the use of sms-service. The metarule has the form P(X, Y) $\Rightarrow$ uses(X, “sms-service”), where X is a variable representing a customer and Y is the value of the attribute assigned to predicate P. The data mining system can then search for rules that match the given metarule. For instance, the rule location(X, “Islamabad”) $\Rightarrow$ uses(X, “sms-service”) matches the above metarule, so a metarule may help to form a hypothesis regarding the relationships that may be of interest to users.

In interestingness constraints, the user specifies a threshold value on interestingness measures such as support and confidence. For example, the user may be interested in finding rules whose confidence value is greater than 50%. The data mining system then searches for rules that meet these conditions.
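To make the idea of an interestingness constraint concrete, the following sketch enumerates all single-item rules over the Table 2 transactions and keeps only those whose confidence is strictly greater than 50%. The helper `mine_rules` is a hypothetical name, and restricting the search to one item on each side of the rule is a simplifying assumption of this example.

```python
from itertools import permutations

# Transactions copied from Table 2.
TRANSACTIONS = [
    {"sms-service", "call-service"},
    {"sms-service"},
    {"GPRS-service", "call-service", "sms-service"},
    {"Other", "GPRS-service", "call-service", "sms-service"},
]


def mine_rules(transactions, min_confidence):
    """Return {(a, b): confidence} for rules a => b with confidence above the threshold."""
    items = sorted(set().union(*transactions))
    rules = {}
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in transactions if a in t)
        n_ab = sum(1 for t in transactions if a in t and b in t)
        conf = n_ab / n_a  # n_a > 0 because every item occurs at least once
        if conf > min_confidence:
            rules[(a, b)] = conf
    return rules


rules = mine_rules(TRANSACTIONS, min_confidence=0.5)
for (a, b), conf in sorted(rules.items()):
    print(f"{a} => {b}: {conf:.2f}")
```

Under this threshold the rule sms-service => GPRS-service is discarded, since its confidence is exactly 50%, whereas sms-service => call-service (75%) is kept.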

The documents that a user accesses at a particular time to perform a specific task may be heterogeneous in nature. Semantically, those documents may not be directly related to each other, but the user may need to recall the same documents whenever he/she reviews his/her past activities. In this case, task-based correlation may help to identify the appropriate documents even if they are semantically heterogeneous. For example, when a bioinformatics student reads a scientific document about gene sequences and transcription factors, he/she may need to go through a certain string matching algorithm. For this purpose the user needs to read documents about pattern matching algorithms, which may not be directly related to the user's ongoing task. In the future the user may need to review the same documents again to refresh his/her knowledge, so activity-oriented association among personal documents may help users to identify their documents on the basis of their previous activities.

IV. EXPERIMENTAL SETUP

Our work finds correlation among documents using the association rule mining technique.

TABLE III. FILTERED DATASET AGAINST D1, D2

TID     UID   Doc-List                S-Time    E-Time    Date
T100    1     D9, D18, D1, D2         11:12AM   11:23AM   11.02.2005
T200    1     D9, D18, D97, D1        11:23AM   11:35AM   11.02.2005
T300    1     D18, D97, D1, D2        11:35AM   11:36AM   11.02.2005
T400    1     D18, D97, D1, D2, D5    11:36AM   12:45AM   11.02.2005
…       …     …                       …         …         …
…       …     …                       …         …         …

The following steps are performed by our functional prototype to find associations among documents. Our assumption is that when users perform reading/writing activities on their personal computers, they open related documents that they require in order to perform a certain task. A task can be a reading assignment, report writing, or any research activity. There are therefore relationships among the items (documents) that are usually open at the same time. Initially we focus only on documents of type .doc, .pdf, .ppt, .xls, and .rtf for finding correlation on the basis of their occurrence with each other. To achieve this goal we adopted the following steps.

Step 1: A volunteer postgraduate student was recruited for the study. We developed an application that grabs information from the user's process list and installed it on the participant's computer. Whenever the user opens a document, the Windows process manager maintains its metadata at the system level. Our prototype captures this information and creates a transaction, along with the time, whenever a document is opened or a previously opened document is closed. The time is used later for calculating the weight, a measure used for finding the associative ranking among documents and for creating the visualization. Our application grabs information for certain document types such as doc, ppt, xls, rtf, txt, and pdf.
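The way transactions arise from open and close events can be sketched as follows. This is not the prototype's capture code: the event stream `EVENTS` is hypothetical (chosen so that the output resembles the first rows of Table 3), and `build_transactions` is an assumed helper that emits one transaction for each interval during which the set of open documents stays unchanged.

```python
from datetime import time

# Hypothetical open/close events; a real capture tool would derive these
# by polling the operating system's process list.
EVENTS = [
    (time(11, 12), "open", "D9"),
    (time(11, 12), "open", "D18"),
    (time(11, 12), "open", "D1"),
    (time(11, 12), "open", "D2"),
    (time(11, 23), "close", "D2"),
    (time(11, 23), "open", "D97"),
    (time(11, 35), "close", "D9"),
    (time(11, 35), "open", "D2"),
    (time(11, 36), "open", "D5"),
]


def build_transactions(events):
    """Turn timestamped open/close events into (documents, start, end) transactions."""
    by_time = {}
    for stamp, action, doc in events:
        by_time.setdefault(stamp, []).append((action, doc))

    open_docs = set()
    previous = None
    transactions = []
    for stamp in sorted(by_time):
        # Close the interval that ends at this change point.
        if previous is not None and open_docs:
            transactions.append((frozenset(open_docs), previous, stamp))
        # Apply every change recorded at this timestamp.
        for action, doc in by_time[stamp]:
            if action == "open":
                open_docs.add(doc)
            else:
                open_docs.discard(doc)
        previous = stamp
    return transactions


for docs, start, end in build_transactions(EVENTS):
    print(sorted(docs), start.strftime("%H:%M"), end.strftime("%H:%M"))
```

The interval that begins at the last event is left open in this sketch, because no later event marks its end time.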

Step 2: The log dataset contains five attributes: transaction ID, the list of documents accessed at the same time, starting time of the transaction, ending time of the transaction, and date. A sample of the dataset is shown in Table 3. Because the processes running on the user's machine change periodically as the user's activity/task changes, we obtain a rich dataset, which is further processed in the next step.

Step 3: The dataset is pre-processed and transactions with only one item (document) are removed. Frequent 2-itemsets are then found using the Apriori algorithm with a support count threshold of 2, as demonstrated in the following example.

$F \subseteq D$, where $F$ contains all frequent 2-itemsets and $D$ is the dataset.

$F_k \in F$, with $F_1 = \{D18, D97\}$, $F_2 = \{D2, D18\}$, $F_3 = \{D1, D2\}$, and so on.

$D_i$ is a single document that belongs to $F$. To find the association weights for document $D_i$, all 2-itemsets are taken in which one item is the document whose association weights are to be calculated. For example, take these 2-itemsets as $S$, where $S \subseteq F$ and each set $S_j \in S$ contains $D_i$; for $D_i = D1$, $S = \{\{D1, D2\}, \{D1, D18\}, \{D1, D97\}\}$. Then, for each 2-itemset in $S$, the dataset is filtered for calculating the association weights, as shown in Table 3. To calculate the association weight $w_j$ for $S_j$, the total access time $t_j$ is used in conjunction with the confidence $c_j$, where $t_j$ is the access time of each 2-itemset in $S$, summed over the filtered transactions:

$$t_j = \sum \big( (\text{E-Time}) - (\text{S-Time}) \big) \qquad (1)$$

$c_j$ is calculated using the formula

$$c_j = P(D_j \mid D_i) = \frac{\text{support\_count}(D_i \cup D_j)}{\text{support\_count}(D_i)} \qquad (2)$$

$w_j$ is the association weight, calculated by multiplying the confidence $c_j$ by the total access time $t_j$ expressed in seconds:

$$w_j = c_j \cdot (t_j.h \cdot 3600 + t_j.m \cdot 60 + t_j.s) \qquad (3)$$

The weight vector $W$, which contains the weights of all other documents in $S$ for $D_i$, is normalized as follows:

$$w'_j = \frac{w_j - \min(W)}{\max(W) - \min(W)} \, (100 - 1) + 1 \qquad (4)$$

Using the weight vector the documents are ranked and visualization is created.
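The weighting and ranking step can be sketched end to end as follows. The sketch assumes that the association weight of a pair is the confidence of the pair multiplied by the pair's combined co-access time in seconds, and that the weights are then rescaled to the range 1 to 100 by min-max normalization. The function names are hypothetical, and only the first three rows of Table 3 are used as input, because the printed end time of T400 (12:45AM) is ambiguous.

```python
def to_seconds(hours, minutes, seconds=0):
    """Convert a clock time to seconds since midnight."""
    return hours * 3600 + minutes * 60 + seconds


# (documents, start, end) for T100, T200 and T300 of Table 3, as 24-hour (h, m) pairs.
LOG = [
    ({"D9", "D18", "D1", "D2"}, (11, 12), (11, 23)),
    ({"D9", "D18", "D97", "D1"}, (11, 23), (11, 35)),
    ({"D18", "D97", "D1", "D2"}, (11, 35), (11, 36)),
]


def association_weights(log, target, min_count=2):
    """Weight of each co-accessed document: confidence * total co-access time (seconds)."""
    n_target = sum(1 for docs, _, _ in log if target in docs)
    co_count, co_time = {}, {}
    for docs, start, end in log:
        if target not in docs:
            continue
        duration = to_seconds(*end) - to_seconds(*start)
        for other in docs - {target}:
            co_count[other] = co_count.get(other, 0) + 1
            co_time[other] = co_time.get(other, 0) + duration
    weights = {}
    for other, count in co_count.items():
        if count >= min_count:  # keep frequent 2-itemsets only
            weights[other] = (count / n_target) * co_time[other]
    return weights


def normalize(weights, low=1.0, high=100.0):
    """Min-max rescaling of the weights to the interval [low, high]."""
    if not weights:
        return {}
    smallest, largest = min(weights.values()), max(weights.values())
    if largest == smallest:
        return {doc: high for doc in weights}
    scale = (high - low) / (largest - smallest)
    return {doc: (w - smallest) * scale + low for doc, w in weights.items()}


weights = association_weights(LOG, target="D1")
normalized = normalize(weights)
ranking = sorted(normalized, key=normalized.get, reverse=True)
print(ranking)
```

For this small log, D18 is open together with D1 in every transaction and for the longest combined time, so it receives the top rank, while D2 shares the least time with D1 and is ranked last.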

When the user issues a query, documents are retrieved using the vector space model or metadata including userID, title, type, size, and access date.

Each retrieved document can further be visualized to find its correlated documents. Documents frequently accessed together with $D_i$ are listed and ranked according to the calculated weights.

A visualization is created for presenting correlated documents. The document that is searched for and selected is placed in the center, and the correlated documents are arranged around it in a spider view. Correlated documents with higher association weights are placed closer to the center and those with lower weights farther away, as shown in Figure 1. This visual presentation helps the user to identify not only the correlated documents but also which of them are more strongly associated on the basis of past activities.

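For plotting the graph against $D_i$, the following formulas are used:

$$\theta_j = \theta_{j-1} + \frac{2\pi}{|S|} \qquad (5)$$

$$x_j = r_j \cos(\theta_j) \qquad (6)$$

$$y_j = r_j \sin(\theta_j) \qquad (7)$$

Here $\theta_j$ is the angular position of the $j$-th correlated document and $r_j$ is its distance from the center, which depends on its association weight. Equation 5 is used to place the correlated documents at equal angular distance across 360 degrees, whereas Equations 6 and 7 are used to express the strength of correlation in the visual presentation.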
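A spider-view layout of this kind can be sketched in a few lines. The sketch assumes normalized weights in the range 1 to 100 and maps a higher weight linearly to a smaller radius; the linear mapping, the `min_radius` and `max_radius` values, and the function name `spider_layout` are illustrative assumptions rather than details taken from the prototype.

```python
import math


def spider_layout(normalized_weights, min_radius=60.0, max_radius=300.0):
    """Place documents at equal angular steps; heavier weights sit closer to the center."""
    docs = sorted(normalized_weights, key=normalized_weights.get, reverse=True)
    positions = {}
    for k, doc in enumerate(docs):
        theta = 2 * math.pi * k / len(docs)  # equal spacing over the full circle
        closeness = (normalized_weights[doc] - 1.0) / 99.0  # 0 = weakest, 1 = strongest
        radius = max_radius - closeness * (max_radius - min_radius)
        positions[doc] = (radius * math.cos(theta), radius * math.sin(theta))
    return positions


# Example weights on the 1..100 scale (hypothetical values).
example = {"D18": 100.0, "D9": 46.375, "D97": 5.125, "D2": 1.0}
layout = spider_layout(example)
for doc, (x, y) in layout.items():
    print(doc, round(x, 1), round(y, 1))
```

The selected document itself would be drawn at the origin, and the coordinates returned here would be offset by the center of the drawing area.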

V. RESULTS AND ANALYSIS

Our experiments show that a total of 351 documents were accessed by the user during a period of 131 days. We found a total of 2811 2-itemsets at a support count of 2. Table 4 shows some generalized results of association rule mining. On average, the user can find 11 documents related to a particular task.

TABLE IV. RESULTS OF ASSOCIATION RULE MINING AT THE DEVELOPER END

Total 2-itemsets with support count 2         2811
Documents with highest support count          267
Average support count of each document        11.14

Figure 1 shows the interface of the prototype system that has been developed, containing the spider view of the documents related to the document in the center. The visualization helps the user to instantly identify the documents that are more closely related to his/her work.

In the user's reading activity, the average number of association links for each document was approximately 11 on the basis of co-occurrence. When the user searches for a document, its correlated documents according to the user's past activities are presented in a way that helps the user find them without the cognitive overload of memorizing which documents need to be reviewed along with the searched document.

FIGURE 1. VISUAL PRESENTATION OF CORRELATED DOCUMENTS AGAINST $D_i$

Our weighting scheme helps the user to decide on and identify the related documents that may be relevant to his/her tasks. This also reduces access time and will eventually help to reduce the user's overall task completion time. It also helps to minimize the extra load on the user when he is working on a task on which he has worked before, needs only the related documents within the available time, and wants to access the related files without wasting time on searching.

One advantage of our proposed visualization is that the user does not need to open a document every time, because the metadata associated with the document can be seen with just one click. The visualization of the results is easy to understand, as things are generally easier to perceive when presented visually rather than textually. Our prototype/interface saves the user from the tedious task of identifying documents himself on the basis of support count. However, to check the usability of the interface, it can be tested in the future with a sample of users with different levels of experience. This analysis does not show that documents are relevant on the basis of their contents. On the other hand, it provides an overall view of the user's activities based on the historical data of document usage. We believe that, alongside content-based personalization, task-based personalization is also very important, and more research should be done in this regard.

VI. CONCLUSION AND FUTURE WORK

Personal information management relates to users' activity in creating, organizing, searching, and managing their personal data. Since storage capacity is no longer an issue in today's world, users store huge amounts of data on their personal computers. As time passes, recollecting personal data becomes difficult for users, especially when they try to locate documents on the basis of their past activities. We performed an experiment on the personal data access log of a user, comprising 131 days of his activities.

We used the association rule mining technique with a custom weighting scheme to highlight the correlation among documents on the basis of user activity. Our results show that on average each document has approximately 11 correlated documents. This number is small enough that the user can easily identify the task-related documents whenever he/she searches for a particular document. The results show that each document tends to be accessed together with only a specific set of other documents. In the future, when the user searches for any document, he/she may therefore be able to retrieve the correlated documents through an effective representation, which would save the user's time in identifying the required documents. For evaluation, the prototype was given to the same user whose access log of the last 131 days had been captured, and he was asked to perform certain document searches. Feedback from the participant was largely in favor of the prototype: although the listed documents were not directly related on the basis of their contents, they were the documents he wanted to have when performing the same tasks on which he had worked earlier.

As future work, this experiment can be extended by combining content-based correlation with activity-based association in order to produce better results, which may help to retrieve the user's personal documents according to his/her own context and activities.

ACKNOWLEDGMENTS

We are very grateful to Dr. Onaiza Maqbool for her valuable suggestions and guidance throughout our research work. We are heartily thankful to Capt. Adil Javaid for participating in our research as volunteer. Capt. Adil has worked on the dataset used in this research work for experiment.

REFERENCES

[1] X. Bao, J. Herlocker, and T. Dietterich. Fewer clicks and less frustration: reducing the cost of reaching the right folder. In Proc. of 11th International Conference on Intelligent user Interfaces, pages 178 - 185. ACM, 2006.

[2] O. Bergman, R. Beyth-Marom, and R. Nachmias. The project fragmentation problem in personal information management. In Proc. SIGCHI, pages 271 - 274. ACM, 2006.

[3] T. Blanc-Brude and D. Scapin. What do people recall about their documents?: implications for desktop search tools. In Proc. of 12th International Conference on Intelligent user Interfaces, pages 102 - 111. ACM, 2007.

[4] S. Cohen, C. Domshlak, and N. Zwerdling. On ranking techniques for desktop search. ACM Transactions on Information Systems (TOIS), 26(2):1- 24, 2008.

[5] E. Cutrell, D. Robbins, S. Dumais, and R. Sarin. Fast, flexible filtering with Phlat: Personal search and organization made easy. In Proc. SIGCHI, volume 1, pages 261-270. Citeseer, 2006.

[6] M. Daoud, L. Tamine-Lechani, and M. Boughanem. Learning user interests for a session-based personalized search. In Proceedings of the second international symposium on Information interaction in context, pages 57- 64. ACM, 2008.

[7] D. Elsweiler and I. Ruthven. Towards task-based personal information management evaluations. In Proc. of SIGIR, pages 23- 30. ACM, 2007.

[8] J. Han and M. Kamber. Data mining: concepts and techniques. Morgan Kaufmann, 2006.

[9] S. Henderson. Personal document management strategies. In Proc. of 10th International Conference NZ Chapter of the ACM's Special Interest Group on Human-Computer Interaction, pages 69-76. ACM, 2009.

[10] W. Jones, H. Bruce, A. Foxley, and C. Munat. Planning personal projects and organizing personal information. In Proc. of American Society for Information Science and Technology, 43(1):1 - 24, 2006.

[11] W. Jones and J. Teevan. Personal information management. Univ. of Washington Pr, 2007.

[12] T. Malone. How do people organize their desks?: Implications for the design of office information systems. ACM Transactions on Information Systems (TOIS), 1(1):99-112, 1983.


[13] B. Possas, N. Ziviani, W. Meira Jr, and B. Ribeiro-Neto. Set-based vector model: An efficient approach for correlation-based ranking. ACM Transactions on Information Systems (TOIS), 23(4):397-429, 2005.

[14] X. Shen, B. Tan, and C. Zhai. Exploiting Personal Search History to Improve Search Accuracy. Personal Information Management: Now That We Are Talking, What Are We Learning?, page 94, 2006.

[15] Y. Sun, H. Li, I. Councill, J. Huang, W. Lee, and C. Giles. Personalized ranking for digital libraries based on log analysis. In Proc. of the 10th ACM workshop on Web information and data management, pages 133- 140. ACM, 2008.

[16] C. Chen, F. S.C. Tseng, and T. Liang. An Integration of WordNet and fuzzy association rule mining for multi-label document clustering. Data and Knowledge Engineering. 69(11): 1208-1226, 2010.
