
A

SEMINAR REPORT

ON

"Topic Mining Over Asynchronous Text Sequences"

SUBMITTED BY

Arvind R Kolhe

UNDER GUIDANCE OF

Prof. S.Y.Raut

DEPARTMENT OF COMPUTER ENGINEERING

PRAVARA RURAL ENGINEERING COLLEGE, Loni - 413736


Tal. Rahata, Dist. Ahmednagar, (M.S.), India

2013 – 2014


Pravara Rural Education Society's
Pravara Rural Engineering College, Loni
Department of Computer Engineering
Affiliated to University of Pune, Pune

CERTIFICATE

This is to certify that this Seminar Report entitled

"Topic Mining Over Asynchronous Text Sequences"

Submitted by

Mr. Arvind R Kolhe
Roll No. 03

a student of T.E. Computer Engineering, during the academic year 2013-2014. This


report embodies the work carried out by the

candidate, towards partial fulfillment of Third

Year Computer Engineering conferred by the

University of Pune.

Prof. S. Y. Raut (Guide)
Prof. N. B. Kadu (Seminar Co-ordinator)
Prof. S. D. Jondhale (Head of Department)

ACKNOWLEDGEMENT

I wish to express a true sense of gratitude to my seminar guide Prof. S. Y. Raut and the H.O.D. of the Computer Department, Prof. S. D. Jondhale, who at every step of this seminar contributed their valuable guidance and helped to solve each and every problem that occurred during the seminar.

This is a nice opportunity for me to present a seminar titled “TOPIC

MINING OVER ASYNCHRONOUS TEXT SEQUENCES ”.

I extend my sincere thanks to all the staff members for their kind support and encouragement during the preparation of this seminar.

I also express my thanks to all my friends who directly or indirectly

supported me during the preparation of this seminar.


Kolhe Arvind R

T.E [Computer Engg.]


INDEX

TITLE PAGE
CERTIFICATE
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS / ABBREVIATIONS
ABSTRACT

1. INTRODUCTION
2. LITERATURE SURVEY
3. OBJECTIVES / SCOPE OF SYSTEM
   3.1 PROBLEM DEFINITION
   3.2 OBJECTIVES
4. ARCHITECTURE
   4.1 Extraction Module
   4.2 Mapping Module
   4.3 Optimization Module and Main Topic
5. ALGORITHM
   5.1 Topic Extraction
      5.1.1 Major Tasks in Natural Language Processing
   5.2 Time Synchronization
   5.3 Algorithm Steps
      5.3.1 Split the text into sentences
      5.3.2 Parse the sentences
      5.3.3 Select the candidate parts
      5.3.4 Calculate the weight for each candidate topic
      5.3.5 Select the final topic
6. ADVANTAGES
7. DISADVANTAGES
8. APPLICATIONS

CONCLUSION
FUTURE SCOPE
APPENDIX
   APPENDIX A: SOME IMPORTANT DEFINITIONS
   APPENDIX B: HOW TO USE WEKA AS A TOPIC MINING TOOL
REFERENCES

LIST OF FIGURES

Fig. 1  General overview of the topic mining system
Fig. 2  Parsing output of the parser
Fig. 3  Percentage of different results for the topic mining algorithm
Fig. 4  Detail of the topic mining algorithm experiment

LIST OF TABLES

Table 1  Word occurrence in a news report
Table 2  Notations used to define the terms


ABSTRACT

Time-stamped texts, or text sequences, are ubiquitous in real-world applications, and multiple text sequences are often related to each other by sharing common topics. It is nontrivial, however, to explore this correlation in the presence of asynchronism among multiple sequences, i.e., documents from different sequences about the same topic may have different time stamps. Our algorithm consists of two alternating steps: the first step extracts common topics from multiple sequences based on the adjusted time stamps provided by the second step; the second step adjusts the time stamps of the documents according to the time distribution of the topics discovered by the first step. We perform these two steps alternately, and after iterations a monotonic convergence of our objective function can be guaranteed. For topic identification within a single document, keywords are mapped onto their corresponding ontology concepts and, by optimizing over the concept nodes, we pick a single node which we believe is the topic of the target text; however, a limited-vocabulary problem is encountered while mapping the keywords onto their corresponding concepts.


CHAPTER 1

INTRODUCTION

The growing amount of information available on the Internet has attracted many researchers to focus their work on text analysis and the processing of web documents. One of the important processes is to find the main topic of a particular web document, or of any other document. Using related ontology concepts, we can capture the semantic relations among the words in the text. For example, if the extracted words from a web document are computer and security, the mapping process will retrieve the concepts Computer, Security, and Encryption. The ontology hierarchy helps us to identify that

the security mentioned in the web document is most probably talking about computer

security which is related to hackers rather than computer robbery. As the amount of text

available online keeps growing, it becomes increasingly difficult for people to keep

track of and locate the information of interest to them. To remedy the problem of

information overload, a robust and automated text summarizer or information extractor is

needed. Topic identification is one of two very important steps in the process of

summarizing a text; the second step is summary text generation. To discover valuable

knowledge from a text sequence, the first step is usually to extract topics from the

sequence with both semantic and temporal information, which are described by two

distributions, respectively: a word distribution describing the semantics of the topic and

a time distribution describing the topic’s intensity over time. In many real-world

applications, we are facing multiple text sequences that are correlated with each other

by sharing common topics. Earlier work on this problem relied on the fundamental assumption that different sequences are always synchronous in time, or, in its own term, coordinated, meaning that the common topics share the same time distribution over different sequences. However, asynchronism among multiple sequences, i.e., documents from different sequences on the same topic having different time stamps, is actually very common in practice. For instance, in news feeds, there is no guarantee

that news articles covering the same topic are indexed by the same time stamps. There

can be hours of delay for news agencies, days for newspapers, and even weeks for


periodicals, because some sources try to provide first-hand flashes shortly after the

incidents, while others provide more comprehensive reviews afterward. Another

example is research paper archives, where the latest research topics are closely followed

by newsletters and communications within weeks or months, then the full versions may

appear in conference proceedings, which are usually published annually, and at last in

journals, which may sometimes take more than a year to appear after submission. To visualize this asynchronism, one can compare the relative frequency of occurrences of the two terms warehouse and mining across such sequences.

We do not assume that given text sequences are always synchronous. Instead,

we deal with text sequences that share common topics yet are temporally asynchronous.


CHAPTER 2

LITERATURE SURVEY

Topic mining has been extensively studied in the literature, starting with the

Topic Detection and Tracking, which aimed to find and track topics (events) in news

sequences with clustering-based techniques. In many real applications, text collections

carry generic temporal information and, thus, can be considered as text sequences. To

capture the temporal dynamics of topics, various methods have been proposed to

discover topics over time in text sequences. However, these methods were designed to

extract topics from a single sequence. A very recent work by Wang et al first proposed a

topic mining method that aimed to discover common (bursty) topics over multiple text

sequences. Their approach is different from ours because they tried to find topics that

shared common time distribution over different sequences by assuming that the

sequences were synchronous, or coordinated. Based on this premise, documents with

same time stamps are combined together over different sequences so that the word

distributions of topics in individual sequences can be discovered. In contrast, in our work we aim to find topics that are common in semantics while having asynchronous time distributions in different sequences.

In the literature of topic mining, time stamped text sequences are also referred to

as text streams. In this paper, we use the term sequence to distinguish it from the

concept of data stream in the theory literature.


CHAPTER 3

OBJECTIVES

The aim is to use data fusion, data mining, and knowledge discovery processes to

detect anomalies.

3.1 Problem Identification:

In several applications, text collections carry generic temporal information and hence can be treated as text sequences. To capture the temporal dynamics of topics, various methods have been proposed to discover topics over time in text sequences. This report considers the problem of mining common topics from multiple asynchronous text sequences and describes an effective method to solve it. We formally define the problem by introducing a principled probabilistic framework, based on which a unified objective function can be derived. Time-stamped texts, or text sequences, are

ubiquitous in real-world applications. Multiple text sequences are often related to

each other by sharing common topics. The correlation among these sequences

provides more meaningful and comprehensive clues for topic mining than those

from each individual sequence. However, it is nontrivial to explore the correlation

with the existence of asynchronism among multiple sequences, i.e., documents from

different sequences about the same topic may have different time stamps. More and more text sequences are being generated in various forms, such as news

streams, weblog articles, emails, instant messages, research paper archives, web

forum discussion threads, and so forth. To discover valuable knowledge from a text

sequence, the first step is usually to extract topics from the sequence with both

semantic and temporal information.

3.2 Objectives:

To discover valuable knowledge from a text sequence, the first step is usually to

extract topics from the sequence with both semantic and temporal information,

which are described by two distributions, respectively: a word distribution

describing the semantics of the topic and a time distribution describing the topic’s


intensity over time. In many real-world applications, we are facing multiple text

sequences that are correlated with each other by sharing common topics. Intuitively,

the interactions among these sequences could provide clues to derive more

meaningful and comprehensive topics than those found by using information from

each individual stream.

To address the problem of mining common topics from multiple asynchronous text sequences. To the best of our knowledge, this is the first attempt to solve this problem.

To formalize the problem by introducing a principled probabilistic framework and propose an objective function for it.

To develop a novel alternate optimization algorithm to maximize the objective function with a theoretically guaranteed (local) optimum.

To validate the effectiveness and advantages of the method through an extensive empirical study on two real-world data sets.

CHAPTER 4


Architecture

Generally, the automatic topic mining system has three main components: the extraction module, the mapping module, and the optimization module. The input of the system is a document, and the output is a concept node which is also the predicted topic of the target document. The concept node can take the form of one or more words. Fig. 1 shows the general overview of the topic mining system.

Fig. 1: General overview of the topic mining system
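As a rough illustration of this architecture, the following Python sketch wires the three modules together; the function bodies are placeholders rather than the report's actual implementation, and only the extract-map-optimize flow follows the description above.

# Skeleton of the three-module topic mining pipeline. The function bodies
# are placeholders; only the overall flow follows the report.

def extract_keywords(html_document: str) -> list[str]:
    """Extraction module: return a list of keywords from the document."""
    raise NotImplementedError  # see the sketch in Section 4.1

def map_to_concepts(keywords: list[str]) -> dict[str, int]:
    """Mapping module: map keywords onto ontology concept nodes."""
    raise NotImplementedError  # see the sketch in Section 4.2

def pick_main_topic(concept_weights: dict[str, int]) -> str:
    """Optimization module: shrink the concept tree and pick a single node."""
    raise NotImplementedError  # see the sketch in Section 4.3

def mine_topic(html_document: str) -> str:
    keywords = extract_keywords(html_document)
    concepts = map_to_concepts(keywords)
    return pick_main_topic(concepts)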

4.1 Extraction Module

The extraction module handles the process of extracting important sentences

from the document. Our method of extraction is based on the HTML tag. This is

because we believe some of the HTML tags indicate the location where the authors

may emphasise their ideas. For example, an author may choose the best words to describe his web page in the title tag. Therefore we choose sentences or words that act as pointers to other documents, words which are highlighted, and words located in the title tag. However, some web documents may lack such HTML tags, and in that case our extraction technique is not appropriate. In this


case, alternative ways of extracting keywords from a web document, such as word frequency and positional policy, are more applicable. Since this approach is concerned with extracting information from the web document based on HTML tags, we consider extracting information from non-structured documents as future work. The

final output of this module will be a list of keywords.
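A minimal sketch of this HTML-tag-based extraction is given below, assuming the BeautifulSoup library is installed; the particular set of "emphasis" tags is an illustrative assumption, not prescribed by the report.

# Extract keywords from title, anchor and highlighted text in an HTML page.
from bs4 import BeautifulSoup

def extract_keywords(html_document: str) -> list[str]:
    soup = BeautifulSoup(html_document, "html.parser")
    words = []
    # Words in the title tag, where an author often describes the page.
    if soup.title and soup.title.string:
        words.extend(soup.title.string.split())
    # Anchor text: words that act as pointers to other documents.
    for a in soup.find_all("a"):
        words.extend(a.get_text().split())
    # Highlighted words and headings (assumed emphasis tags).
    for tag in soup.find_all(["b", "strong", "em", "h1", "h2", "h3"]):
        words.extend(tag.get_text().split())
    return [w.strip(".,;:!?").lower() for w in words if w.strip(".,;:!?")]

# extract_keywords("<html><head><title>Computer Security</title></head>"
#                  "<body><b>Encryption</b> basics</body></html>")
# -> ['computer', 'security', 'encryption']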

4.2 Mapping Module

The mapping module will take the output of the extraction module as an input.

The keywords will be mapped onto the words of the ontology concepts. However, there is

a possibility that the keyword may not be able to be mapped onto its corresponding

concept because there is no such concept available in the ontology. This situation

requires an alternative way to map the keyword onto the concept. The alternative

way is to use the extended concept as a "middle man" so that the mapping between the Yahoo concept and the keyword becomes possible.
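The toy sketch below illustrates the mapping idea only; the tiny concept set and the "extended concept" dictionary are invented for illustration, whereas the report refers to a full ontology such as the Yahoo concept hierarchy.

# Map keywords onto ontology concepts, using extended concepts as a
# "middle man" when no direct concept exists (hypothetical data).
ONTOLOGY_CONCEPTS = {"computer", "security", "encryption"}
EXTENDED_CONCEPTS = {"hacker": "security", "cipher": "encryption"}

def map_to_concepts(keywords: list[str]) -> dict[str, int]:
    concept_counts: dict[str, int] = {}
    for word in keywords:
        if word in ONTOLOGY_CONCEPTS:
            concept = word                     # direct mapping
        elif word in EXTENDED_CONCEPTS:
            concept = EXTENDED_CONCEPTS[word]  # mapping via extended concept
        else:
            continue                           # limited-vocabulary case
        concept_counts[concept] = concept_counts.get(concept, 0) + 1
    return concept_counts

# map_to_concepts(["computer", "hacker", "firewall"])
# -> {'computer': 1, 'security': 1}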

4.3 Optimization Module and Main Topic

The optimization process will shrink the ontology tree into an optimized tree

where only active concepts and the intermediate active concepts are chosen. This smaller optimized tree will then be reduced to a single path. The single path is retrieved using the Maximal Spanning Tree algorithm (analogous to the Minimal Spanning Tree algorithm). The Maximal Spanning Tree algorithm finds the path that has the heaviest nodes. The criterion the algorithm uses to choose the heaviest node is the accumulated mixture weight of that node.

The foundation of the topic identification process is frequent itemsets. In TopCat, a

frequent itemset is a group of named entities that occur together in multiple articles.

Co-occurrence of words has been shown to carry useful information. What this

information really gives us is correlated items rather than a topic. However, we found that correlated named entities frequently occurred within a recognizable topic, so the interesting correlations enabled us to identify a topic.
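The sketch below illustrates only the path-selection idea: given an already pruned concept tree with accumulated node weights, it returns the root-to-leaf path with the heaviest total weight. The tree representation and the toy weights are assumptions made for illustration.

# Find the root-to-leaf path with the heaviest accumulated node weight.
def heaviest_path(tree: dict[str, list[str]],
                  weights: dict[str, float],
                  node: str) -> tuple[float, list[str]]:
    children = tree.get(node, [])
    if not children:
        return weights.get(node, 0.0), [node]
    best_weight, best_path = max(
        heaviest_path(tree, weights, child) for child in children)
    return weights.get(node, 0.0) + best_weight, [node] + best_path

# Toy optimized tree: Computers -> {Security -> {Encryption}, Hardware}
tree = {"Computers": ["Security", "Hardware"], "Security": ["Encryption"]}
weights = {"Computers": 4, "Security": 5, "Hardware": 1, "Encryption": 3}
print(heaviest_path(tree, weights, "Computers"))
# -> (12, ['Computers', 'Security', 'Encryption'])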


CHAPTER 5

Algorithm

In this section, we show how to solve our objective function through an alternate (constrained) optimization scheme.
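The skeleton below sketches this alternating scheme; the two helper functions are placeholders for the steps described in the following subsections, and the convergence tolerance is an assumed value.

# Alternate optimization: topic extraction and time synchronization are
# repeated until the objective function stops improving. Placeholder bodies.
def extract_topics(documents, timestamps, num_topics):
    """Step 1: extract common topics given the (adjusted) time stamps."""
    raise NotImplementedError

def adjust_timestamps(documents, timestamps, topics):
    """Step 2: adjust time stamps using the topics' time distributions."""
    raise NotImplementedError

def mine_common_topics(documents, timestamps, num_topics, tol=1e-4):
    previous_objective = float("-inf")
    while True:
        topics, objective = extract_topics(documents, timestamps, num_topics)
        timestamps = adjust_timestamps(documents, timestamps, topics)
        # The objective is non-decreasing, so the loop converges monotonically.
        if objective - previous_objective < tol:
            return topics, timestamps
        previous_objective = objective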

5.1 Topic Extraction

Topic extraction from a text corpus is the foundation of many topic analysis tasks, such as topic trend prediction, opinion extraction, etc. Since a hierarchical structure is characteristic of topics, it is preferable for a topic extraction algorithm to output the topic description with this kind of structure. However, the hierarchical topic structure extracted by most current topic analysis algorithms cannot provide a meaningful description for all subtopics in the hierarchical tree. A document may describe a topic from different aspects; some are general, while others are detailed. For example, a China Daily news report titled "Economic hubs face tough times amid crisis" states that Guangdong and Shanghai, the two economic powerhouses of China, have suffered from the global financial crisis, and forecasts even worse prospects. To support this statement, it provides several details about service, market, work, etc. In this article, some of the word occurrences are counted in Table 1.

Obviously, we can see that some words that describe the general aspect of the topic, like 'Guangdong', 'Shanghai', 'Economy', 'Finance', 'Crisis', etc., appear at a higher frequency, while other words that describe the detailed aspects of the topic appear at a lower frequency, such as 'Service', 'Product', and 'Job'.

Word         Occurrences
Guangdong    8
Economy      7
Shanghai     4
City         4
Year         4
Finance      2
Crisis       2
China        2
Service      2
Product      1
Market       1
Industry     1
Job          1

Table 1: Word occurrence in a news report
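A minimal sketch of how counts like those in Table 1 could be produced is shown below; the stop-word list is an assumption.

# Count word occurrences in a news report, ignoring a few stop words.
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "in", "to", "a", "have", "from", "for"}

def word_occurrences(text: str) -> Counter:
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

report = ("Guangdong and Shanghai, the two economic powerhouses of China, "
          "have suffered from the global financial crisis.")
print(word_occurrences(report).most_common(5))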

Although the example is simple, it provides us with a clear way to recognize topic grain. For a text corpus which contains a large number of documents, we can suppose that word document frequency can represent the topic grain of the corpus to a degree. However, this kind of topic grain is still poor at reflecting the actual semantics of the text. Hence, topic granularity, which is closely related to the semantic characteristics, should be explored further to provide a semantic discrimination indicator for topics.

Topics Extraction is Textalytics' solution for extracting the different elements

present in sources of information. This detection process is carried out by

combining a number of complex natural language processing techniques that make it possible to obtain morphological, syntactic, and semantic analyses of a text and use them to identify different types of significant elements. The elements identified are classified

according to the following predefined categories:

Named entities: people, organizations, places, etc.

Concepts: significant keywords in the text

Time expressions

Money expressions

URIs

Phone number expressions

Other expressions: alphanumeric patterns
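For a rough idea of such element detection, the sketch below uses simplified regular expressions for three of the categories above (URIs, money, and phone-number expressions); real extractors rely on full morphological and syntactic analysis, so these patterns are only illustrative.

# Very simplified patterns for a few element types (illustrative only).
import re

PATTERNS = {
    "uri": re.compile(r"https?://\S+"),
    "money": re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?"),
    "phone": re.compile(r"\+?\d[\d\s\-]{7,}\d"),
}

def extract_elements(text: str) -> dict[str, list[str]]:
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

sample = "Visit https://example.com or call +91 20 1234 5678; tickets cost $25.50."
print(extract_elements(sample))
# {'uri': ['https://example.com'], 'money': ['$25.50'],
#  'phone': ['+91 20 1234 5678']}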

5.1.1 Major tasks in Natural Language Processing


The following is a list of some of the most commonly researched tasks in NLP that are relevant to topic extraction:

Natural language generation: Convert information from computer databases

into readable human language.

Natural language understanding: Identify the intended meaning among the multiple possible meanings that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts.

Parsing: Determine the parse tree (grammatical analysis) of a given sentence.

The grammar for natural languages is ambiguous and typical sentences have

multiple possible analyses.

Sentence breaking: Find the sentence boundaries.

Topic segmentation: Given a chunk of text, separate it into segments, each devoted to a topic, and identify the topic of each segment.

Information extraction (IE): This is concerned in general with the

extraction of semantic information from text.


5.2 Time Synchronization

Once the common topics are extracted, we match documents in all sequences to

these topics and adjust their time stamps to synchronize the sequences.
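The sketch below gives a concrete but simplified version of the adjust_timestamps placeholder from the earlier skeleton: each document is matched to the topic that best explains its words, and its time stamp is pulled toward that topic's peak time. The scoring and the update rule are illustrative assumptions, not the exact formulas of the underlying method.

# Match each document to its most likely topic and shift its time stamp
# toward that topic's peak time (simplified illustration).
def adjust_timestamps(documents, timestamps, topics):
    """documents: list of word lists; topics: list of dicts with a word
    distribution ('words': {word: prob}) and a peak time ('peak')."""
    new_timestamps = []
    for doc, t in zip(documents, timestamps):
        best = max(topics,
                   key=lambda z: sum(z["words"].get(w, 1e-6) for w in doc))
        new_timestamps.append((t + best["peak"]) // 2)
    return new_timestamps

topics = [{"words": {"warehouse": 0.6, "data": 0.4}, "peak": 10},
          {"words": {"mining": 0.7, "text": 0.3}, "peak": 14}]
docs = [["text", "mining", "survey"], ["data", "warehouse", "design"]]
print(adjust_timestamps(docs, [8, 8], topics))   # -> [11, 9]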

5.3 Algorithm Steps

Our algorithm consists of five different steps:

1. Split the text into sentences

2. Parse the sentences

3. Select the candidate parts

4. Calculate the weight for each candidate topic

5. Select the final topic

5.3.1 Split the text document into sentences


The first step in our algorithm is splitting the given text into sentences. In fact, the proposed algorithm can be considered a "divide and conquer" approach; therefore, the first step should divide the problem until it cannot be divided further. A sentence is the smallest text part that is capable of having a topic. Hence, we split the document into its sentences using a Text Splitter tool, which splits a text into sentences. By applying this tool we obtain a set of sentences.
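A minimal stand-in for the Text Splitter tool mentioned above, assuming NLTK and its punkt sentence model are installed:

# Split a text into sentences with NLTK (newer NLTK versions may also
# require the "punkt_tab" resource).
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

text = ("Topic mining identifies the main topic of a document. "
        "A sentence is the smallest text part capable of having a topic.")
print(sent_tokenize(text))
# ['Topic mining identifies the main topic of a document.',
#  'A sentence is the smallest text part capable of having a topic.']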

5.3.2 Parse the sentences

One approach would try to calculate a weight for each noun and verb and then create all possible pairs; that may cause overhead, because weights are calculated for some unimportant terms. Our proposed algorithm instead parses the sentences and determines the candidate terms first, to avoid any useless calculation. We believe that syntactic parts like the Noun Phrase (NP) and the Verb Phrase (VP) play the most important roles in conveying the meaning of a sentence, and therefore we consider them, rather than grammatical roles like noun and verb, to identify the candidate topic for each sentence. For example, in the sentence "My dog also likes eating bananas", the parser recognizes "my dog" as the NP subject and "likes eating bananas" as the VP.
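The parsing step can be sketched as follows, assuming spaCy and its en_core_web_sm model are installed; the noun phrases and dependency labels printed here are the kind of parser output that Fig. 2 refers to.

# Parse the example sentence and show its noun phrases and dependencies.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("My dog also likes eating bananas")

print([chunk.text for chunk in doc.noun_chunks])  # typically ['My dog', 'bananas']
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
# e.g. "dog NOUN nsubj likes", "likes VERB ROOT likes", ...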

5.3.3 Select the candidate parts

We select noun phrases (NPs) and the head of the verb phrase (VP) instead of just noun pairs and noun-verb pairs. We assume that the most important parts of a sentence are the NPs that function as subject or complement and the head of the VP. To illustrate this, in the sentence "My dog also likes eating bananas", the phrase "my dog" is selected as the subject NP, "likes" is selected as the head of the VP, and "bananas" as an NP complement. The combination of these three segments is considered the candidate topic. Hence, the topic for this sentence is identified as "My dog likes bananas". At the end of this step, we have a set of candidate topics.
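A small sketch of this candidate-topic construction, again assuming spaCy with en_core_web_sm; the particular dependency labels used to pick the subject and complement are assumptions that happen to reproduce the example above.

# Build a candidate topic from the subject NP, the head verb of the VP,
# and an object/complement NP.
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_topic(sentence: str) -> str:
    doc = nlp(sentence)
    subject = next((c.text for c in doc.noun_chunks
                    if c.root.dep_ in ("nsubj", "nsubjpass")), "")
    verb = next((t.text for t in doc if t.dep_ == "ROOT"), "")
    obj = next((c.text for c in doc.noun_chunks
                if c.root.dep_ in ("dobj", "pobj", "attr")), "")
    return " ".join(part for part in (subject, verb, obj) if part)

print(candidate_topic("My dog also likes eating bananas"))
# expected: "My dog likes bananas", as in the example above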

5.3.4 Calculate the weight for each candidate topic

At this point we calculate a TF-IDF (term frequency-inverse document frequency) weight for only the required candidate terms. The term frequency of a word is the number of its occurrences in a particular document divided by the total number of words in that document; the inverse document frequency is the logarithm of the total number of documents divided by the number of documents that contain the word. The weight is the product of the two.
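A small sketch of this TF-IDF weighting, using the standard definitions stated above; the toy corpus is invented for illustration.

# tf = term count in the document / total terms in the document
# idf = log(number of documents / number of documents containing the term)
import math

def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    tf = document.count(term) / len(document)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["dog", "likes", "bananas", "dog"],
          ["cat", "likes", "fish"],
          ["dog", "chases", "cat"]]
print(round(tf_idf("bananas", corpus[0], corpus), 3))  # rare term, higher weight
print(round(tf_idf("likes", corpus[0], corpus), 3))    # common term, lower weight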

5.3.5 Select the final topic

Once we have determined the candidate topic and its associated weight for each sentence, we select the most heavily weighted one and consider it the main topic for the whole document. In case there is more than one candidate topic with the greatest weight, we consider all of them as the main topic.
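This selection rule can be sketched as follows; the candidate strings and weights are illustrative.

# Keep every candidate topic tied for the greatest weight.
def select_final_topics(weighted_candidates: dict[str, float]) -> list[str]:
    best = max(weighted_candidates.values())
    return [topic for topic, w in weighted_candidates.items() if w == best]

print(select_final_topics({"My dog likes bananas": 0.28,
                           "Dog eats food": 0.28,
                           "Cat likes fish": 0.11}))
# -> ['My dog likes bananas', 'Dog eats food']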

Example: "topic" stands for the stream of terms which carries the semantics and meaning of the text inside the document. However, it is not necessarily the same as the title which is


embossed at the top of the document. Therefore, one proper method to evaluate the accuracy of topic mining could be to compare the topic identified by the topic mining algorithm with the document's title.

Fig. 2: Parsing result with the parser tool

Fig. 3: Percentage of different results for topic mining algorithm


Fig. 4: Detail of automatic topic identification algorithm experiment


CHAPTER 6

Advantages

We tackle the problem of mining common topics from multiple asynchronous text sequences. We propose a novel method which can automatically discover and fix potential asynchronism among sequences and consequently extract better common topics. The key idea of our method is to introduce a self-refinement process by utilizing the correlation between the semantic and temporal information in the sequences. The following are some advantages of this technique:

It performs topic extraction and time synchronization alternately to optimize a

unified objective function.

A local optimum is guaranteed by our algorithm.

The effectiveness of the method has been justified on two real-world data sets, in comparison with a baseline method.

It is able to find meaningful and discriminative topics from asynchronous text

sequences.

It significantly outperforms the baseline method, evaluated both qualitatively and quantitatively.

The performance of our method is robust and stable against different parameter

settings and random initialization.


CHAPTER 7

Disadvantages

The following are some cases where the topic mining over asynchronous text sequences method may not work well:

There is no correlation between the semantic and temporal information of

topics, i.e., the time distribution of any topic is random (no bursty behavior).

The temporal order of documents as given by their original time stamps varies

greatly from the temporal order of underlying topics, e.g., Topic A appears

before Topic B in one sequence, but after B in another. In either case, the better

choice would be discarding the original temporal information and treating the

text sequences as a collection of documents.


CHAPTER 8

Applications

The technology is now broadly applied for a wide variety of web applications.

Applications can be sorted into a number of categories by analysis type or by

business function. Application categories include:

Web Search Engine: One possible application of topic mining is to utilize it for Web search. For example, an incremental document clustering algorithm can be applied to

a stream of Web pages returned by a search engine. Since topic mining can build

a document cluster hierarchy incrementally, a user can browse a document

cluster hierarchy instead of examining a flat list of documents. In addition, topic

ontologies can be used to suggest alternative query terms to refine the query.

Monitoring: Especially monitoring and analysis of online plain text sources

such as Internet news, blogs.

Biomedical applications: GoPubMed is a knowledge-based search engine for

biomedical texts.

Online media applications: Text mining is being used by large media

companies, such as the Tribune Company, to clarify information and to provide

readers with better search experiences.


Conclusion

This report has discussed the problem of mining common topics from multiple asynchronous text sequences.

We propose a novel method which can automatically discover and fix potential

asynchronism among sequences and consequently extract better common topics.

The key idea of our method is to introduce a self-refinement process by utilizing

correlation between the semantic and temporal information in the sequences. It

performs topic extraction and time synchronization alternately to optimize a unified

objective function. A local optimum is guaranteed by our algorithm. We justified

the effectiveness of our method on two real-world data sets, with comparison to a

baseline method. Empirical results suggest that the method is able to find meaningful and discriminative topics from asynchronous text sequences; that it significantly outperforms the baseline method, evaluated both qualitatively and quantitatively; and that the

performance of our method is robust and stable against different parameter settings

and random initialization.


Future Scope

In the future we plan to further reduce the computational complexity of our time

synchronization algorithm so that our method can be applied to real-time text stream

processing. The method may not work well if there is no correlation between the semantic and temporal information of topics, i.e., the time distribution of any topic is random (no bursty behavior), or if the temporal order of documents as given by their original time stamps varies greatly from the temporal order of the underlying topics.


Appendix

Appendix A: Some Important Definitions Related To Topic Mining

Here we discuss the terms related to topic mining. The notations used to define these terms are shown in Table 2.

Table 2: Notations used to define the terms

Mining: Mining is the extraction of something valuable from a source.

Data Mining: The analysis step of the "Knowledge Discovery in Databases" (KDD) process. Data mining, an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets and turning them into information that can be used to increase revenue, cut costs, or both. Data

mining software is one of a number of analytical tools for analyzing data. It

allows users to analyze data from many different dimensions or angles,

categorize it, and summarize the relationships identified. Technically, data

mining is the process of finding correlations or patterns among dozens of

fields in large relational databases.

Data Warehouse: a large store of data accumulated from a wide range of

sources within a company and used to guide management decisions. The

electronic storage of a large amount of information by a business.

Warehoused data must be stored in a manner that is secure, reliable, easy to


retrieve and easy to manage. The concept of data warehousing originated in

1988 with the work of IBM researchers Barry Devlin and Paul Murphy. The

need to warehouse data evolved as computer systems became more complex

and handled increasing amounts of data.

Topic Warehouse: As the Web continues to grow as a vehicle for the

distribution of information, many news organizations are providing

newswire services through the Internet. Web news articles are composed of

hyperlinks, audio, video, images, and text. However, since not all news

stories have corresponding multimedia data, text can be a rich source of

information about the news. A topic warehouse is a large collection of text streams and documents that are correlated with each other.

Text Mining: Text mining, also referred to as text data mining, roughly

equivalent to text analytics, refers to the process of deriving high-quality

information from text.

Text Sequence: A text sequence S is a sequence of N documents. Each document d is a collection of words over a vocabulary V and is indexed by a unique time stamp t.

Common Topic: A common topic Z over text sequences is defined by a

word distribution over vocabulary V and a time distribution over time

stamps.

Asynchronism: Given M text sequences in which documents are indexed by time stamps, asynchronism means that the time stamps of the documents

sharing the same topic in different sequences are not properly aligned.

Data Sets: A dataset is a collection of data. Most commonly a dataset

corresponds to the contents of a single database table, or a single statistical

data matrix

Ontology: In computer science and information science, an ontology

formally represents knowledge as a set of concepts within a domain, using a

shared vocabulary to denote the types, properties and interrelationships of

those concepts.


Appendix B: How to Use WEKA as a Topic Mining Tool

Weka is a collection of machine learning algorithms for data mining tasks. Weka

supports several standard data mining tasks, more specifically, data preprocessing,

clustering, classification, regression, visualization, and feature selection. All of

Weka's techniques are predicated on the assumption that the data is available as a

single flat file or relation, where each data point is described by a fixed number of

attributes (normally, numeric or nominal attributes, but some other attribute types

are also supported). Weka provides access to SQL databases using Java Database

Connectivity and can process the result returned by a database query. It is not

capable of multi-relational data mining, but there is separate software for converting

a collection of linked database tables into a single table that is suitable for

processing using Weka. Another important area that is currently not covered by the

algorithms included in the Weka distribution is sequence modeling. The algorithms

can either be applied directly to a dataset or called from your own Java code. Weka

contains tools for data pre-processing, classification, regression, clustering,

association rules, and visualization. It is also well-suited for developing new

machine learning schemes. We can use Weka as a topic mining tool by treating the data contained in the data warehouse as text sequences, text streams, and documents containing synchronous and asynchronous text sequences that are correlated with each other and share common topics.


References

1. D.M. Blei and J.D. Lafferty, "Dynamic Topic Models," Proc. Int'l Conf. Machine Learning (ICML), pp. 113-120, 2006.

2. Z. Li, B. Wang, M. Li, and W.-Y. Ma, "A Probabilistic Model for Retrospective News Event Detection," Proc. Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), pp. 106-113, 2005.

3. Q. Mei and C. Zhai, "Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 198-207, 2005.

4. A. Asuncion, P. Smyth, and M. Welling, "Asynchronous Distributed Learning of Topic Models," Proc. Advances in Neural Information Processing Systems (NIPS), pp. 81-88, 2008.
