Metody logiczne w analizie danych (Logical Methods in Data Analysis)

Page 1: Metody logiczne w analizie danych

Tolerance Rough Set Model and its Applications in Web Intelligence

Hung Son Nguyen

Institute of Mathematics, Warsaw University

Atlanta, Nov 17, 2013

Page 2: Metody logiczne w analizie danych

Outline

1 Introduction

2 Tolerance Rough Sets Model (TRSM)

3 Clustering of Web Search Results

4 Extended TRSM

5 SONCA

6 Conclusions

Page 4: Metody logiczne w analizie danych

Rough set approach to concept approximations

• Lower approximation: we are sure that these objects are in the set.
• Upper approximation: it is possible (likely, feasible) that these objects belong to our set (concept). They roughly belong to the set.

[Figure: the set X within the universe U, together with its lower approximation and upper approximation by A.]

Page 5: Metody logiczne w analizie danych

Generalized definition

Rough approximation of the concept C (induced by a sample X): any pair P = (L, U) satisfying the following conditions:

1 L ⊆ U ⊆ 𝒰;
2 L, U are subsets of 𝒰 expressible in the language ℒ2;
3 L ∩ X ⊆ C ∩ X ⊆ U ∩ X;
4 (∗) the set L is maximal (and U is minimal) in the family of sets definable in ℒ satisfying (3).

(Here 𝒰 is the universe of objects and ℒ the description language; they are distinct from the approximations U and L.)

Rough membership function of concept C: any function f : 𝒰 → [0, 1] such that the pair (Lf, Uf), where
• Lf = {x ∈ 𝒰 : f(x) = 1} and
• Uf = {x ∈ 𝒰 : f(x) > 0},

is a rough approximation of C (induced by the sample X).
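As an illustration of the definition above, here is a minimal Python sketch that builds one possible membership function and reads the pair (Lf, Uf) off it. It assumes the classical rough membership function f(x) = |[x] ∩ C| / |[x]| over indiscernibility classes, which is only one admissible choice of f; the partition, the concept, and the function names are made up for the example.

```python
from fractions import Fraction


def rough_membership(partition, concept):
    """Map each object x to |[x] ∩ concept| / |[x]|, where [x] is its block."""
    f = {}
    for block in partition:
        degree = Fraction(len(block & concept), len(block))
        for x in block:
            f[x] = degree
    return f


def approximations(f):
    """Return (L_f, U_f): objects with f(x) = 1 and objects with f(x) > 0."""
    lower = {x for x, value in f.items() if value == 1}
    upper = {x for x, value in f.items() if value > 0}
    return lower, upper


# Hypothetical toy data: six objects in three indiscernibility classes.
partition = [{1, 2}, {3, 4}, {5, 6}]
concept = {1, 2, 3}

f = rough_membership(partition, concept)
lower, upper = approximations(f)
```

In this toy case the sample is the whole universe, so condition 3 reduces to lower ⊆ concept ⊆ upper.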

Page 7: Metody logiczne w analizie danych

Examples of rough set models

• Standard rough sets defined by attributes:
  • the lower and upper approximations of X by attributes from B are defined by indiscernibility classes.
• Tolerance-based rough sets:
  • use a tolerance relation (also called a similarity relation) instead of the indiscernibility relation.
• Variable Precision Rough Sets (VPRS):
  • allow some admissible level 0 ≤ β ≤ 1 of classification inaccuracy.
• Generalized approximation space:

Page 9: Metody logiczne w analizie danych

TRSM: Tolerance Rough Sets Model

• Let D = {d1, d2, ..., dN} be a set of documents and T = {t1, t2, ..., tM} the set of index terms for D.

• TRSM is an approximation space R = (T, Iθ, ν, P) determined over the set of terms T as follows:

  • Tolerance classes of terms (an uncertainty function parameterized by a threshold θ):

    $$I_\theta(t_i) = \{t_j \mid f_D(t_i, t_j) \ge \theta\} \cup \{t_i\}$$

    where $f_D(t_i, t_j) = |\{d \in D : d \text{ contains both } t_i \text{ and } t_j\}|$.

  • Vague inclusion function: for ti ∈ T and X ⊆ T,

    $$\mu(t_i, X) = \nu(I_\theta(t_i), X) = \frac{|I_\theta(t_i) \cap X|}{|I_\theta(t_i)|}$$

  • Structural function: all tolerance classes of terms are considered structural subsets, i.e. P(Iθ(ti)) = 1 for all ti ∈ T.
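The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: documents are assumed to be given as sets of terms, θ is assumed to be at least 1, and the function names and the toy collection are hypothetical.

```python
from collections import Counter
from itertools import combinations


def tolerance_classes(documents, theta):
    """Compute I_theta(t) for every term t; `documents` is a list of term sets.

    Assumes theta >= 1, so only pairs that actually co-occur are considered.
    """
    cooccurrence = Counter()
    terms = set()
    for doc in documents:
        terms |= doc
        # Count each unordered pair of terms once per document.
        for a, b in combinations(sorted(doc), 2):
            cooccurrence[(a, b)] += 1
    classes = {t: {t} for t in terms}  # every term belongs to its own class
    for (a, b), count in cooccurrence.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def vague_inclusion(tolerance_class, subset):
    """nu(I, X) = |I ∩ X| / |I|."""
    return len(tolerance_class & subset) / len(tolerance_class)


# Hypothetical toy collection of four snippets.
docs = [
    {"jaguar", "mac", "os"},
    {"jaguar", "mac", "os"},
    {"jaguar", "cars"},
    {"jaguar", "onca"},
]
classes = tolerance_classes(docs, theta=2)
inclusion = vague_inclusion(classes["jaguar"], {"jaguar", "mac"})
```

With θ = 2 only the pairs that co-occur in both of the first two snippets end up in a common tolerance class; "cars" and "onca" keep singleton classes.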

Page 10: Metody logiczne w analizie danych

Tolerance classes

[Figure: documents d1, d2, d3 and terms t1, ..., t6.]

Page 11: Metody logiczne w analizie danych

Example: tolerance classes

Tolerance classes for the query “jaguar”, computed from 200 results (returned by Google) with θ = 9:

Term        | Tolerance class | Document frequency
Atari       | Atari, Jaguar | 10
Mac         | Mac, Jaguar, OS, X | 12
onca        | onca, Jaguar, Panthera | 9
Jaguar      | Atari, Mac, onca, Jaguar, club, Panthera, new, information, OS, site, Welcome, X, Cars | 185
club        | Jaguar, club | 27
Panthera    | onca, Jaguar, Panthera | 9
new         | Jaguar, new | 29
information | Jaguar, information | 9
OS          | Mac, Jaguar, OS, X | 15
site        | Jaguar, site | 19
Welcome     | Jaguar, Welcome | 21
X           | Mac, Jaguar, OS, X | 14
Cars        | Jaguar, Cars | 24

Page 12: Metody logiczne w analizie danych

• In the context of Information Retrieval, a tolerance class represents a concept that is characterized by the terms it contains.

• By varying the threshold θ (e.g., relative to the size of the document collection), one can control the degree of relatedness of the words in tolerance classes (or the preciseness of the concept represented by a tolerance class).

• Finally, the lower and upper approximations of any subset X ⊆ T can be determined in the obtained tolerance space R = (T, Iθ, ν, P) as, respectively,

$$L_R(X) = \{t_i \in T \mid \nu(I_\theta(t_i), X) = 1\}$$

$$U_R(X) = \{t_i \in T \mid \nu(I_\theta(t_i), X) > 0\}$$
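A short sketch of the two approximations, assuming the tolerance classes are already available as a dictionary from each term to its class. Since ν(I, X) = 1 exactly when I ⊆ X, and ν(I, X) > 0 exactly when I and X overlap, plain set operations suffice. The classes below are hypothetical, loosely modelled on the “jaguar” example.

```python
def lower_approximation(classes, subset):
    """Terms whose whole tolerance class lies inside `subset` (nu = 1)."""
    return {t for t, cls in classes.items() if cls <= subset}


def upper_approximation(classes, subset):
    """Terms whose tolerance class overlaps `subset` (nu > 0)."""
    return {t for t, cls in classes.items() if cls & subset}


# Hypothetical tolerance classes for six terms.
classes = {
    "mac": {"mac", "os", "x"},
    "os": {"mac", "os", "x"},
    "x": {"mac", "os", "x"},
    "onca": {"onca", "panthera"},
    "panthera": {"onca", "panthera"},
    "cars": {"cars"},
}
X = {"mac", "os", "x", "onca"}

lower = lower_approximation(classes, X)
upper = upper_approximation(classes, X)
```

Here "onca" falls out of the lower approximation because its class also contains "panthera", while "panthera" enters the upper approximation for the same reason.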

Page 13: Metody logiczne w analizie danych

Enriching document representation

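• Let di = {ti1, ti2, ..., tik} be a document in D.
• A “richer” representation of di can be obtained from its upper approximation in TRSM, i.e.,

$$U_R(d_i) = \{t_j \in T \mid \nu(I_\theta(t_j), d_i) > 0\}$$

• Extended TF*IDF weighting scheme:

$$
w^{new}_{ij} =
\begin{cases}
\left(1 + \log f_{d_i}(t_j)\right) \cdot \log \dfrac{N}{f_D(t_j)} & \text{if } t_j \in d_i \\[6pt]
\min_{t_k \in d_i} w_{ik} \cdot \dfrac{\log \frac{N}{f_D(t_j)}}{1 + \log \frac{N}{f_D(t_j)}} & \text{if } t_j \in U_R(d_i) \setminus d_i \\[6pt]
0 & \text{if } t_j \notin U_R(d_i)
\end{cases}
$$

where wij is the standard TF*IDF weight for term tj in document di, f_{d_i}(t_j) is the number of occurrences of tj in di, and fD(tj) is the number of documents in D that contain tj.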
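The weighting scheme can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes natural logarithms, a non-empty document, and no length normalization; the function name, argument names, and the toy numbers are hypothetical.

```python
import math


def extended_tfidf(doc_counts, upper_approx, doc_freq, n_docs):
    """Extended TF*IDF weights for one document d_i.

    doc_counts:   term -> number of occurrences in d_i
    upper_approx: set of terms in the upper approximation U_R(d_i)
    doc_freq:     term -> number of documents in D containing the term
    n_docs:       N, the size of the collection
    Terms outside U_R(d_i) are simply absent from the result (weight 0).
    """
    idf = {t: math.log(n_docs / doc_freq[t]) for t in upper_approx | set(doc_counts)}
    # Case 1: terms that occur in the document get the standard weight.
    weights = {t: (1 + math.log(c)) * idf[t] for t, c in doc_counts.items()}
    w_min = min(weights.values())
    # Case 2: terms added by the upper approximation get a damped weight
    # that never exceeds the smallest weight of an original term.
    for t in upper_approx - set(doc_counts):
        weights[t] = w_min * idf[t] / (1 + idf[t])
    return weights


# Hypothetical toy numbers: a collection of 100 documents.
weights = extended_tfidf(
    doc_counts={"rough": 2, "sets": 1},
    upper_approx={"rough", "sets", "approximation"},
    doc_freq={"rough": 10, "sets": 20, "approximation": 5},
    n_docs=100,
)
```

Because the factor log(N/fD) / (1 + log(N/fD)) is below 1, an added term such as "approximation" always ranks below every term that actually occurs in the document.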

Page 14: Metody logiczne w analizie danych
Page 15: Metody logiczne w analizie danych

Title: EconPapers: Rough sets bankruptcy prediction models versus auditor
Description: Rough sets bankruptcy prediction models versus auditor signalling rates. Journal of Forecasting, 2003, vol. 22, issue 8, pages 569-586. Thomas E. McKee. ...

Original vector         | Using upper approximation
Term         Weight     | Term          Weight
auditor      0.567      | auditor       0.564
bankruptcy   0.4218     | bankruptcy    0.4196
signalling   0.2835     | signalling    0.282
EconPapers   0.2835     | EconPapers    0.282
rates        0.2835     | rates         0.282
versus       0.223      | versus        0.2218
issue        0.223      | issue         0.2218
Journal      0.223      | Journal       0.2218
MODEL        0.223      | MODEL         0.2218
prediction   0.1772     | prediction    0.1762
Vol          0.1709     | Vol           0.1699
                        | applications  0.0809
                        | Computing     0.0643

Page 17: Metody logiczne w analizie danych

Clustering web search results

1 Searching on the web is tedious and time-consuming:
  • search engines cannot index the huge and highly dynamic web content,
  • the user’s “intention behind the search” is not clearly expressed, which results in too general, short queries.

2 The results returned by a search engine can number from hundreds to hundreds of thousands of documents.

3 Clustering of search results = grouping similar snippets together, which:
  • facilitates presentation of the results in a more compact form,
  • enables thematic browsing of the result set.

Page 19: Metody logiczne w analizie danych
Page 21: Metody logiczne w analizie danych

Snippet clustering problems

• A poor representation of snippets can result in low correlation between documents and document clusters;

• Besides good-quality clusters, a meaningful, concise description of each cluster is also required;

• The algorithm must be fast enough to process results on-line.

Existing solutions: use domain knowledge, such as a thesaurus or an ontology, to correct the similarity relation between snippets.

• Global thesaurus, e.g., WordNet;
• Local and context relationships between terms.

Page 22: Metody logiczne w analizie danych

Example: Vivisimo screenshot

Page 23: Metody logiczne w analizie danych

Rough set approach to snippet clustering

1 Approximation of the similarity relation on the set of terms ⇒ tolerance rough set model (TRSM);

2 Enriching the document representation using the upper approximation of snippets in TRSM;

3 Clustering the enriched representations of snippets.

Page 24: Metody logiczne w analizie danych

Tolerance Rough set Clustering (TRC) algorithm:

1 document preprocessing: in TRC, the following standard preprocessing steps are performed on snippets: text cleansing, text stemming, and stop-word elimination.

2 document representation building: two main procedures, index term selection and term weighting, are performed.

3 tolerance class generation: see next slide

4 clustering: k-means clustering on the enriched document representations; nearest-neighbor assignment is used to assign unclassified documents to clusters.

5 cluster labeling: phrase labeling.
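As a rough illustration of the assignment part of step 4, the sketch below places a leftover document in the cluster whose representative vector is most similar to it. Cosine similarity over sparse term-weight dictionaries and assignment to the nearest cluster representative are assumptions made for this example; the slides do not specify the similarity measure or the exact nearest-neighbor rule, and all names and data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as term -> weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_to_nearest(doc, representatives):
    """Return the key of the cluster representative most similar to `doc`."""
    return max(representatives, key=lambda k: cosine(doc, representatives[k]))


# Hypothetical cluster representatives and one unclassified snippet.
representatives = {
    "cars": {"jaguar": 1.0, "cars": 1.0},
    "animal": {"jaguar": 1.0, "onca": 1.0},
}
snippet = {"onca": 1.0, "panthera": 1.0}
label = assign_to_nearest(snippet, representatives)
```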

Page 25: Metody logiczne w analizie danych

Step 3: Tolerance class generation

Page 26: Metody logiczne w analizie danych

Step 4: Clustering

The set of index terms Rk representing cluster Ck is constructed so that:
• each document di in Ck shares some or many terms with Rk,
• terms in Rk occur in most documents in Ck,
• terms in Rk need not be contained in every document in Ck.

The weight of a term tj in Rk is calculated as the average weight of all its occurrences in the documents of Ck:

$$w_j(R_k) = \frac{\sum_{d_i \in C_k} w_{ij}}{|\{d_i \in C_k \mid t_j \in d_i\}|}$$
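The averaging formula above can be sketched as follows. The sketch assumes each document in the cluster is given as a dictionary from the terms it contains to their weights wij, so a term's denominator is the number of dictionaries in which it appears; the function name and the toy cluster are hypothetical.

```python
def representative_weights(cluster_docs):
    """Average weight of each term over the cluster documents that contain it.

    cluster_docs: list of dicts (term -> weight), one per document in C_k.
    """
    totals, containing = {}, {}
    for doc_weights in cluster_docs:
        for term, weight in doc_weights.items():
            totals[term] = totals.get(term, 0.0) + weight
            containing[term] = containing.get(term, 0) + 1
    return {term: totals[term] / containing[term] for term in totals}


# Hypothetical cluster of two snippets.
cluster = [
    {"jaguar": 0.8, "cars": 0.6},
    {"jaguar": 0.4},
]
rep = representative_weights(cluster)
```

Note that "cars" keeps its full weight of 0.6 even though it occurs in only one of the two documents, because the average is taken over the documents containing the term rather than over the whole cluster.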

Page 28: Metody logiczne w analizie danych

Extended TRSM using thesaurus

The extended TRSM is an approximation space RC = (T ∪ C, Iθ,α, ν, P), where C is the above-mentioned set of concepts.
• for each concept ci ∈ C, the set Iθ,α(ci) contains the top α terms from the bag of terms of ci, calculated from the textual descriptions of the concepts.

• for each term ti ∈ T, the set Iθ,α(ti) = Iθ(ti) ∪ Cα(ti) consists of the tolerance class of ti from the standard TRSM and the set of concepts whose descriptions contain the term ti as one of the top α terms.

In the extended TRSM, the document di ∈ D is represented by

$$U_{R_C}(d_i) = U_R(d_i) \cup \{c_j \in C \mid \nu(I_{\theta,\alpha}(c_j), d_i) > 0\} = \bigcup_{t_j \in d_i} I_{\theta,\alpha}(t_j)$$
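The right-hand side of the formula above, the union of the extended tolerance classes of the document's terms, can be sketched as below. The sketch assumes the standard tolerance classes and the top-α term sets of the concepts have been computed beforehand; the function name, argument names, and the toy data are hypothetical.

```python
def extended_upper_approximation(doc_terms, term_classes, concept_terms):
    """Union of I_{theta,alpha}(t) over the terms t of one document.

    doc_terms:     set of terms occurring in the document
    term_classes:  term -> its tolerance class from the standard TRSM
    concept_terms: concept -> set of its top-alpha terms
    """
    representation = set()
    for t in doc_terms:
        # Standard part: the tolerance class of the term.
        representation |= term_classes.get(t, {t})
        # Extension: concepts that list the term among their top-alpha terms.
        representation |= {c for c, top in concept_terms.items() if t in top}
    return representation


# Hypothetical tolerance classes and concept descriptions.
term_classes = {
    "pain": {"pain", "ache"},
    "back": {"back", "spine"},
    "trial": {"trial"},
}
concept_terms = {
    "Low Back Pain": {"back", "pain", "lumbar"},
    "Cohort Studies": {"cohort", "trial"},
}
representation = extended_upper_approximation({"pain", "back"}, term_classes, concept_terms)
```

The result mixes terms and concept identifiers in one set, mirroring the approximation space defined over T ∪ C.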

Page 29: Metody logiczne w analizie danych

[Figure: documents d1, d2, d3, terms t1, ..., t6, and concepts c1, c2.]

Challenge: how to define the weighting scheme?

Page 30: Metody logiczne w analizie danych

Example: Explicit Semantic Analysis

Page 31: Metody logiczne w analizie danych

Semantic indexing of Medical documents

Page 32: Metody logiczne w analizie danych

Semantic indexing of Medical documents

Top 20 concepts: “Low Back Pain”, “Pain Clinics”, “Pain Perception”, “Treatment Outcome”, “Sick Leave”, “Outcome Assessment (Health Care)”, “Controlled Clinical Trials as Topic”, “Controlled Clinical Trial”, “Lost to Follow-Up”, “Rehabilitation, Vocational”, “Pain Measurement”, “Pain, Intractable”, “Cohort Studies”, “Randomized Controlled Trials as Topic”, “Neck Pain”, “Sickness Impact Profile”, “Chronic Disease”, “Comparative Effectiveness Research”, “Pain, Postoperative”, ...

Page 33: Metody logiczne w analizie danych

Experiment results

• Ontology: Medical Subject Headings (MeSH)

• Data set: PubMed Central
• Expert tags: documents in PubMed Central are tagged by human experts using headings and (optionally) accompanying subheadings (qualifiers).

• A single document is typically tagged with 10 to 18 heading-subheading pairs.

• Quality measure: Rand Index
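For reference, the Rand Index compares two partitions of the same objects by counting the pairs on which they agree (both put the pair together, or both keep it apart). The sketch below assumes each partition is given as a flat list of labels, one per document; how the expert tags are turned into such a partition is not described in the slides, so the toy labels are purely hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two labelings agree.

    Assumes both lists describe the same objects, in the same order,
    and contain at least two objects.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical example: a clustering versus a reference partition.
clustering = [0, 0, 1, 1]
reference = [0, 0, 1, 2]
score = rand_index(clustering, reference)
```

A value of 1 means the two partitions are identical up to renaming of the labels.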

Page 35: Metody logiczne w analizie danych

SONCA: Search based on ONtologies and Compound Analytics

• A platform developed at the Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw.

• It is an interface for intelligent algorithms that identify relations between various types of objects.

• It extends the typical functionality of scientific search engines with more accurate identification of relevant documents and more advanced synthesis of information.

• It combines concurrent processing of documents with the ability to produce collections of new objects using queries specific to analytic database technologies.

Page 36: Metody logiczne w analizie danych

The challenging query

“Provide a summary of the current state of knowledge about X”

where X = a condition, a chemical compound, or a patient’s history, etc.

• The subtasks:
  • identify the concept X,
  • find the most related pieces of information,
  • construct a summary that could be returned in a variety of formats, including figures, tables, links to original sources, and so on.

• Other useful queries may look like:
  • “Give me a definition of X”;
  • “What are the most related research problems?”;
  • “When was it first used in the literature?”;
  • “Which academic units work on a given problem?”;
  • “Which workshops and conferences discuss the topic?”.

The ability to answer such queries would doubtlessly benefit the community.

Page 41: Metody logiczne w analizie danych
Page 42: Metody logiczne w analizie danych

SONCA functionalities

• Source unification:
  • document collections, e.g. PubMed,

• Semantic indexing using:
  • Wikipedia,
  • ontological domain knowledge, e.g.
    • MeSH (Medical Subject Headings),
    • MSC (Mathematics Subject Classification),
    • OECD science classification, etc.

• Multilingual processing:
  • English,
  • Polish, and
  • extendable to other languages;

• Online clustering of search results with respect to:
  • documents,
  • authors,
  • other types of objects.

Page 44: Metody logiczne w analizie danych

Conclusions

• This paper is part of an ongoing project called SONCA (Search based on ONtologies and Compound Analytics).

• The methods proposed so far are quite efficient.
• Our preliminary experiments lead to several promising conclusions. The future plans are briefly outlined as follows:

  • extend the experiments with different semantic indexing methods that are currently being designed for the SONCA system,

  • analyse the label quality of clusters resulting from different document representations,

  • conduct experiments using other extensions (e.g. citations along with their context; information about authors, institutions, fields of knowledge or time),

  • visualize the clustering results.

Page 47: Metody logiczne w analizie danych

THANK YOU

FOR YOUR KIND ATTENTION