![Page 1: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/1.jpg)
Classification and clustering methods development and implementation for unstructured documents collections
by Osipova Nataly
St. Petersburg State University, Faculty of Applied Mathematics and Control Processes, Department of Programming Technology
![Page 2: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/2.jpg)
Contents
- Introduction
- Methods description
- Information Retrieval System
- Experiments
![Page 3: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/3.jpg)
Contextual Document Clustering
was developed in a joint project of the Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, and the Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.
![Page 4: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/4.jpg)
Definitions
- Document
- Terms dictionary
- Dictionary
- Cluster
- Word context
- Context or document conditional probability distribution
- Entropy
![Page 5: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/5.jpg)
Document conditional probability distribution
Document x

| y | word1 | word2 | word3 | … | wordn |
|---|---|---|---|---|---|
| tf(y) | 5 | 10 | 6 | … | 16 |
| p(y\|x) | 5/m | 10/m | 6/m | … | 16/m |

- y: words
- tf(y): frequency of y
- p(y|x): conditional probability of y in document x
- m: size of document x

(5/m, 10/m, 6/m, …, 16/m) is the document conditional probability distribution.
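The document conditional probability distribution can be computed directly from word counts. The following is a minimal sketch, assuming a document is already tokenized into a list of words; the function name `document_distribution` and the sample document are illustrative only.

```python
from collections import Counter


def document_distribution(words):
    """Return p(y|x) = tf(y) / m for every word y of document x."""
    tf = Counter(words)   # tf(y): frequency of each word y in the document
    m = len(words)        # m: document size, counted in words
    return {y: count / m for y, count in tf.items()}


# Illustrative document with the term frequencies 5, 10, 6 and 16.
doc = ["word1"] * 5 + ["word2"] * 10 + ["word3"] * 6 + ["wordn"] * 16
p = document_distribution(doc)
```

Here m = 37, so `p["word1"]` equals 5/37 and the probabilities sum to 1.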
![Page 6: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/6.jpg)
Word context
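Word w appears in Document x1, Document x2, …, Document xk.

Document x1

| y | word1 | word2 | … | wordn1 |
|---|---|---|---|---|
| tf(y) | 5 | 10 | … | 16 |
| p(y\|x1) | 5/m1 | 10/m1 | … | 16/m1 |

Document x2

| y | word1 | word3 | … | wordn2 |
|---|---|---|---|---|
| tf(y) | 7 | 12 | … | 4 |
| p(y\|x2) | 7/m2 | 12/m2 | … | 4/m2 |

…

Document xk

| y | word1 | word4 | … | wordnk |
|---|---|---|---|---|
| tf(y) | 20 | 9 | … | 3 |
| p(y\|xk) | 20/mk | 9/mk | … | 3/mk |

Context conditional probability distribution of word w

| y | word1 | word2 | word3 | … | wordnk |
|---|---|---|---|---|---|
| tf(y) | 5+7+20=32 | 10 | 12 | … | 3 |
| p(y\|w) | 32/m | 10/m | 12/m | … | 3/m |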
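A word context can be sketched by pooling the term frequencies of every document that contains the word. This is an illustration under two assumptions that the slide does not state explicitly: m is taken to be the total size of the documents containing w, and w itself is counted as part of its own context. The name `word_context` is hypothetical.

```python
from collections import Counter


def word_context(w, documents):
    """Return p(y|w): the pooled word distribution of all documents containing w."""
    tf = Counter()
    for words in documents:
        if w in words:        # only documents that contain w form its context
            tf.update(words)  # add this document's term frequencies
    m = sum(tf.values())      # assumed: m is the total size of those documents
    return {y: count / m for y, count in tf.items()}


documents = [
    ["a", "b", "b"],   # contains "a"
    ["a", "c"],        # contains "a"
    ["d", "d"],        # does not contain "a", so it is ignored
]
context = word_context("a", documents)
```

In this toy collection the pooled counts are a: 2, b: 2, c: 1, so m = 5.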
![Page 7: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/7.jpg)
Contents
- Introduction
- Methods description
- Information Retrieval System
- Experiments
![Page 8: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/8.jpg)
Methods
- document clustering method
- dictionary build methods
- document classification method using training set

Information retrieval methods:
- keyword search method
- cluster based search method
- similar documents search method
![Page 9: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/9.jpg)
Contextual Documents Clustering
Documents → Dictionary → Narrow context words → Distances calculation → Clusters
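The last two stages of this pipeline can be sketched as a nearest-context assignment. This is only an illustration, assuming that each document joins the narrow context closest to it under the Jensen-Shannon divergence defined on the following slides; the function names and the toy distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (base 2) of a distribution stored as {word: probability}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p1, p2):
    """Jensen-Shannon divergence with equal weights 0.5 and 0.5."""
    keys = set(p1) | set(p2)
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in keys}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)


def cluster_documents(doc_dists, narrow_contexts):
    """Assign every document to the narrow context with the smallest distance."""
    clusters = {w: [] for w in narrow_contexts}
    for doc_id, p_doc in doc_dists.items():
        nearest = min(narrow_contexts,
                      key=lambda w: js_divergence(narrow_contexts[w], p_doc))
        clusters[nearest].append(doc_id)
    return clusters


doc_dists = {
    "d1": {"cat": 0.5, "dog": 0.5},
    "d2": {"tax": 0.6, "law": 0.4},
}
narrow_contexts = {
    "pet": {"cat": 0.4, "dog": 0.4, "vet": 0.2},
    "court": {"law": 0.5, "tax": 0.3, "judge": 0.2},
}
clusters = cluster_documents(doc_dists, narrow_contexts)
```

Each cluster is named after the narrow context word that attracts its documents.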
![Page 10: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/10.jpg)
Entropy
$$H(y) = H(p_1, p_2, \dots, p_n) = -\sum_{i=1}^{n} p_i \log(p_i)$$

Here (p1, p2, …, pn) is the context conditional probability distribution of y, with p1+p2+…+pn=1.

Entropy is an uncertainty measure; here it is used to characterize the commonness (narrowness) of the word context.
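The entropy of a context is straightforward to compute. The sketch below assumes a base-2 logarithm (the slide does not fix the base here) and treats zero probabilities as contributing nothing; the two example distributions are made up to contrast a narrow context with a broad one.

```python
import math


def entropy(p, base=2):
    """Shannon entropy H(p) = -sum(p_i * log(p_i)); zero terms contribute 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)


narrow = [0.97, 0.01, 0.01, 0.01]   # mass concentrated on one word: low entropy
broad = [0.25, 0.25, 0.25, 0.25]    # mass spread evenly: high entropy

h_narrow = entropy(narrow)
h_broad = entropy(broad)
```

A lower value indicates a narrower context, which is the property used to pick narrow context words.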
![Page 11: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/11.jpg)
Contextual Document Clustering
max H(y) = H([figure: probability distribution])
![Page 12: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/12.jpg)
Entropy
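$$H([p_1, p_2]) = -(p_1 \log_2 p_1 + p_2 \log_2 p_2)$$

The maximum, $\log_2(2) = 1$, is reached at $p_1 = p_2$.

[Figure: entropy H plotted against α, with axis marks at 0, 0.5, and 1, and three example distributions]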
![Page 13: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/13.jpg)
Word Context - Document Distance
- $p_1$: y context conditional probability distribution
- $p_2$: document x conditional probability distribution
- $\bar{p} = \tfrac{1}{2} p_1 + \tfrac{1}{2} p_2$: average conditional probability distribution
![Page 14: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/14.jpg)
Word Context - Document Distance
$$JS[p_1, p_2] = H\left(\tfrac{1}{2} p_1 + \tfrac{1}{2} p_2\right) - 0.5\,H(p_1) - 0.5\,H(p_2)$$
![Page 15: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/15.jpg)
Jensen-Shannon divergence
$$JS_{\{1/2,\,1/2\}}[p_1, p_2] \ge 0$$

$$JS_{\{1/2,\,1/2\}}[p_1, p_2] = 0 \iff p_1 = p_2$$
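The word context to document distance can be sketched in a few lines. This is an illustration, assuming distributions are stored as dictionaries from words to probabilities and that entropy uses a base-2 logarithm; the function names are not taken from the slides.

```python
import math


def entropy(p):
    """Shannon entropy (base 2) of a distribution stored as {word: probability}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p1, p2):
    """JS[p1, p2] = H(0.5*p1 + 0.5*p2) - 0.5*H(p1) - 0.5*H(p2)."""
    keys = set(p1) | set(p2)
    # Average distribution; a word missing from one side has probability 0 there.
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in keys}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)


context = {"law": 0.5, "court": 0.3, "judge": 0.2}
document = {"law": 0.6, "court": 0.4}
distance = js_divergence(context, document)
```

With a base-2 logarithm the value lies between 0 (identical distributions) and 1 (distributions with no words in common).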
![Page 16: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/16.jpg)
Dictionary construction
Why:
- big volumes: 60,000 documents, 50,000 words => 15,000 words in a context
- importance of narrow context words
![Page 17: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/17.jpg)
Dictionary construction
Delete words with
1. High or low frequency
2. High or low document frequency
3. Both 1. and 2.
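The third variant, filtering on both frequencies, can be sketched as follows. The default thresholds reuse the values 5, 1000, 2 and 1000 that appear on the experiments slide; treating them as inclusive bounds is an assumption, and `build_dictionary` is a hypothetical name.

```python
from collections import Counter


def build_dictionary(documents, min_tf=5, max_tf=1000, min_df=2, max_df=1000):
    """Keep words whose term and document frequencies fall inside the bounds."""
    tf = Counter()   # tf(y): total occurrences of y in the collection
    df = Counter()   # df(y): number of documents that contain y
    for words in documents:
        tf.update(words)
        df.update(set(words))   # count each word once per document
    return {
        y for y in tf
        if min_tf <= tf[y] <= max_tf and min_df <= df[y] <= max_df
    }


documents = [
    ["a", "a", "a", "b"],
    ["a", "a", "b"],
    ["c"] * 6,
]
dictionary = build_dictionary(documents)
```

In this toy collection only "a" survives: "b" occurs too rarely overall, and "c" appears in a single document.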
![Page 18: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/18.jpg)
Retrieval algorithms
- keyword search method
- cluster based search method
- search by example method
![Page 19: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/19.jpg)
Keyword search method
- Document 1: word 1, word 2, word 3, …, word n1
- Document 2: word 10, word 25, word 30, …, word n2
- Document 3: word 15, word 2, word 32, …, word n3
- Document 4: word 11, word 21, word 3, …, word n4

Request: word 2
Result set: document 1, document 3
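The keyword search example can be reproduced with an inverted index. This is a minimal in-memory sketch; the slides describe an implementation on top of database tables, so the data structure and function names here are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map every word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index


def keyword_search(index, word):
    """Return the sorted list of documents containing the requested word."""
    return sorted(index.get(word, set()))


documents = {
    "document 1": ["word 1", "word 2", "word 3"],
    "document 2": ["word 10", "word 25", "word 30"],
    "document 3": ["word 15", "word 2", "word 32"],
    "document 4": ["word 11", "word 21", "word 3"],
}
index = build_index(documents)
result = keyword_search(index, "word 2")
print(result)  # ['document 1', 'document 3']
```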
![Page 20: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/20.jpg)
Cluster based search method
Each cluster holds documents and is described by its cluster context words:
- Cluster 1: word 1, word 2, …, word n1
- Cluster 2: word 12, word 26, …, word n2
- Cluster 3: word 1, word 23, …, word n3

Request: word 1
Result set: Cluster 1, Cluster 3
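Cluster based search matches the request against cluster context words rather than against document text. The sketch below assumes two plain dictionaries, one with the context words of each cluster and one with its documents; the document names are invented for the example.

```python
def cluster_search(cluster_words, cluster_docs, word):
    """Return the clusters whose context words include the request, with their documents."""
    hits = [c for c, words in cluster_words.items() if word in words]
    return {c: cluster_docs.get(c, []) for c in hits}


cluster_words = {
    "Cluster 1": {"word 1", "word 2"},
    "Cluster 2": {"word 12", "word 26"},
    "Cluster 3": {"word 1", "word 23"},
}
# Hypothetical cluster membership, for illustration only.
cluster_docs = {
    "Cluster 1": ["doc A", "doc B"],
    "Cluster 2": ["doc C"],
    "Cluster 3": ["doc D"],
}
found = cluster_search(cluster_words, cluster_docs, "word 1")
```

The result contains Cluster 1 and Cluster 3 together with the documents assigned to them.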
![Page 21: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/21.jpg)
Similar documents search
[Figure: a cluster, labelled with its cluster name, whose documents 1 to 7 are linked in a minimal spanning tree]

Request: document 3
Result set: document 6, document 7
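One way to sketch similar documents search is to build the minimal spanning tree of a cluster with Prim's algorithm and return the tree neighbours of the requested document. Both choices are assumptions: the slides do not say which algorithm builds the tree or exactly how the result set is read from it, and the distances below are invented.

```python
def minimal_spanning_tree(nodes, dist):
    """Prim's algorithm; dist holds one distance per unordered pair of nodes."""
    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Cheapest edge that connects the tree to a node outside it.
        a, b = min(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda e: d(*e),
        )
        edges.append((a, b))
        in_tree.add(b)
    return edges


def similar_documents(edges, doc):
    """Assumed rule: similar documents are the neighbours of doc in the tree."""
    return sorted({b if a == doc else a for a, b in edges if doc in (a, b)})


nodes = ["d1", "d2", "d3", "d4"]
dist = {
    ("d1", "d2"): 0.1, ("d1", "d3"): 0.9, ("d1", "d4"): 0.8,
    ("d2", "d3"): 0.2, ("d2", "d4"): 0.7, ("d3", "d4"): 0.3,
}
tree = minimal_spanning_tree(nodes, dist)
similar = similar_documents(tree, "d3")
```

For these distances the tree is the chain d1, d2, d3, d4, so a request for d3 returns d2 and d4.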
![Page 22: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/22.jpg)
Document classification: method 1
- Inputs: clusters, list of topics, training set, test documents
- Topics contexts
- Distances between topics contexts and clusters contexts
- Classification result: cluster 1 – topic 10, cluster 2 – topic 3, …, cluster n – topic 30
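Method 1 can be sketched as follows: pool the words of the training documents of each topic into a topic context, then give each cluster the topic whose context is nearest. Using the Jensen-Shannon divergence as the distance and pooling raw word counts are assumptions; all names and data are illustrative.

```python
import math
from collections import Counter


def entropy(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p1, p2):
    keys = set(p1) | set(p2)
    avg = {y: 0.5 * p1.get(y, 0.0) + 0.5 * p2.get(y, 0.0) for y in keys}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)


def topic_contexts(training_set):
    """training_set: list of (topic, words). Pool the word counts of each topic."""
    pooled = {}
    for topic, words in training_set:
        pooled.setdefault(topic, Counter()).update(words)
    contexts = {}
    for topic, counts in pooled.items():
        total = sum(counts.values())
        contexts[topic] = {y: c / total for y, c in counts.items()}
    return contexts


def label_clusters(cluster_contexts, topics):
    """Assign each cluster the topic with the smallest distance to its context."""
    return {
        cluster: min(topics, key=lambda t: js_divergence(p, topics[t]))
        for cluster, p in cluster_contexts.items()
    }


training_set = [("topic 3", ["tax", "law", "law"]), ("topic 10", ["cat", "dog"])]
cluster_contexts = {
    "cluster 1": {"cat": 0.5, "dog": 0.5},
    "cluster 2": {"law": 0.7, "tax": 0.3},
}
labels = label_clusters(cluster_contexts, topic_contexts(training_set))
```

A test document then inherits the topic of the cluster it belongs to.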
![Page 23: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/23.jpg)
Document classification: method 2
- Inputs: clusters, topics list, training set, test documents, all documents set
- Classification result: cluster 1 – topic 10, cluster 2 – topic 3, …, cluster n – topic 30
![Page 24: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/24.jpg)
Contents
- Introduction
- Methods description
- Information Retrieval System
- Experiments
![Page 25: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/25.jpg)
Information Retrieval System
- Architecture
- Features
- Use
![Page 26: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/26.jpg)
Information Retrieval System architecture.
- database server
- client
![Page 27: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/27.jpg)
IRS architecture
- Database
- Database server: MS SQL Server 2000
- Local area network
- “Thick” client: C#
![Page 28: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/28.jpg)
IRS architecture
DBMS MS SQL Server 2000:
- High-performance
- Scalable
- Secure
- Handles huge volumes of data
- T-SQL
- Stored procedures
![Page 29: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/29.jpg)
IRS features
In the IRS the following problems are solved:
- document clustering
- keyword search method
- cluster based search method
- similar documents search method
- document classification with the use of a training set
![Page 30: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/30.jpg)
DB structure
The database of the IRS consists of the following tables:
- documents
- all words dictionary
- dictionary
- table of relations between documents and words: document-word
- words contexts
- words with narrow contexts
- clusters
- intermediate tables used to build the main tables and to implement retrieval
![Page 31: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/31.jpg)
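Algorithms implementation

[Diagram: database tables and the search methods built on them]
- Elements: Documents, All words dictionary, Dictionary, Table “document-word”, Words contexts, Words with narrow contexts, Clusters, Centroid
- Search methods: Keyword search, Cluster based search, Similar documents search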
![Page 32: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/32.jpg)
Similar documents search

[Figure: documents 1 to 5 of a cluster joined by edges weighted with pairwise distances: 0.16285, 0.98154, 0.57231, 0.23851, 0.26967, 0.211, 0.8731, 0.7231, 0.1011]
![Page 33: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/33.jpg)
Minimal Spanning Tree
[Figure: a cluster, labelled with its cluster name, whose documents 1 to 5 are linked in a minimal spanning tree]
![Page 34: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/34.jpg)
Similar documents search
Clusters table, Tree table, Distances table → Similar documents search
![Page 35: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/35.jpg)
IRS use
![Page 36: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/36.jpg)
IRS use
![Page 37: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/37.jpg)
IRS use
![Page 38: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/38.jpg)
IRS use
![Page 39: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/39.jpg)
IRS use
![Page 40: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/40.jpg)
IRS use
![Page 41: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/41.jpg)
Contents
- Introduction
- Methods description
- Information Retrieval System
- Experiments
![Page 42: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/42.jpg)
Experiments
Test goals were:
- algorithm accuracy test
- comparison of different classification methods
- algorithm efficiency evaluation
![Page 43: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/43.jpg)
Experiments
- 60,000 documents
- 100 topics
- Training set volume = 5% of the collection size
![Page 44: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/44.jpg)
Experiments
- df(y): thresholds 2 and 1000
- tf(y): thresholds 5 and 1000
![Page 45: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/45.jpg)
Result analysis
- Russian Information Retrieval Evaluation Seminar
- Measures such as macro-averaged recall, precision, and F-measure were calculated.
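Macro-averaged measures are computed per category and then averaged with equal weight for each category. The sketch below assumes one true and one predicted category per document and takes the macro F-measure as the mean of per-category F1 values; the evaluation seminar may define the details differently.

```python
def macro_scores(true_labels, predicted_labels):
    """Return macro-averaged (recall, precision, F-measure)."""
    pairs = list(zip(true_labels, predicted_labels))
    categories = sorted(set(true_labels) | set(predicted_labels))
    recalls, precisions, f_scores = [], [], []
    for c in categories:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        recalls.append(recall)
        precisions.append(precision)
        f_scores.append(f)
    n = len(categories)
    return sum(recalls) / n, sum(precisions) / n, sum(f_scores) / n


true_labels = ["a", "a", "b", "b"]
predicted_labels = ["a", "b", "b", "b"]
recall, precision, f_measure = macro_scores(true_labels, predicted_labels)
```

For this toy input category "a" has recall 0.5 and precision 1, category "b" has recall 1 and precision 2/3, so the macro-averaged recall is 0.75.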
![Page 46: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/46.jpg)
Recall
[Figure: bar chart of recall by system (textan and the other systems, labelled xxxx); vertical axis from 0 to 0.6, horizontal axis: Systems]
![Page 47: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/47.jpg)
Precision
[Figure: bar chart of precision by system (textan and the other systems, labelled xxxx); vertical axis from 0 to 0.7, horizontal axis: Systems]
![Page 48: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/48.jpg)
F-measure
[Figure: bar chart of F-measure by system (textan and the other systems, labelled xxxx); vertical axis from 0 to 0.35, horizontal axis: Systems]
![Page 49: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/49.jpg)
Result analysis
List of some topics into which test documents were classified:

| № | Category |
|---|---|
| 1 | Family law |
| 2 | Inheritance law |
| 3 | Water industry |
| 4 | Catering |
| 5 | Consumer services for the population |
| 6 | Truck rental |
| 7 | International space law |
| 8 | Territory in international law |
| 9 | Participants in foreign economic relations |
| 10 | Foreign economic transactions |
| 11 | Economic free trade zones. Customs unions. |
![Page 50: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/50.jpg)
Result analysis
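Recall results for every category. The best result for each category is shown in bold. All results are given in percent.

| СV | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| textan | 33 | 34 | 35 | **60** | 46 | 26 | 27 | **98** | **75** | 25 | **100** |
| xxxx | 1 | 0 | 0.2 | 3 | 4 | 0 | 0.9 | 0 | 3 | 0 | 2 |
| xxxx | 0 | 0 | 4.3 | 2.3 | 0 | 5 | 0.9 | 8 | 3 | 0 | 0.8 |
| xxxx | **55** | **86** | **75** | 19 | **59** | **51** | **80** | 0 | 41 | **82** | 0 |
| xxxx | 21 | 39 | 2 | 22 | 15 | 6 | 0 | 1.4 | 0 | 5 | 0 |
| xxxx | 40 | 43 | 16 | 11 | 25 | 23 | 10 | 1.4 | 1.2 | 5 | 0 |
| xxxx | 23 | 4 | 2.5 | 1.1 | 18 | 7 | 0.9 | 0 | 1.2 | 10 | 0 |
| xxxx | 2.7 | 0 | 0 | 0 | 1.5 | 0 | 0 | 0 | 0 | 0 | 0 |
| xxxx | 2.2 | 0 | 0 | 0 | 1.5 | 0 | 0 | 0 | 0 | 0 | 0 |
| xxxx | 37 | 21 | 12 | 22 | 18 | 27 | 51 | 0 | 0 | 0 | 0 |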
![Page 51: Classification and clustering methods development and implementation for unstructured documents collections](https://reader035.vdocuments.mx/reader035/viewer/2022062411/56816820550346895dddb058/html5/thumbnails/51.jpg)
Thank you for your attention!