Image Sources [1], [2]
DocClustering Extension
26.01.2015
Hauptseminar (advanced seminar) Information Retrieval
Ramin Safarpour, Muhammad El-Hindi
Motivation
Concept & Tools
Challenges
Demo
Future Work
Labeling a document is easy…
How about MANY?
Image Sources [1], [3]
Clustering: reveals inherent structure
Another piece to the puzzle: MediaWiki
Concept & Tools
MediaWiki architecture
- Core (content processing, users, caching, DB); extension point: Hooks
- API (WebAPI, client libraries); extension point: API
- Front End (user interface); extension point: Special Pages
Source: http://www.mediawiki.org/wiki/MediaWiki
IR model
Documents $D$ = {D1, D2, D3, D4, D5, …}
-> Pre-processing
-> Transformation into representation vectors $V$ = {V1, V2, V3, V4, V5, …}
-> Clustering into clusters $C_i \subseteq D$, e.g. {D1, D2}, {D3, D4}, {D5, …}, {D6, …}
DocClustering Extension: layers
- MediaWiki Layer: Core (Hooks), API (API), Front End (Special Pages)
- Extension Layer: Preprocessor, Feature-Creator, Cluster-Pages, Special-Pages (classes)
- Utils Layer: Php-NLP-Tools, DB-Connector
“NlpTools is a library for natural language processing written in php.” (php-nlp-tools.com)
- Tokenize
- Stem
- Stop-Words
- Document-Representation
- Clustering
- Analyze
Source: http://php-nlp-tools.com/
Loose Coupling
Pre-Processing Stack
- Special Characters / Tags
- Tokenization
- Normalization
- Stemming
- Stopwords
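The pre-processing stack can be sketched in a few lines of Python. This is only an illustration, not the extension's PHP code: the order of the steps, the tiny stop-word list, and the crude suffix stripper `naive_stem` are all assumptions made for this example (a real setup would rather use a proper stemmer such as Porter's).

```python
import re

# Hypothetical, tiny stop-word list; a real corpus needs a language-specific one.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}


def strip_markup(text):
    """Remove HTML-like tags, then turn special characters into spaces."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[\W_]+", " ", text)


def tokenize(text):
    """Split the cleaned text on whitespace."""
    return text.split()


def normalize(tokens):
    """Lower-case every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Very crude suffix stripping (assumption for this sketch, NOT Porter)."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the whole stack: markup, tokens, normalization, stop words, stems."""
    tokens = normalize(tokenize(strip_markup(text)))
    return [naive_stem(t) for t in remove_stopwords(tokens)]


print(preprocess("<b>The players</b> are training in Munich!"))
```

Keeping every step a separate function mirrors the stack idea: a step can be swapped or reordered without touching the others.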
Feature Creation

Term Occurrence
$occur_{w,d} = freq_{w,d}$
Term Frequency
$tf_{w,d} = \frac{freq_{w,d}}{\sum_{w' \in d} freq_{w',d}}$
TF-IDF
$tfidf_{w,d} = tf_{w,d} \cdot idf_w$
$idf_w = \log\frac{N}{n_w}$
($N$ = number of documents, $n_w$ = number of documents containing $w$)

With only doc1 in the corpus ($N = 1$), all TF-IDF values are 0, since $idf_w = \log(1/1) = 0$ for every word that occurs. With a second document:

Word        Occu(w, doc1)  Tf(w, doc1)  Tfidf(w, doc1)  Occu(w, doc2)  Tf(w, doc2)  Tfidf(w, doc2)
Player      5              5/9          5/9*log(2)      0              0/10         0
Munich      3              3/9          0               5              5/10         0
Train       1              1/9          0               2              2/10         0
Parliament  0              0            0               3              3/10         3/10*log(2)
∑           9              1                            10             1

Document 3: ? …
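The table values can be checked with a short Python sketch. It assumes the usual reading of the symbols ($N$ = number of documents, $n_w$ = number of documents containing $w$) and a natural logarithm; the slide does not state the log base, and the function names are made up for this example.

```python
import math


def term_frequency(counts):
    """tf = occurrences of a word divided by the total occurrences in the document."""
    total = sum(counts.values())
    return {w: (c / total if total else 0.0) for w, c in counts.items()}


def inverse_document_frequency(corpus):
    """idf = log(N / n_w); words that occur in no document get 0 by convention."""
    n_docs = len(corpus)
    vocabulary = set().union(*corpus)
    idf = {}
    for w in vocabulary:
        n_w = sum(1 for counts in corpus if counts.get(w, 0) > 0)
        idf[w] = math.log(n_docs / n_w) if n_w else 0.0
    return idf


def tfidf(corpus):
    """Return one {word: tf * idf} dictionary per document."""
    idf = inverse_document_frequency(corpus)
    return [
        {w: tf * idf[w] for w, tf in term_frequency(counts).items()}
        for counts in corpus
    ]


doc1 = {"player": 5, "munich": 3, "train": 1, "parliament": 0}
doc2 = {"player": 0, "munich": 5, "train": 2, "parliament": 3}

weights = tfidf([doc1, doc2])
print(weights[0]["player"])      # 5/9 * log(2)
print(weights[1]["parliament"])  # 3/10 * log(2)
```

Words that appear in both documents (Munich, Train) get an idf of log(2/2) = 0, which is why their TF-IDF columns are 0.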
Stored Features

Word        Occu(w, doc1)  Tf(w, doc1)  Tfidf(w, doc1)  Occu(w, doc2)  Tf(w, doc2)  Tfidf(w, doc2)
Player      5              5/9          ?               0              0/10         ?
Munich      3              3/9          ?               5              5/10         ?
Train       1              1/9          ?               2              2/10         ?
Parliament  0              0            ?               3              3/10         ?
∑           9              1                            10             1
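One way to read the "?" entries: occurrence and tf depend only on the document itself and can be stored, while tf-idf depends on the whole corpus and changes whenever a document is added. The sketch below illustrates that reading; `FeatureStore` is a hypothetical in-memory stand-in, not the extension's DB-Connector, and the natural logarithm is an assumption.

```python
import math
from collections import Counter


class FeatureStore:
    """Hypothetical store: keeps per-document counts, derives tf-idf on demand."""

    def __init__(self):
        self.occurrences = {}           # doc_id -> Counter of term occurrences
        self.doc_frequency = Counter()  # term -> number of documents containing it

    def add_document(self, doc_id, tokens):
        """Store the term counts of a document and update document frequencies."""
        if doc_id in self.occurrences:  # re-indexing: undo the old contribution
            self.doc_frequency.subtract(self.occurrences[doc_id].keys())
        counts = Counter(tokens)
        self.occurrences[doc_id] = counts
        self.doc_frequency.update(counts.keys())

    def tf(self, doc_id, term):
        """Stored feature: depends only on the document itself."""
        counts = self.occurrences[doc_id]
        total = sum(counts.values())
        return counts[term] / total if total else 0.0

    def tfidf(self, doc_id, term):
        """Derived feature: depends on the current state of the whole corpus."""
        n_w = self.doc_frequency[term]
        if n_w <= 0:
            return 0.0
        return self.tf(doc_id, term) * math.log(len(self.occurrences) / n_w)


store = FeatureStore()
store.add_document("doc1", ["player"] * 5 + ["munich"] * 3 + ["train"])
store.add_document("doc2", ["munich"] * 5 + ["train"] * 2 + ["parliament"] * 3)
before = store.tfidf("doc1", "player")

store.add_document("doc3", ["player", "ship"])  # made-up third document
after = store.tfidf("doc1", "player")
print(before, after)  # the value for doc1 changes although doc1 did not
```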
How many???
DBSCAN clustering (Density-Based Spatial Clustering of Applications with Noise)
Recipe
- $\varepsilon$ = size of search area
- $minPoints$ = min. #neighbors, e.g. 4
- Repeat for all unvisited points
Point types
- Core point
- Border point
- Noise point
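The recipe can be written down as a compact, unoptimized Python sketch. Two assumptions are made here: Euclidean distance is used, and a point counts as its own neighbor when compared against `min_points` (as in the original DBSCAN paper). The function names are chosen for this example only.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise points."""
    labels = [None] * len(points)  # None means "unvisited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may still become a border point later
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:  # j is a core point as well
                queue.extend(j_neighbors)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_points=4))
```

The number of clusters is not an input; it falls out of `eps` and `min_points`, which is what makes parameter selection the hard part.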
Challenges
- Clustering
- Transformation
- Preprocessing
Image Source [4]
Clustering – the wrong choice?
DBSCAN
- Pro: varying #clusters, handles noise
- Con: not accurate -> high-dim. data, varying density
Clique
- Pro: varying #clusters, handles noise
- Con: comp. expensive -> backtracking
K-Means
- Pro: field-tested, handles huge data
- Con: fixed #clusters, sensitive to noise
Parameter selection
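For comparison, a minimal k-means sketch (Lloyd's algorithm) in Python shows where the "fixed #clusters" limitation comes from: `k` has to be passed in up front. The random initialization with a fixed seed and Euclidean distance are assumptions of this example, not details taken from the project.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Assign each point to one of k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as start centroids
    assignment = None
    for _ in range(iterations):
        # Assignment step: every point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: nothing moved
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ran empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Every point is forced into some cluster, so an outlier pulls a centroid towards itself instead of being flagged as noise.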
Transformation – the right representation?
$tf_{w,d} = \frac{freq_{w,d}}{\sum_{w' \in d} freq_{w',d}}$ vs. $tf_{w,d} = \frac{freq_{w,d}}{\max_{w'}\{freq_{w',d}\}}$
$tfidf_{w,d} = occur_{w,d} \cdot idf_w$ vs. $tfidf_{w,d} = tf_{w,d} \cdot idf_w$
Slide values: 1/300, 1/30, 30, 1
It depends.
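The competing representations can be compared side by side in Python. The two documents below are invented; they follow one possible reading of the figures 1/300, 1/30, 30 and 1 (a word occurring 30 times in a 9000-word document versus once in a 30-word document), which is an assumption and not stated in the source.

```python
def tf_sum(counts):
    """Occurrences divided by the document length (sum of all occurrences)."""
    total = sum(counts.values())
    if not total:
        return {w: 0.0 for w in counts}
    return {w: c / total for w, c in counts.items()}


def tf_max(counts):
    """Occurrences divided by the count of the most frequent word."""
    peak = max(counts.values(), default=0)
    if not peak:
        return {w: 0.0 for w in counts}
    return {w: c / peak for w, c in counts.items()}


# Hypothetical documents: "other" stands in for all remaining words.
long_doc = {"munich": 30, "other": 8970}
short_doc = {"munich": 1, "other": 29}

for name, doc in (("long", long_doc), ("short", short_doc)):
    print(name, doc["munich"], tf_sum(doc)["munich"], tf_max(doc)["munich"])
```

By raw occurrence the long document looks more relevant for "munich" (30 vs. 1); by sum-normalized tf the short one does (1/30 vs. 1/300). Which ranking is wanted depends on the corpus.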
Preprocessing – the best corpus?
- Corpus size
- Language
- Semantics
- Named entities, e.g. “FC Bayern München”
- Dates, e.g. 1990, 1991, …
- POS – filter nouns
- Tags, e.g. [[Link]]
- Infoboxes
- Headings
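Several of these items are wikitext features that can be isolated with regular expressions before tokenization. The Python sketch below is a rough illustration under simplifying assumptions (well-formed `{{…}}` templates, `[[Target|label]]` links, four-digit years); it is not a full wikitext parser, and whether each element should be kept, dropped, or weighted is left open.

```python
import re


def extract_headings(wikitext):
    """Return the titles of lines such as '== History =='."""
    return re.findall(r"^=+\s*(.*?)\s*=+\s*$", wikitext, flags=re.MULTILINE)


def strip_templates(wikitext):
    """Drop {{...}} templates (e.g. infoboxes), innermost first."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)
    return wikitext


def unwrap_links(wikitext):
    """[[Target|label]] becomes 'label', [[Target]] becomes 'Target'."""
    return re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", wikitext)


def find_years(text):
    """List four-digit years between 1000 and 2099 found in the text."""
    return re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", text)


page = (
    "{{Infobox club|name=FC Bayern}}\n"
    "== History ==\n"
    "[[FC Bayern München|Bayern]] won in 1990 and 1991 in [[Munich]]."
)
print(extract_headings(page))
print(unwrap_links(strip_templates(page)))
print(find_years(page))
```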
DEMO
Future Work – On-The-Fly-Clustering
- Guesstimate (sampling)
- Use headings only
- Optimize code
- External framework (Spark)
Image Source [5]
Future Work – Sub-Clusters
Sports -> Team sports -> Soccer -> Indoor soccer -> Futsal
Image Source [6], "Photo: Lachlan Fearnley"
Future Work –
- Ball sports
- Soccer clubs
- City
- University
- Ship
Summary
MediaWiki Extension
- Hook: preprocess documents
- API: cluster
- SpecialPage: show
- Works partially
WikiClustering
- DBSCAN failed
- k-Means works -> needs improvement
- By all means not trivial!
Questions?
Image References / Attributions
• [1] PowerPoint ClipArt Search, http://insertmedia.office.microsoft.com/
• [2] http://commons.wikimedia.org/wiki/Category:High-resolution_official_Wikimedia_logos
• [3] http://pixabay.com/en/tag-price-yellow-blank-308409/
• [4] Oceancetaceen, Goldene Leiter des Forums in Duisburg, http://commons.wikimedia.org/wiki/File:Goldene_Leiter.JPG
• [5] http://mrg.bz/DUEfWy
• [6] L. Fearnley, Russian Dolls, http://commons.wikimedia.org/wiki/File:Russian_Dolls.jpg
Links and Literature
• http://www.mediawiki.org/wiki/MediaWiki
• http://php-nlp-tools.com/
• Pang-Ning Tan et al. (2005). Introduction to Data Mining, Chapter 8 “Cluster Analysis”. Addison-Wesley.
• Martin Ester et al. (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise". In: Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M. (eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231.
• Gabriel Valiente (2002). Algorithms on Trees and Graphs. Berlin / Heidelberg / New York: Springer-Verlag.