Text Mining, Word Embeddings, & Wikipedia
Muhammad Atif Qureshi
Contents
● Introduction
● Text Mining
  – Similar words
  – Word ambiguity
● Word Embedding
  – Related Research
  – Toy Example
● Wikipedia
  – Structure
  – Phrase Chunking
  – Case studies
Problem
● Motivation
  – Human beings have long found comfort in expressing their viewpoints in writing, because writing preserves thoughts far longer than oral communication.
  – Textual data is a very popular means of communication over the World Wide Web, in the form of online news websites, social networks, emails, governmental websites, etc.
● Observation
  Text may contain the following complexities:
  – Lack of contextual and background information
  – Ambiguity due to more than one possible interpretation of the meaning of the text
  – Focus and assertions on multiple topics
Text Mining
● Motivation
With so much textual data around us, especially on the World Wide Web, there is a strong motivation to understand what the data means
● Definition
Text mining is the process of analyzing textual data to derive high-quality information on the basis of patterns
Similar Words
● Can similar words be grouped together as one?
  – Simple techniques (sketched below)
    ● Lemmatization (mapping plurals to singulars; accurate but low coverage)
    ● Stemming (mapping a word to a root form; inaccurate but high coverage)
  – Complex technique
    ● “A word is known by the company it keeps” → Word Embeddings
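A minimal sketch of the two simple techniques, using NLTK (the library choice is an assumption; the slides name no tool):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # needs the WordNet corpus: nltk.download('wordnet')

for word in ["studies", "wolves", "phones"]:
    # Stemming chops suffixes ("studies" -> "studi"): crude but high coverage.
    # Lemmatization maps to dictionary forms ("studies" -> "study"): accurate,
    # but only for words WordNet covers.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
```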
Word Ambiguity
● Is Apple a company or a fruit?
  – “Apple tastes better than blackberry”
  – “Apple phones are better than blackberry”
● Context is important
  – tastes → fruit
  – phones → Apple Inc.
Word Embedding
● Definition
  – A technique in NLP that quantifies a concept (word or phrase) as a vector of real numbers.
● Simple application scenario (sketched below)
  – How similar are two words?
  – Similarity(vector(good), vector(best))
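A minimal sketch of the similarity call above; the 3-d vectors are made-up stand-ins for real embeddings of “good” and “best”:

```python
import numpy as np

def similarity(u, v):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vector_good = np.array([0.8, 0.1, 0.3])  # hypothetical embedding of "good"
vector_best = np.array([0.7, 0.2, 0.4])  # hypothetical embedding of "best"
print(similarity(vector_good, vector_best))  # close to 1.0 for related words
```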
Related Research
● Word embeddings
  – Word2Vec
    ● A predictive model that uses a two-layer neural network (see the training sketch below)
  – FastText
    ● An extension of word2vec by Facebook
  – GloVe
    ● A count-based model that performs dimensionality reduction on the co-occurrence matrix
● Wikipedia-based Relatedness
  – Semantic Relatedness Framework
    ● Uses the Wikipedia sub-category hierarchy to measure relatedness
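A minimal sketch of training Word2Vec with gensim (the library is an assumption; the slides name the models, not an implementation). A real run needs a much larger corpus than these two sentences:

```python
from gensim.models import Word2Vec

sentences = [["apple", "phones", "are", "better", "than", "blackberry"],
             ["apple", "tastes", "better", "than", "blackberry"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("apple", topn=2))  # nearest words in the space
```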
Toy Example → Word Embeddings
● Train a co-occurrence matrix
● Apply cosine similarity
● Find vectors (see the toy sketch below)
● Further concepts
  – Dimensionality reduction
  – Window size
  – Filter words
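A toy sketch of these steps: build a word–word co-occurrence matrix from a tiny corpus, treat each row as a word vector, and compare rows with cosine similarity (the two-sentence corpus is illustrative):

```python
import numpy as np

corpus = ["apple phones are better than blackberry",
          "apple tastes better than blackberry"]
window = 2  # how many neighbours on each side count as "company"

vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))

for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                M[index[w], index[words[j]]] += 1  # count each co-occurrence

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words appearing in similar contexts get similar rows, hence high similarity.
print(cosine(M[index["apple"]], M[index["blackberry"]]))
```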
Word Analogies
● Man is to Woman as King is to ____ ?
● London is to England as Islamabad is to ____ ?
● Using vectors, we can say (see the sketch below)
  – King – Man + Woman → Queen
  – Islamabad – London + England → Pakistan
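A minimal sketch of the analogy arithmetic using pretrained GloVe vectors via gensim's downloader (the model choice is an assumption; the first call downloads the vectors):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# islamabad - london + england ≈ pakistan, given good corpus coverage
print(vectors.most_similar(positive=["islamabad", "england"], negative=["london"], topn=1))
```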
Why Wikipedia for Text Mining?
● One of the largest encyclopedias
● Free to use
● Collaboratively and actively updated
Wikipedia
● Each article has a title that identifies a concept.
● Each article contains content that defines that concept textually.
● Each article is listed under different categories.
  – E.g., the article ‘Espresso’ is listed under ‘Coffee drinks’, ‘Italian cuisine’, etc.
● Each Wikipedia category generally has parent and child categories (a minimal representation is sketched below).
  – E.g., ‘Italian cuisine’ has parent categories ‘Italian culture’, ‘Cuisine by nationality’, etc.
  – E.g., ‘Italian cuisine’ has child categories ‘Italian desserts’, ‘Pizza’, etc.
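A minimal sketch of the structures just described, using plain dicts (the representation is an assumption; the names come from the slide's examples):

```python
article_categories = {
    "Espresso": ["Coffee drinks", "Italian cuisine"],
}
category_parents = {
    "Italian cuisine": ["Italian culture", "Cuisine by nationality"],
}
category_children = {
    "Italian cuisine": ["Italian desserts", "Pizza"],
}

# Walking up from an article yields increasingly general categories.
for category in article_categories["Espresso"]:
    print(category, "-> parents:", category_parents.get(category, []))
```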
Wikipedia Graph Structure

[Figure: Wikipedia category graph structure along with Wikipedia articles — category nodes (C1–C10) joined by category edges, and article nodes (A1–A4) that belong to categories and link to one another.]
Example of Wikipedia Category Structure

[Figure: Truncated Wikipedia category graph containing the categories academic_disciplines, science, interdisciplinary_fields, scientific_disciplines, behavioural_sciences, society, social_sciences, science_studies, information_technology, information, sociology, and information_science.]
Phrase Chunking using Wikipedia
The pipeline runs on the input “I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good.” (a code sketch follows):

● Conversion into lowercase: “i prefer samsung s5 over htc, apple, nokia because it is economical and good.”
● Phrase chunking using phrase boundaries: “i prefer samsung s5 over htc apple nokia because it is economical and good”
● Keep the longest phrase that matches a Wikipedia article title or redirect (and is not a stopword)
● Removed stopwords: i, prefer, over, because, it, is, and
● Extracted phrases: samsung s5, htc, apple, nokia, economical, good
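A minimal sketch of the longest-match chunking above; the title set and stopword list are illustrative stand-ins for the real Wikipedia titles/redirects and stopword list:

```python
wiki_titles = {"samsung s5", "htc", "apple", "nokia", "economical", "good"}
stopwords = {"i", "prefer", "over", "because", "it", "is", "and"}
MAX_PHRASE_LEN = 3  # longest phrase (in words) to try

def chunk(text):
    words = text.lower().replace(",", "").rstrip(".").split()
    phrases, i = [], 0
    while i < len(words):
        # Try the longest candidate first, then shrink to a single word.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            cand = " ".join(words[i:i + n])
            if cand in wiki_titles and cand not in stopwords:
                phrases.append(cand)
                i += n
                break
        else:
            i += 1  # stopword or unknown word: skip it
    return phrases

print(chunk("I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good."))
# -> ['samsung s5', 'htc', 'apple', 'nokia', 'economical', 'good']
```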
Word Embedding using Wikipedia
● We can find more complex relationships due to
  – the article–category graph structure
  – multi-lingual relations
  – infobox attributes (birth, age, etc.)
Wikipedia Based Semantic Relatedness Framework

[Figure: A stream of text is phrase-chunked against Wikipedia article titles and redirects (drawn from the Wikipedia documents) to yield candidate phrases; a relatedness calculator scores these phrases against the Wikipedia category–article structure, and the resulting relatedness scores feed applications such as online reputation management tasks and a perspective aware search engine.]
Perspective Aware Approach to Search
● Problem: The result set a search engine (Google, Bing, Yahoo) returns for a user's query may carry an inherent perspective, owing to issues with the search engine or with the underlying collection.
● PAS is a system that allows users to specify a perspective together with their query at query time.
● The system allows users to quickly gauge the presence of that perspective in the returned set.
● Perspective is modelled using the Wikipedia article–category graph structure (a scoring sketch follows)
  – Perspective: activism
  – The system fetches Wikipedia articles defining activism by traversing the category graph structure
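A hedged sketch of the idea, not the authors' implementation: assume the texts of Wikipedia articles under the ‘activism’ category have already been fetched (the two strings below are illustrative stand-ins), then score each search result by its overlap with the perspective vocabulary:

```python
perspective_articles = [
    "activism consists of efforts to promote social or political change",
    "a protest is a public expression of objection or dissent",
]
perspective_vocab = {w for doc in perspective_articles for w in doc.split()}

def perspective_score(result_text):
    # Fraction of a result's words that occur in the perspective vocabulary.
    words = set(result_text.lower().split())
    return len(words & perspective_vocab) / max(len(words), 1)

print(perspective_score("Protesters demand political change in the capital"))
```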
Keyword Extraction via Identification of Domain-Specific Keywords
● Problem: Given a collection of document titles from different school websites, extract domain-specific keywords that represent each website's domain.
● Example: “Information Retrieval”, “Science”
● Approach: exploit the Wikipedia article–category structure (a community-detection sketch follows).

[Figure: Titles of web pages are intersected with Wikipedia articles and redirects to produce intersected phrases; a community detection algorithm over the Wikipedia category graph identifies readable, domain-specific phrases and domain-specific single terms, which are merged into the final domain-specific keywords.]
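A hedged sketch of the community-detection step using networkx's greedy modularity algorithm (the slides specify a community detection algorithm, not this particular one); the tiny category graph reuses names from the earlier truncated-graph example:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("information_science", "information"),
    ("information", "information_technology"),
    ("information_science", "science"),
    ("science", "scientific_disciplines"),
    ("sociology", "social_sciences"),
    ("social_sciences", "society"),
])

# Each community groups categories that cluster together in the graph;
# phrases mapping into the dominant community are kept as domain-specific.
for community in greedy_modularity_communities(G):
    print(sorted(community))
```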
Innovation in Automotive
[Figure: Category visualization — red marks probability 1.0, green 0.5, white 0.0; node size represents how much a category is mentioned inside the dataset.]
Python Snippet for the Usage of the WikiMadeEasy API
```python
wiki_client = Wiki_client_service()
print(wiki_client.process(['isTitle', 'business', 0]))
print(wiki_client.process(['isPerson', 'albert einstein', 0]))
print(wiki_client.process(['mentionInCategories', 'data mining', 0]))
print(wiki_client.process(['containsArticles', 'business', 0]))
print(wiki_client.process(['matchesCategories', 'pakistan', 0]))
print(wiki_client.process(['matchesArticles', 'computer science', 0]))
print(wiki_client.process(['getWikiOutlinks', 'pagerank', 0]))
print(wiki_client.process(['getWikiInlinks', 'google', 0]))
print(wiki_client.process(['getExtendedAbstract', 'pakistan', 0]))
print(wiki_client.process(['getSubCategory', 'science', 0]))
print(wiki_client.process(['getSuperCategory', 'science', 0]))
graph_dict = wiki_client.process(['getSubtoSuperCategoryGraph', ['information_science', 'sociology'], 2])
```
Questions?