Text Mining, Word Embeddings, & Wikipedia
Muhammad Atif Qureshi
Contents
● Introduction
● Text Mining
  – Similar words
  – Word ambiguity
● Word Embedding
  – Related Research
  – Toy Example
● Wikipedia
  – Structure
  – Phrase Chunking
  – Case studies
Problem
● Motivation
  – Human beings have long found comfort in expressing their viewpoints in writing, because writing preserves thoughts far longer than oral communication.
  – Textual data is a very popular means of communication over the World Wide Web, in the form of online news websites, social networks, emails, governmental websites, etc.
● Observation
  Text may contain the following complexities:
  – Lack of contextual and background information
  – Ambiguity due to more than one possible interpretation of the meaning of the text
  – Focus and assertions on multiple topics
Text Mining
● Motivation
With so much textual data around us, especially on the World Wide Web, there is a strong motivation to understand what the data means
● Definition
Text mining is the process of analyzing textual data to derive high-quality information on the basis of patterns
Similar Words
● Can similar words be grouped together as one?
  – Simple techniques (sketched below)
    ● Lemmatization (mapping plurals to singulars; accurate but low coverage)
    ● Stemming (mapping a word to a root form; inaccurate but high coverage)
  – Complex technique
    ● “A word is known by the company it keeps” → Word Embeddings
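A minimal sketch of the two simple techniques, using NLTK (the library choice is an assumption; the slides name no tool):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # needs the WordNet corpus: nltk.download('wordnet')

for word in ["studies", "wolves", "phones"]:
    # Stemming chops suffixes ("studies" -> "studi"): crude but high coverage.
    # Lemmatization maps to dictionary forms ("studies" -> "study"): accurate,
    # but only for words WordNet covers.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
```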
Word Ambiguity
● Is Apple a company or a fruit?
  – “Apple tastes better than blackberry”
  – “Apple phones are better than blackberry”
● Context is important
  – tastes → fruit
  – phones → Apple Inc.
Word Embedding
● Definition
  – A technique in NLP that quantifies a concept (word or phrase) as a vector of real numbers.
● Simple application scenario (sketched below)
  – How similar are two words?
  – Similarity(vector(good), vector(best))
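A minimal sketch of the similarity call above; the 3-d vectors are made-up stand-ins for real embeddings of “good” and “best”:

```python
import numpy as np

def similarity(u, v):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vector_good = np.array([0.8, 0.1, 0.3])  # hypothetical embedding of "good"
vector_best = np.array([0.7, 0.2, 0.4])  # hypothetical embedding of "best"
print(similarity(vector_good, vector_best))  # close to 1.0 for related words
```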
Related Research
● Word embeddings
  – Word2Vec
    ● A predictive model that uses a two-layer neural network (see the training sketch below)
  – FastText
    ● An extension of word2vec by Facebook
  – GloVe
    ● A count-based model that performs dimensionality reduction on the co-occurrence matrix
● Wikipedia-based Relatedness
  – Semantic Relatedness Framework
    ● Uses the Wikipedia sub-category hierarchy to measure relatedness
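A minimal sketch of training Word2Vec with gensim (the library is an assumption; the slides name the models, not an implementation). A real run needs a much larger corpus than these two sentences:

```python
from gensim.models import Word2Vec

sentences = [["apple", "phones", "are", "better", "than", "blackberry"],
             ["apple", "tastes", "better", "than", "blackberry"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv.most_similar("apple", topn=2))  # nearest words in the space
```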
Toy Example → Word Embeddings
● Train a co-occurrence matrix
● Apply cosine similarity
● Find vectors (see the toy sketch below)
● Further concepts
  – Dimensionality reduction
  – Window size
  – Filter words
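A toy sketch of these steps: build a word–word co-occurrence matrix from a tiny corpus, treat each row as a word vector, and compare rows with cosine similarity (the two-sentence corpus is illustrative):

```python
import numpy as np

corpus = ["apple phones are better than blackberry",
          "apple tastes better than blackberry"]
window = 2  # how many neighbours on each side count as "company"

vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))

for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                M[index[w], index[words[j]]] += 1  # count each co-occurrence

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words appearing in similar contexts get similar rows, hence high similarity.
print(cosine(M[index["apple"]], M[index["blackberry"]]))
```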
Word Analogies
● Man is to Woman as King is to ____ ?
● London is to England as Islamabad is to ____ ?
● Using vectors, we can say (see the sketch below)
  – King – Man + Woman → Queen
  – Islamabad – London + England → Pakistan
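A minimal sketch of the analogy arithmetic using pretrained GloVe vectors via gensim's downloader (the model choice is an assumption; the first call downloads the vectors):

```python
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# islamabad - london + england ≈ pakistan, given good corpus coverage
print(vectors.most_similar(positive=["islamabad", "england"], negative=["london"], topn=1))
```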
Why Wikipedia for Text Mining?
● One of the largest encyclopedias
● Free to use
● Collaboratively and actively updated
Wikipedia
● Each article has a title that identifies a concept.
● Each article contains content that defines that concept textually.
● Each article is listed under different categories.
  – E.g., the article ‘Espresso’ is listed under ‘Coffee drinks’, ‘Italian cuisine’, etc.
● Each Wikipedia category generally has parent and child categories (a minimal representation is sketched below).
  – E.g., ‘Italian cuisine’ has parent categories ‘Italian culture’, ‘Cuisine by nationality’, etc.
  – E.g., ‘Italian cuisine’ has child categories ‘Italian desserts’, ‘Pizza’, etc.
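A minimal sketch of the structures just described, using plain dicts (the representation is an assumption; the names come from the slide's examples):

```python
article_categories = {
    "Espresso": ["Coffee drinks", "Italian cuisine"],
}
category_parents = {
    "Italian cuisine": ["Italian culture", "Cuisine by nationality"],
}
category_children = {
    "Italian cuisine": ["Italian desserts", "Pizza"],
}

# Walking up from an article yields increasingly general categories.
for category in article_categories["Espresso"]:
    print(category, "-> parents:", category_parents.get(category, []))
```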
Wikipedia Graph Structure

[Figure: Wikipedia category graph structure along with Wikipedia articles — category nodes (C1–C10) joined by category edges, and article nodes (A1–A4) that belong to categories and link to one another.]
Example of Wikipedia Category Structure

[Figure: Truncated Wikipedia category graph containing the categories academic_disciplines, science, interdisciplinary_fields, scientific_disciplines, behavioural_sciences, society, social_sciences, science_studies, information_technology, information, sociology, and information_science.]
Phrase Chunking using Wikipedia
The pipeline runs on the input “I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good.” (a code sketch follows):

● Conversion into lowercase: “i prefer samsung s5 over htc, apple, nokia because it is economical and good.”
● Phrase chunking using phrase boundaries: “i prefer samsung s5 over htc apple nokia because it is economical and good”
● Keep the longest phrase that matches a Wikipedia article title or redirect (and is not a stopword)
● Removed stopwords: i, prefer, over, because, it, is, and
● Extracted phrases: samsung s5, htc, apple, nokia, economical, good
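A minimal sketch of the longest-match chunking above; the title set and stopword list are illustrative stand-ins for the real Wikipedia titles/redirects and stopword list:

```python
wiki_titles = {"samsung s5", "htc", "apple", "nokia", "economical", "good"}
stopwords = {"i", "prefer", "over", "because", "it", "is", "and"}
MAX_PHRASE_LEN = 3  # longest phrase (in words) to try

def chunk(text):
    words = text.lower().replace(",", "").rstrip(".").split()
    phrases, i = [], 0
    while i < len(words):
        # Try the longest candidate first, then shrink to a single word.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            cand = " ".join(words[i:i + n])
            if cand in wiki_titles and cand not in stopwords:
                phrases.append(cand)
                i += n
                break
        else:
            i += 1  # stopword or unknown word: skip it
    return phrases

print(chunk("I prefer Samsung S5 over HTC, Apple, Nokia because it is economical and good."))
# -> ['samsung s5', 'htc', 'apple', 'nokia', 'economical', 'good']
```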
Word Embedding using Wikipedia
● We can find more complex relationships due to
  – the article–category graph structure
  – multi-lingual relations
  – infobox attributes (birth, age, etc.)
Wikipedia Based Semantic Relatedness Framework

[Figure: A stream of text is phrase-chunked against Wikipedia article titles and redirects (drawn from the Wikipedia documents) to yield candidate phrases; a relatedness calculator scores these phrases against the Wikipedia category–article structure, and the resulting relatedness scores feed applications such as online reputation management tasks and a perspective aware search engine.]
Perspective Aware Approach to Search
● Problem: The result set a search engine (Google, Bing, Yahoo) returns for a user's query may carry an inherent perspective, owing to issues with the search engine or with the underlying collection.
● PAS is a system that allows users to specify a perspective together with their query at query time.
● The system allows users to quickly gauge the presence of that perspective in the returned set.
● Perspective is modelled using the Wikipedia article–category graph structure (a scoring sketch follows)
  – Perspective: activism
  – The system fetches Wikipedia articles defining activism by traversing the category graph structure
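A hedged sketch of the idea, not the authors' implementation: assume the texts of Wikipedia articles under the ‘activism’ category have already been fetched (the two strings below are illustrative stand-ins), then score each search result by its overlap with the perspective vocabulary:

```python
perspective_articles = [
    "activism consists of efforts to promote social or political change",
    "a protest is a public expression of objection or dissent",
]
perspective_vocab = {w for doc in perspective_articles for w in doc.split()}

def perspective_score(result_text):
    # Fraction of a result's words that occur in the perspective vocabulary.
    words = set(result_text.lower().split())
    return len(words & perspective_vocab) / max(len(words), 1)

print(perspective_score("Protesters demand political change in the capital"))
```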
Keyword Extraction via Identification of Domain-Specific Keywords
● Problem: Given a collection of document titles from different school websites, extract domain-specific keywords that represent each website's domain.
● Example: “Information Retrieval”, “Science”
● Approach: exploit the Wikipedia article–category structure (a community-detection sketch follows).

[Figure: Titles of web pages are intersected with Wikipedia articles and redirects to produce intersected phrases; a community detection algorithm over the Wikipedia category graph identifies readable, domain-specific phrases and domain-specific single terms, which are merged into the final domain-specific keywords.]
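A hedged sketch of the community-detection step using networkx's greedy modularity algorithm (the slides specify a community detection algorithm, not this particular one); the tiny category graph reuses names from the earlier truncated-graph example:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("information_science", "information"),
    ("information", "information_technology"),
    ("information_science", "science"),
    ("science", "scientific_disciplines"),
    ("sociology", "social_sciences"),
    ("social_sciences", "society"),
])

# Each community groups categories that cluster together in the graph;
# phrases mapping into the dominant community are kept as domain-specific.
for community in greedy_modularity_communities(G):
    print(sorted(community))
```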
Innovation in Automotive
[Figure: Category visualization — red marks probability 1.0, green 0.5, white 0.0; node size represents how much a category is mentioned inside the dataset.]
Python Snippet for the Usage of the WikiMadeEasy API
```python
wiki_client = Wiki_client_service()
print(wiki_client.process(['isTitle', 'business', 0]))
print(wiki_client.process(['isPerson', 'albert einstein', 0]))
print(wiki_client.process(['mentionInCategories', 'data mining', 0]))
print(wiki_client.process(['containsArticles', 'business', 0]))
print(wiki_client.process(['matchesCategories', 'pakistan', 0]))
print(wiki_client.process(['matchesArticles', 'computer science', 0]))
print(wiki_client.process(['getWikiOutlinks', 'pagerank', 0]))
print(wiki_client.process(['getWikiInlinks', 'google', 0]))
print(wiki_client.process(['getExtendedAbstract', 'pakistan', 0]))
print(wiki_client.process(['getSubCategory', 'science', 0]))
print(wiki_client.process(['getSuperCategory', 'science', 0]))
graph_dict = wiki_client.process(['getSubtoSuperCategoryGraph', ['information_science', 'sociology'], 2])
```
Questions?