Riyadh UseR Group - 1st Meeting (Dec 2016)


Upload: ali-arsalan-kazmi

Post on 20-Mar-2017


Page 1: Riyadh UseR Group - 1st Meeting (Dec 2016)

Innovating and Being Creative

with

1st Riyadh UseR Meetup

15th December 2016

Allure Hub, King Fahd Road

https://goo.gl/O4rnv3

https://goo.gl/gx0iWX

Page 2: Riyadh UseR Group - 1st Meeting (Dec 2016)

Ali Kazmi

http://goo.gl/IcwGiB

https://goo.gl/eBGEKd

@scac1041

https://goo.gl/tiWAMm

Page 3: Riyadh UseR Group - 1st Meeting (Dec 2016)

• Meeting

Page 4: Riyadh UseR Group - 1st Meeting (Dec 2016)

• Meeting

• UseR Group

Page 5: Riyadh UseR Group - 1st Meeting (Dec 2016)

• Meeting

• UseR Group

• Meetup Group

Page 6: Riyadh UseR Group - 1st Meeting (Dec 2016)

• Meeting

• UseR Group

• Meetup Group

• 1st things 1st…

Page 7: Riyadh UseR Group - 1st Meeting (Dec 2016)

Objectives

Page 8: Riyadh UseR Group - 1st Meeting (Dec 2016)

Objectives

Promote Usage of R

• Statistical Data Analysis Tool

• General Purpose Programming Tool

Promote Computational Thinking

Promote Creativity with R

Enable Riyadh useRs to become ‘Data Analytic Citizens’

Page 9: Riyadh UseR Group - 1st Meeting (Dec 2016)

Objectives

Promote Usage of R

• Statistical Data Analysis Tool

• General Purpose Programming Tool

Promote Computational Thinking

Promote Creativity with R

Enable Riyadh useRs to become ‘Data Analytic Residents’

Page 10: Riyadh UseR Group - 1st Meeting (Dec 2016)

Content Coverage

Page 11: Riyadh UseR Group - 1st Meeting (Dec 2016)

Content Coverage

• Commercial Settings

– Use cases for Commercial work

• Personal Settings

– Use cases for possibly non-Commercial/Private work

Page 12: Riyadh UseR Group - 1st Meeting (Dec 2016)

Structure of UseR Meetup Team

1. Ali Kazmi (Organiser)

2. _________

3. _________

Not a one man show, please.

Page 13: Riyadh UseR Group - 1st Meeting (Dec 2016)

Today’s Presentations

• Personal Setting

– Data Journalism with R and Stylometry: Identifying the number of writers of a Prime Minister's speeches

• Commercial Setting

– Data de-duplication: Analysing misspelled names to identify which refer to the same person

Page 14: Riyadh UseR Group - 1st Meeting (Dec 2016)

Using Stylometry to Identify Authorship of Texts

Page 15: Riyadh UseR Group - 1st Meeting (Dec 2016)

A series of events prompts the Pakistani Prime Minister to address the nation…

Page 16: Riyadh UseR Group - 1st Meeting (Dec 2016)

A speech is delivered...

And, thereafter, an Audio clip is leaked, showing the PM taking advice on writing style

Page 17: Riyadh UseR Group - 1st Meeting (Dec 2016)

Journalists wondered if the PM takes advice on writing style for important speeches only…

…Are some other speeches also a product of such brainstorming sessions?

Page 18: Riyadh UseR Group - 1st Meeting (Dec 2016)

Media wondered if the PM takes advice on writing style for important speeches only…

…Are some other speeches also a product of such brainstorming sessions?

How can we answer this?

Page 20: Riyadh UseR Group - 1st Meeting (Dec 2016)

Stylometry is Linguistics + Statistics applied to detect stylistic changes in text

Page 21: Riyadh UseR Group - 1st Meeting (Dec 2016)

Stylometry is Linguistics + Statistics applied to detect stylistic changes in text

Assumption of Stylometry: Each writer has a distinct style of writing that is unconsciously learnt and used.

Page 22: Riyadh UseR Group - 1st Meeting (Dec 2016)

What characterises a person’s writing style?

Various aspects of text can capture stylistic variation:

• Punctuation Markers: ; , . !

• Length of a sentence: Actually I don’t think that it is good because of the fact that this is not the…

• Vocabulary Richness: It behoves me to accomplish this work.

• Parts of Speech: Verb, Noun, Adjective, Adverb, Conjunction, etc.

• Function Words: That, but, therefore, and, etc.
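
As a rough illustration of how the aspects on this slide might be turned into numbers, here is a minimal plain-Python sketch. It is not the code used in the study: the function name `style_features`, the four-word function-word list (taken from the slide's examples), and the second sentence of the sample text are all assumptions made for this example. Parts of speech are left out because tagging them needs a language model or tagger.

```python
import re
from collections import Counter

# Hypothetical function-word list, copied from the slide's examples.
FUNCTION_WORDS = ("that", "but", "therefore", "and")


def style_features(text):
    """Return a few simple stylometric measurements for one non-empty text."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        # How often each punctuation marker occurs.
        "punctuation": {p: text.count(p) for p in ";,.!"},
        # Average number of words per sentence.
        "mean_sentence_length": len(words) / len(sentences),
        # Vocabulary richness: distinct words divided by total words.
        "type_token_ratio": len(counts) / len(words),
        # Relative frequency of each function word.
        "function_words": {w: counts[w] / len(words) for w in FUNCTION_WORDS},
    }


# First sentence is from the slide; the second is made up for illustration.
sample = "It behoves me to accomplish this work. But that is hard, and that is fine!"
features = style_features(sample)
print(features)
```

Each text then becomes a small vector of numbers, which is the form that comparison and visualisation methods need as input.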

Page 23: Riyadh UseR Group - 1st Meeting (Dec 2016)

Applications

• J. K. Rowling & Galbraith

• Writing Style in Novels

Page 24: Riyadh UseR Group - 1st Meeting (Dec 2016)

Roadmap

• Extract

• Quantify

• Analyse

• Visualise

Multi-Dimensional Scaling, PCA, Bootstrap Consensus Trees

Page 25: Riyadh UseR Group - 1st Meeting (Dec 2016)
Page 26: Riyadh UseR Group - 1st Meeting (Dec 2016)
Page 27: Riyadh UseR Group - 1st Meeting (Dec 2016)

Traditional Journalism vs. Data Journalism

• Traditional Journalism

• Data Journalism

Page 28: Riyadh UseR Group - 1st Meeting (Dec 2016)

Considerations in Stylometry

• Size of dataset/corpus

• Open World Problem

• Relatively new field

Page 29: Riyadh UseR Group - 1st Meeting (Dec 2016)

Questions?

Page 30: Riyadh UseR Group - 1st Meeting (Dec 2016)

Data De-duplication: Analysing misspelled names to identify which refer to the same person

Page 31: Riyadh UseR Group - 1st Meeting (Dec 2016)

Client approaches us for analysing transactional data with reference to contact names

1

Page 32: Riyadh UseR Group - 1st Meeting (Dec 2016)

Client approaches us for analysing transactional data with reference to contact names

1

2

Typos, variation in names…

Hamza Sheikh vs. Humza Shaikh vs. Hamza Sheik vs. Hazma Shiekh

Page 33: Riyadh UseR Group - 1st Meeting (Dec 2016)

Client approaches us for analysing transactional data with ref. to contacts

1

2

Typos, variation in names…

- Hundreds of thousands of records

- 5 days

What to do?

Page 34: Riyadh UseR Group - 1st Meeting (Dec 2016)

Problem and Solution Elicitation

• Pattern of ‘errors’

– Typing Mistakes

– Minor Displacement of letters

• Solution

– Pattern Matching ~ Risky, Time-consuming

– String Matching Algorithms

Page 35: Riyadh UseR Group - 1st Meeting (Dec 2016)

String Matching Algorithms

• stringdist package in R

• Edit-based distance measures

– Includes:

• Deletion

• Addition

• Substitution

• Transposition

– Generally:

• Edit a string

• Count the number of edits

• Fewer edits = less distance = similar names!
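
To make the "count the edits" idea concrete, here is a minimal plain-Python sketch of the Levenshtein distance, which counts insertions, deletions and substitutions. It is an illustration only, not the stringdist implementation used in the talk, and the function name `levenshtein` is an assumption. The names are the variants quoted earlier in the deck.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


reference = "Hamza Sheikh"
variants = ["Humza Shaikh", "Hamza Sheik", "Hazma Shiekh"]
distances = {name: levenshtein(reference, name) for name in variants}
for name, d in distances.items():
    print(f"{reference!r} -> {name!r}: {d} edit(s)")
```

This prints 2, 1 and 4 edits respectively. "Hazma Shiekh" scores 4 because plain Levenshtein has no transposition step, so each swapped pair of letters costs two edits.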

Page 36: Riyadh UseR Group - 1st Meeting (Dec 2016)

Examples of Edit-Based Measures

How many Insertions to obtain a particular text?

Duba ➜ Dubai

How many Substitutions to obtain a particular text?

Tony ➜ Rony

How many Deletions to obtain a particular text?

Swisss ➜ Swiss

How many Transpositions to obtain a particular text?

Toyn ➜ Tony

The greater the number of edits, the greater the dissimilarity between two text strings
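
The four examples on this slide can be checked with a small plain-Python sketch of the optimal string alignment distance, a restricted form of Damerau-Levenshtein that counts a swap of two adjacent characters as a single edit. This is an illustrative sketch, not the code from the talk; the function name `osa_distance` is an assumption.

```python
def osa_distance(a, b):
    """Edit distance where insertion, deletion, substitution and the
    transposition of two adjacent characters each cost one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


examples = [
    ("Duba", "Dubai", "insertion"),
    ("Tony", "Rony", "substitution"),
    ("Swisss", "Swiss", "deletion"),
    ("Toyn", "Tony", "transposition"),
]
for source, target, kind in examples:
    print(f"{source} -> {target}: {osa_distance(source, target)} ({kind})")
```

Each pair comes out at a distance of 1, one edit of the kind named on the slide.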

Page 37: Riyadh UseR Group - 1st Meeting (Dec 2016)

String Similarity Metrics

Similarity Metric    Substitution    Deletion    Insertion    Transposition

Longest Common Substring

Levenshtein

Damerau – Levenshtein

Jaro – Winkler

Soundex NA NA NA NA

Jaro – Winkler is a heuristic measure designed for typos. It penalises matches between characters at remote positions, as these are probably not typos; typos usually arise from transpositions at nearby positions in a string.

Talha vs. Tahla

Talha vs. Lahaat

Soundex checks phonetic similarity for English words.
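
The Jaro – Winkler behaviour described on this slide can be sketched in plain Python as follows. This follows the standard textbook definition rather than any particular package, and it returns a similarity (1 means identical) rather than a distance. The function names, the prefix weight `p=0.1` and the prefix cap of four characters are the commonly used values, assumed here for illustration.

```python
def jaro(a, b):
    """Jaro similarity: characters only match within a limited window of positions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, p=0.1):
    """Jaro similarity with a bonus for a shared prefix of up to four characters."""
    sim = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


near = jaro_winkler("Talha", "Tahla")   # letters swapped at neighbouring positions
far = jaro_winkler("Talha", "Lahaat")   # letters moved to remote positions
print(round(near, 3), round(far, 3))
```

With these settings "Talha" vs. "Tahla" scores about 0.947 and "Talha" vs. "Lahaat" scores 0.7, so the nearby swap is treated as much more likely to be a typo. The comparison is case-sensitive; lowercasing both names first would be a reasonable extra step.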

Page 38: Riyadh UseR Group - 1st Meeting (Dec 2016)

Application & Results

Similarity measures applied to relevant columns

Using each similarity measure, records with the highest similarity identified as duplicates and merged

4,243 unique donors found!
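
The client work itself is confidential, so the following is only a hypothetical plain-Python sketch of how similar names might be grouped once a distance is available. The greedy grouping rule, the threshold of two edits, the function names and the short name list (taken from examples elsewhere in the deck) are all assumptions, not the method used in the project.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def group_names(names, max_distance=2):
    """Greedy grouping: a name joins the first group whose first member is
    within max_distance edits; otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if levenshtein(name.lower(), group[0].lower()) <= max_distance:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


names = ["Hamza Sheikh", "Humza Shaikh", "Hamza Sheik", "Talha", "Tahla"]
groups = group_names(names)
for group in groups:
    print(group)
print(len(groups), "unique names")
```

Here the five spellings collapse into two groups. Because every name may be compared against every existing group, the number of comparisons grows quickly with the number of records, and the result depends on the threshold chosen.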

Page 39: Riyadh UseR Group - 1st Meeting (Dec 2016)

Consideration

• Can be quite expensive!

– Memory insufficiency (with R)

– Computationally time-consuming

Page 40: Riyadh UseR Group - 1st Meeting (Dec 2016)

Questions?

Page 41: Riyadh UseR Group - 1st Meeting (Dec 2016)

Links to Presented Work

• Stylometry for Data Journalism

– Actual Study

– Short Presentation

• Names’ De-duplication

– Confidential

Page 42: Riyadh UseR Group - 1st Meeting (Dec 2016)

Networking & Conclusion

Should you like to Network now: Go ahead!

Otherwise: Thanks for joining this session!