Riyadh UseR Group - 1st Meeting (Dec 2016)
TRANSCRIPT
Innovating and Being Creative with R
1st Riyadh UseR Meetup
15th December 2016
Allure Hub, King Fahd Road
https://goo.gl/O4rnv3
https://goo.gl/gx0iWX
Ali Kazmi
http://goo.gl/IcwGiB
https://goo.gl/eBGEKd
@scac1041
https://goo.gl/tiWAMm
• Meeting
• UseR Group
• Meetup Group
• 1st things 1st…
Objectives
Promote Usage of R
• Statistical Data Analysis Tool
• General Purpose Programming Tool
Promote Computational Thinking
Promote Creativity with R
Enable Riyadh useRs to become ‘Data Analytic Citizens’
Enable Riyadh useRs to become ‘Data Analytic Residents’
Content Coverage
• Commercial Settings
  – Use cases for Commercial work
• Personal Settings
  – Use cases for possibly non-Commercial/Private work
Structure of UseR Meetup Team
1. Ali Kazmi (Organiser)
2. _________
3. _________
Not a one man show, please.
Today’s Presentations
• Personal Setting
  – Data Journalism with R and Stylometry: Identifying the number of writers of a Prime Minister's speeches
• Commercial Setting
  – Data de-duplication: Analysing misspelled names to identify which refer to the same person
Using Stylometry to Identify Authorship of Texts
A series of events prompt the Pakistani Prime Minister to address the nation…
A speech is delivered...
And, thereafter, an Audio clip is leaked, showing the PM taking advice on writing style
Journalists wondered if the PM takes advice on writing style for important speeches only….
…Are some other speeches also a product of such brainstorming sessions?
How can we answer this?
Stylometry is Linguistics + Statistics applied to detect stylistic changes in text
Assumption of Stylometry: Each writer has a distinct style of writing that is unconsciously learnt and used.
Various aspects of text can capture Stylistic variation:
• Punctuation Markers: ; , . !
• Length of a sentence: "Actually I don’t think that it is good because of the fact that this is not the…"
• Vocabulary Richness: "It behoves me to accomplish this work."
• Parts of Speech: Verb, Noun, Adjective, Adverb, Conjunction, etc.
• Function Words: that, but, therefore, and, etc.
What characterises a person’s writing style?
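As a rough illustration of how such aspects can be turned into numbers, here is a minimal Python sketch. It is not the code used in the study: the function name `style_features`, the tiny `FUNCTION_WORDS` list and the sample text are all assumptions made for this example. Parts of speech are left out because they would need a tagger.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = ["that", "but", "therefore", "and", "the", "of", "to"]

def style_features(text):
    """Return a few simple stylistic measurements for one non-empty text."""
    # Split into sentences at whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lower-cased word tokens (letters and apostrophes only).
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        # How often each punctuation marker appears.
        "punctuation": {p: text.count(p) for p in ";,.!"},
        # Average number of words per sentence.
        "mean_sentence_length": len(words) / len(sentences),
        # Vocabulary richness as distinct words divided by total words.
        "type_token_ratio": len(counts) / len(words),
        # Relative frequency of each function word.
        "function_words": {w: counts[w] / len(words) for w in FUNCTION_WORDS},
    }

sample = "It behoves me to accomplish this work. But that is hard, and I am tired!"
features = style_features(sample)
for name, value in features.items():
    print(name, value)
```

Each text becomes a small record of numbers, which is what allows two texts to be compared statistically.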
Applications
• J. K. Rowling & Galbraith
• Writing Style in Novels
Roadmap
• Extract
• Quantify
• Analyse
• Visualise
Multi-Dimensional Scaling, PCA, Bootstrap Consensus Trees
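The roadmap can be sketched end to end on toy data. The Python example below is an illustration under stated assumptions, not the study's pipeline: the three texts, the `FUNCTION_WORDS` list and the choice of Euclidean distance are all made up for this sketch (the slides do not name a distance measure). It stops at a matrix of pairwise distances between texts, which is the kind of input that multi-dimensional scaling works from.

```python
import math
import re
from collections import Counter

# Hypothetical function-word list and toy texts, for illustration only.
FUNCTION_WORDS = ["that", "but", "therefore", "and", "the", "of"]
TEXTS = {
    "speech_a": "The nation and the people know that the time of change is here.",
    "speech_b": "The people and the nation know that the time of reform is here.",
    "speech_c": "But therefore, but therefore, and but therefore we wait.",
}

def profile(text):
    """Extract and quantify: relative frequency of each function word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [counts[w] / len(words) for w in FUNCTION_WORDS]

def euclidean(p, q):
    """Straight-line distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

profiles = {name: profile(text) for name, text in TEXTS.items()}
names = sorted(profiles)

# Analyse: pairwise distances between all texts.
distances = {(a, b): euclidean(profiles[a], profiles[b]) for a in names for b in names}

# Visualise (here, just print the matrix row by row).
for a in names:
    print(a, [round(distances[(a, b)], 3) for b in names])
```

In this toy data, speech_a and speech_b use the function words in identical proportions, so their distance is zero, while speech_c sits far from both.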
Traditional Journalism vs. Data Journalism
• Traditional Journalism
• Data Journalism
Considerations in Stylometry
• Size of dataset/corpus
• Open World Problem
• Relatively new field
Questions?
Data De-duplication: Analysing misspelled names to identify which refer to the same person
1. Client approaches us for analysing transactional data with reference to contact names
2. Typos, variation in names…
Hamza Sheikh vs. Humza Shaikh vs. Hamza Sheik vs. Hazma Shiekh
- Hundreds of Thousands of records
- 5 Days
What to do?
Problem and Solution Elicitation
• Pattern of ‘errors’
  – Typing Mistakes
  – Minor Displacement of letters
• Solution
  – Pattern Matching ~ Risky, Time-consuming
  – String Matching Algorithms
String Matching Algorithms
• stringdist package in R
• Edit-based distance measures
  – Includes: Deletion, Addition, Substitution, Transposition
  – Generally: edit a string and count the edits needed; fewer edits = less distance = similar names!
Examples of Edit-Based Measures
How many Insertions to obtain a particular text?
Duba ➜ Dubai
How many Substitutions to obtain a particular text?
Tony ➜ Rony
How many Deletions to obtain a particular text?
Swisss ➜ Swiss
How many Transpositions to obtain a particular text?
Toyn ➜ Tony
The greater the number of edits needed, the greater the dissimilarity between two text strings
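The four examples above can be checked with a small edit-distance function. This is a minimal plain-Python sketch of the optimal string alignment distance (insertions, deletions, substitutions and transpositions of adjacent characters). It is written from scratch for illustration and is not the stringdist package's implementation; the name `osa_distance` is an assumption.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Two adjacent characters swapped counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

pairs = [("Duba", "Dubai"), ("Tony", "Rony"), ("Swisss", "Swiss"), ("Toyn", "Tony")]
results = {pair: osa_distance(*pair) for pair in pairs}
for (source, target), edits in results.items():
    print(f"{source} -> {target}: {edits} edit(s)")
```

Each of the four pairs is exactly one edit apart, while "Hamza Sheikh" and "Humza Shaikh" are two substitutions apart.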
String Similarity Metrics
Similarity Metric | Substitution | Deletion | Insertion | Transposition
Longest Common Substring
Levenshtein
Damerau-Levenshtein
Jaro-Winkler
Soundex | NA | NA | NA | NA
Jaro-Winkler is a heuristic measure designed for typos. It penalises changes to characters at remote positions, as these are probably not typos; typos tend to be transpositions at nearby positions in a string.
Talha vs. Tahla
Talha vs. Lahaat
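To make the two comparisons concrete, here is a minimal plain-Python sketch of Jaro-Winkler similarity. It follows the textbook definition: characters only match within a limited window of positions, and a shared prefix of up to four characters earns a bonus. The function names and the prefix weight `p=0.1` are assumptions for this sketch, and it returns a similarity (1 means identical) rather than a distance.

```python
def jaro(s, t):
    """Jaro similarity between s and t, in the range 0..1."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters only count as matching if their positions are this close.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        start = max(0, i - window)
        end = min(len(t), i + window + 1)
        for j in range(start, end):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1):
    """Jaro similarity with a bonus for a shared prefix of up to 4 characters."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)

near = jaro_winkler("talha", "tahla")   # adjacent letters swapped
far = jaro_winkler("talha", "lahaat")   # letters moved to remote positions
print(round(near, 3), round(far, 3))
```

The swapped-letter pair scores about 0.947, while the remote rearrangement scores about 0.739, so only the first looks like a typo of the same name.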
Soundex checks phonetic similarity for English words.
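For reference, a minimal plain-Python sketch of the classic American Soundex code (first letter plus three digits) is given below. It is written from scratch for illustration, the function name `soundex` is an assumption, and it is not claimed to match any particular library's output in edge cases.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character Soundex code of name (ASCII letters only)."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so a repeat after a vowel is coded again
    # Pad with zeros and keep the first letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]

for name in ["Sheikh", "Shaikh", "Sheik", "Shiekh", "Hamza", "Humza", "Hazma"]:
    print(name, soundex(name))
```

All four spellings of "Sheikh" map to S200, and "Hamza" and "Humza" both map to H520, but the transposed "Hazma" maps to H250. Soundex therefore catches sound-alike spellings but not swapped consonants.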
Application & Results
Similarity measures applied to relevant columns
Using each similarity measure, records with the highest similarity were identified as duplicates and merged
4,243 unique donors found!
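The actual client work is confidential, so the following is only a toy Python sketch of the general idea: score name pairs with a similarity measure and merge those above a cut-off. The greedy grouping rule, the `threshold=0.65` value and the name "Omar Farooq" are assumptions invented for this example, and plain Levenshtein distance stands in for whichever measures were really used.

```python
def levenshtein(a, b):
    """Levenshtein distance: insertions, deletions and substitutions only."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Scale the distance to 0..1, where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def group_names(names, threshold=0.65):
    """Greedily attach each name to the first group whose first member is similar enough."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name.lower(), group[0].lower()) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no similar group found: start a new one
    return groups

names = ["Hamza Sheikh", "Humza Shaikh", "Hamza Sheik", "Hazma Shiekh", "Omar Farooq"]
groups = group_names(names)
for group in groups:
    print(group)
```

The threshold here is loose because each swapped pair of letters costs two Levenshtein edits; a measure that counts a transposition as one edit would let the cut-off be stricter.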
• Can be quite expensive!
  – Memory insufficiency (with R)
  – Computationally time-consuming
Consideration
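A quick bit of arithmetic shows why the cost grows so fast: comparing every record with every other record needs n(n-1)/2 comparisons. The record counts in this Python sketch are hypothetical, since the slides only say "hundreds of thousands".

```python
def comparisons(n):
    """Number of unordered record pairs among n records."""
    return n * (n - 1) // 2

# Hypothetical record counts, chosen only to show the growth.
for n in (1_000, 100_000, 500_000):
    print(f"{n:>7,} records -> {comparisons(n):>15,} pairwise comparisons")
```

At 100,000 records that is already close to five billion comparisons, and holding a full distance matrix of that size in memory is not practical.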
Questions?
• Stylometry for Data Journalism
  – Actual Study
  – Short Presentation
• Names’ De-duplication
  – Confidential
Links to Presented Work
Should you like to Network now: Go ahead!
Otherwise: Thanks for joining this session!
Networking & Conclusion