Riyadh UseR Group - 1st Meeting (Dec 2016)

Post on 20-Mar-2017

Innovating and Being Creative with R

1st Riyadh UseR Meetup
15th December 2016
Allure Hub, King Fahd Road

https://goo.gl/O4rnv3

https://goo.gl/gx0iWX

Ali Kazmi

http://goo.gl/IcwGiB

https://goo.gl/eBGEKd

@scac1041

https://goo.gl/tiWAMm

Meeting
UseR Group
Meetup Group

1st things 1st

Objectives
Promote usage of R
    Statistical data analysis tool
    General-purpose programming tool
Promote computational thinking
Promote creativity with R
Enable Riyadh useRs to become Data Analytic Citizens / Residents

Content Coverage
Commercial Settings: use cases for commercial work
Personal Settings: use cases for possibly non-commercial/private work

Structure of UseR Meetup Team
Ali Kazmi (Organiser)
__________________

Not a one man show, please.

We R a community, and this is a community-run project, for the community, by the community.

Today's Presentations

Personal Setting
Data Journalism with R and Stylometry: Identifying the number of writers of a Prime Minister's speeches

Commercial Setting
Data de-duplication: Analysing misspelled names to identify which refer to the same person

Using Stylometry to Identify Authorship of Texts

A series of events prompts the Pakistani Prime Minister to address the nation

A speech is delivered...

And, thereafter, an audio clip is leaked, showing the PM taking advice on writing style

The media wondered whether the PM takes advice on writing style for important speeches only.

Are some other speeches also a product of such brainstorming sessions?

How can we answer this?

Stylometry is linguistics + statistics applied to detect stylistic changes in text.

Assumption of stylometry: each writer has a distinct style of writing that is unconsciously learnt and used.

What characterises a person's writing style?

Various aspects of text can capture stylistic variation:
Punctuation markers: ; , . !
Length of a sentence: "Actually I don't think that it is good because of the fact that this is not the ..."
Vocabulary richness: "It behoves me to accomplish this work."
Parts of speech: verb, noun, adjective, adverb, conjunction, etc.
Function words: that, but, therefore, and, etc.
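As a rough illustration of how such aspects can be turned into numbers, here is a minimal sketch in Python (the talk itself worked in R). The function name `style_features`, the tiny function-word list and the sample sentence are assumptions made for this example only, not the study's actual method.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption); real studies use far longer lists.
FUNCTION_WORDS = {"that", "but", "therefore", "and"}

def style_features(text):
    """Return a few simple stylometric measurements for one text."""
    # Split on sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)
    return {
        # How often each punctuation marker occurs.
        "punctuation": dict(Counter(ch for ch in text if ch in ";,.!")),
        # Average number of words per sentence.
        "mean_sentence_length": len(words) / len(sentences),
        # Vocabulary richness: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of all words that are function words.
        "function_word_rate": sum(word_counts[w] for w in FUNCTION_WORDS) / len(words),
    }

# Made-up sample text, partly borrowed from the slide's example sentence.
sample = "It behoves me to accomplish this work. But I think that it is good, and that is that!"
features = style_features(sample)
print(features)
```

Each text then becomes a small vector of numbers, which is what makes the later statistical comparison between texts possible.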

Applications
J. K. Rowling & Galbraith

Writing Style in Novels

Roadmap

Extract

Quantify

Analyse

Visualise

Multi-Dimensional Scaling, PCA, Bootstrap Consensus Trees
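The Quantify and Analyse steps of the roadmap can be sketched as follows, in Python for self-containment. This is only an assumed, simplified pipeline: each text is reduced to function-word frequencies and a pairwise distance matrix is computed, which is the kind of input a technique such as multi-dimensional scaling works from. The word list, the Manhattan distance and the placeholder texts are illustrative choices, not taken from the study.

```python
import re
from collections import Counter

# Illustrative function-word list (an assumption; real studies use far more words).
FUNCTION_WORDS = ["that", "but", "therefore", "and", "the", "of"]

def profile(text):
    """Relative frequency of each function word in one text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = max(len(words), 1)  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def manhattan(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(x - y) for x, y in zip(p, q))

def distance_matrix(texts):
    """Pairwise distances between the texts' function-word profiles."""
    profiles = [profile(t) for t in texts]
    return [[manhattan(p, q) for q in profiles] for p in profiles]

# Made-up placeholder sentences, not real speeches.
texts = [
    "the cat and the dog",
    "the cat and the dog",
    "but therefore that",
]
matrix = distance_matrix(texts)
for row in matrix:
    print(row)
```

Texts with similar function-word habits end up close together in this matrix; the visualisation step then tries to show that closeness in a plot.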

Traditional Journalism vs. Data Journalism

Traditional Journalism

Data Journalism

Considerations in Stylometry

Size of dataset/corpus

Open World Problem

Relatively new field

Questions?

Data De-duplication: Analysing misspelled names to identify which refer to the same person

Client approaches us for analysing transactional data with reference to contact names

Typos, variation in names

Hamza Sheikh vs. Humza Shaikh vs. Hamza Sheik vs. Hazma Shiekh

Hundreds of thousands of records, 5 days

What to do?

Problem and Solution Elicitation

Pattern of errors
Typing mistakes
Minor displacement of letters

Solution
Pattern matching ~ risky, time-consuming
String matching algorithms

String Matching Algorithms

stringdist package in R

Edit-based distance measures
Includes: deletion, addition, substitution, transposition
Generally: edit a string, count the number of edits
Fewer edits = less distance = similar names!
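To make the "count the edits" idea concrete, here is a minimal sketch of one edit-based measure, the Levenshtein distance, written in plain Python for self-containment. It counts insertions, deletions and substitutions only, so a swap of two neighbouring letters costs two edits. The function name is chosen for this example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        previous = current
    return previous[-1]

# Name variants from the client example.
print(levenshtein("Hamza Sheikh", "Hamza Sheik"))   # one deleted letter
print(levenshtein("Hamza Sheikh", "Humza Shaikh"))  # two substituted letters
print(levenshtein("Hamza Sheikh", "Hazma Shiekh"))  # two swaps, each costing two edits
```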

Examples of Edit-Based Measures

How many insertions to obtain a particular text?

Duba -> Dubai

How many substitutions to obtain a particular text?

Tony -> Rony

How many deletions to obtain a particular text?

Swisss -> Swiss

How many transpositions to obtain a particular text?

Toyn -> Tony

The greater the number of edits to the text, the greater the dissimilarity of the two text strings
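The four examples above can be checked with a small sketch of the optimal string alignment distance, a restricted form of Damerau-Levenshtein that counts a swap of two adjacent letters as a single edit. It is written in Python for self-containment and the function name is chosen for this example.

```python
def osa_distance(a, b):
    """Edits (insert, delete, substitute, adjacent transposition) turning a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped: count the swap as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

for source, target in [("Duba", "Dubai"), ("Tony", "Rony"),
                       ("Swisss", "Swiss"), ("Toyn", "Tony")]:
    print(source, "->", target, osa_distance(source, target))
```

Each of the four pairs is one edit apart under this measure, which is why they all count as very similar strings.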

There are different ways of measuring the similarity of character data (heuristic approaches, q-grams, edit-based measures).

We chose edit-based measures + a heuristic approach (emphasise that this is drawn from intuition). Explain the slide.

Aimia 4x3 Template v10 GREEN 12/15/2016

String Similarity Metrics

Columns: Similarity Metric, Substitution, Deletion, Insertion, Transposition
Rows: Longest Common Substring, Levenshtein, Damerau-Levenshtein, Jaro-Winkler, Soundex (NA in all four columns)

Jaro-Winkler is a heuristic measure for typos. It is designed to penalise changes to characters at remote positions, as these are probably not typos; typos tend to occur as transpositions at nearby positions in a string.

Talha vs. Tahla
Talha vs. Lahaat
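A minimal sketch of the Jaro-Winkler similarity applied to the two name pairs, written in Python for self-containment. The prefix weight `p=0.1` and the four-character prefix limit are the commonly used defaults and are an assumption here, as is lower-casing the names before comparison.

```python
def jaro(a, b):
    """Jaro similarity between two strings, from 0 (no match) to 1 (identical)."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters only count as matching if they are at most `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a, b, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    similarity = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return similarity + prefix * p * (1 - similarity)

near = jaro_winkler("talha", "tahla")   # neighbouring letters swapped
far = jaro_winkler("talha", "lahaat")   # letters moved to remote positions
print(round(near, 3), round(far, 3))
```

The swapped-neighbour variant scores clearly higher than the variant whose letters have moved far from their original positions, which is the behaviour described above.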

Soundex checks phonetic similarity for English words.

Explain the slide.
Explain Jaro-Winkler: a heuristic specifically formulated for typos/incorrect data entry; it measures similarity by accounting for character mismatches, taking into account the finding that fewer typos typically occur at the beginning than at the end of words.
Soundex: checks the phonetic structure of (English) words; a similar phonetic structure increases the possibility of the records being duplicates.
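For Soundex, a minimal sketch of the standard American Soundex code in Python, written for self-containment and not taken from the presented work. Names that sound alike in English receive the same four-character code.

```python
# Map each consonant to its Soundex digit; vowels, h, w and y carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Four-character American Soundex code: first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Skip letters without a digit and repeats of the previous digit.
        if code and code != previous:
            encoded += code
        # h and w do not separate two letters that share a digit.
        if ch not in "hw":
            previous = code
    return (encoded + "000")[:4]

for name in ("Sheikh", "Shaikh", "Sheik", "Shiekh", "Hamza", "Humza", "Hazma"):
    print(name, soundex(name))
```

All four spellings of the surname share one code, while "Hazma" receives a different code from "Hamza" because swapping the consonants changes the digit order. That is one reason to combine Soundex with edit-based measures rather than rely on it alone.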

Application & Results

Each similarity measure was applied to the data.

Separating the wheat from chaff: only high similarity records were identified as being duplicates
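The "keep only high-similarity records" step can be sketched as below, in Python for self-containment. The normalised Levenshtein similarity, the 0.8 threshold and the extra name "Omar Farooq" are assumptions made for illustration; the thresholds and measures actually used in the project are not stated in the slides.

```python
from itertools import combinations

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Edit distance rescaled to a 0..1 similarity (1 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def likely_duplicates(names, threshold=0.8):
    """Return the pairs of names whose similarity reaches the threshold."""
    return [(a, b) for a, b in combinations(names, 2)
            if similarity(a, b) >= threshold]

# Name variants from the slides plus one made-up unrelated name.
names = ["Hamza Sheikh", "Humza Shaikh", "Hamza Sheik", "Hazma Shiekh", "Omar Farooq"]
pairs = likely_duplicates(names)
for pair in pairs:
    print(pair)
```

With this threshold the variant with two letter swaps is not matched to "Hamza Sheikh", since plain Levenshtein charges two edits per swap. This shows why the choice of measure and cut-off matters.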

Consideration
Can be quite expensive!
Memory insufficiency (with R)
Computationally time-consuming
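A back-of-the-envelope sketch in Python shows why comparing every name with every other name gets expensive. The record counts are hypothetical round numbers, and the memory estimate assumes a dense matrix of 8-byte numbers, which may not match how the project actually stored its results.

```python
def pair_count(n):
    """Number of distinct pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def dense_matrix_bytes(n, bytes_per_value=8):
    """Memory needed for a full n x n matrix of 8-byte numbers (an assumption)."""
    return n * n * bytes_per_value

# Hypothetical record counts, chosen only to show how quickly the cost grows.
for n in (1_000, 100_000, 500_000):
    print(f"{n:>9,} records: {pair_count(n):>16,} pairs, "
          f"{dense_matrix_bytes(n) / 1e9:,.1f} GB as a dense matrix")
```

The cost grows with the square of the number of records, so a hundredfold increase in records means roughly ten thousand times as many comparisons.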

Questions?

Links to Presented Work

Stylometry for Data Journalism: Actual Study, Short Presentation

Names De-duplication: Confidential

Networking & Conclusion

Should you like to network now: go ahead!

Otherwise: thanks for joining this session!