finish section on linked data begin data cleaning and pre ... · json-ld (example from json-ld.org)...


Post on 01-Oct-2020


Today

•  Finish Section on Linked Data
•  Begin data cleaning and pre-processing topic


Graphs: Social networks

https://www.flickr.com/photos/marc_smith/5592302165


Protein-Protein Interactions

http://www.nature.com/nrg/journal/v5/n2/fig_tab/nrg1272_F2.html


The Internet Graph (https://en.wikipedia.org/wiki/Opte_Project)


Linked Data

•  We need to connect data together (form links).
    –  A key part of the Semantic Web
    –  Also important for the Internet of Things
        •  (26 billion things by 2020, each continuously producing data)
•  Principles of links from Tim Berners-Lee
    1.  All kinds of conceptual things, they have names now that start with HTTP.
    2.  If I take one of these HTTP names and I look it up, I will get back some data in a standard format which is kind of useful data that somebody might like to know about that thing, about that event.
    3.  When I get back that information it's not just got somebody's height and weight and when they were born, it's got relationships. And when it has relationships, whenever it expresses a relationship then the other thing that it's related to is given one of those names that starts with HTTP.


Linked Data Examples

•  DBpedia
    –  ~5 million “things” from Wikipedia
    –  Can be linked to external datasets such as CIA World Factbook, US Census Data
    –  “Give me all cities in New Jersey with more than 10,000 people”
•  Freebase
•  FOAF (friend of a friend)
•  Google Knowledge Graph
    –  https://www.google.com/intl/bn/insidesearch/features/search/knowledge.html


Standards for Linked Data

•  Widely used standards (W3C Recommendations)
    –  JSON-LD (JSON Linked Data)
    –  RDF (Resource Description Framework)


JSON-LD (example from json-ld.org)

•  Provides mechanisms for specifying unambiguous meaning in JSON data
•  Provides extra keys with “@” sign
    –  “@context” (used to define meanings of terms, map to identifiers)
    –  “@type”
    –  “@id”
•  Use cases
    –  Google Knowledge Graph


JSON-LD Example (from https://en.wikipedia.org/wiki/JSON-LD)

{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.example.com",
  "@type": "Person",
  "name": "John Smith",
  "homepage": "http://www.example.com/"
}
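To see what “@context” does, the following is a minimal sketch in Python that replaces each short term with the identifier the context maps it to. It is a hand-rolled illustration using only the standard `json` module, not a conforming JSON-LD processor (the real expansion algorithm also handles value objects, arrays, nested contexts and more). The helper name `expand_term` is an assumption made for this example.

```python
import json

doc = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {
      "@id": "http://xmlns.com/foaf/0.1/workplaceHomepage",
      "@type": "@id"
    },
    "Person": "http://xmlns.com/foaf/0.1/Person"
  },
  "@id": "http://me.example.com",
  "@type": "Person",
  "name": "John Smith",
  "homepage": "http://www.example.com/"
}
""")


def expand_term(term, context):
    """Hypothetical helper: return the identifier the context maps `term` to."""
    definition = context.get(term, term)
    if isinstance(definition, dict):
        # Expanded term definition: the identifier is stored under "@id".
        return definition["@id"]
    return definition


context = doc["@context"]
expanded = {}
for key, value in doc.items():
    if key == "@context":
        continue  # the context itself is not data
    if key == "@type":
        expanded[key] = expand_term(value, context)  # map the type name
    elif key.startswith("@"):
        expanded[key] = value  # other keywords such as "@id" stay as they are
    else:
        expanded[expand_term(key, context)] = value  # map the property name

print(json.dumps(expanded, indent=2))
```

After this step, "name" and "homepage" are no longer ambiguous local keys: each is a full HTTP identifier that another dataset can refer to.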


Graphs – RDF (Resource Description Framework) [materials from w3.org]


Serialisation of RDF Example Graph

This graph can be serialised as XML (don’t worry about syntax!)

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:[email protected]"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>


RDF – Triple Store

•  An alternative format for storing RDF-type data: a triple store

<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#fullName> "Eric Miller" .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#mailbox> <mailto:[email protected]> .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/2000/10/swap/pim/contact#personalTitle> "Dr." .
<http://www.w3.org/People/EM/contact#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .
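Each statement in a triple store is a (subject, predicate, object) triple, so a toy version can be sketched in Python as a list of tuples with a pattern-matching lookup. This is an illustration only: real triple stores index the data and are queried with SPARQL. The function name `match` and the constant names are assumptions for this example, and the mailbox triple is left out because its address is not legible in the slide text.

```python
# Constants used only to keep the tuples short.
ME = "http://www.w3.org/People/EM/contact#me"
CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Each entry is one (subject, predicate, object) statement.
triples = [
    (ME, CONTACT + "fullName", "Eric Miller"),
    (ME, CONTACT + "personalTitle", "Dr."),
    (ME, RDF_TYPE, CONTACT + "Person"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "What is the full name of anything?"
full_names = [o for (_, _, o) in match(triples, predicate=CONTACT + "fullName")]

# "Which things are of type Person?"
people = [s for (s, _, _) in match(triples, predicate=RDF_TYPE, obj=CONTACT + "Person")]

print(full_names)
print(people)
```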


Freebase

•  A large database that connects entities (facts, people, places, organizations …) together as a graph
    –  www.freebase.com
    –  Freebase is the basis of the Google Knowledge Graph that is used to improve search.
        •  https://developers.google.com/knowledge-graph/
•  Retrieving data from the Google Knowledge Graph
    –  Example adapted from http://www.nolan-nichols.com/knowledge-graph-via-sparql.html


Other formats for Graphs: Matrix Representation

[Figure: directed graph on nodes A, B, C, D with edges A→C, C→B, D→B]

     A  B  C  D
A    0  0  1  0
B    0  0  0  0
C    0  1  0  0
D    0  1  0  0

The entry in row X, column Y is ‘1’ iff there is an edge from node X to node Y.

Or use a relational table

Source  Destination
A       C
C       B
D       B
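The two representations carry the same information, which a short Python sketch can show by converting one into the other. The function names `matrix_to_edges` and `edges_to_matrix` are chosen for this example only.

```python
nodes = ["A", "B", "C", "D"]

# Row = source node, column = destination node.
matrix = [
    [0, 0, 1, 0],  # A -> C
    [0, 0, 0, 0],  # B has no outgoing edges
    [0, 1, 0, 0],  # C -> B
    [0, 1, 0, 0],  # D -> B
]


def matrix_to_edges(nodes, matrix):
    """List one (source, destination) row for every 1 in the matrix."""
    return [
        (nodes[i], nodes[j])
        for i, row in enumerate(matrix)
        for j, cell in enumerate(row)
        if cell == 1
    ]


def edges_to_matrix(nodes, edges):
    """Rebuild the adjacency matrix from (source, destination) rows."""
    index = {name: i for i, name in enumerate(nodes)}
    result = [[0] * len(nodes) for _ in nodes]
    for source, destination in edges:
        result[index[source]][index[destination]] = 1
    return result


edges = matrix_to_edges(nodes, matrix)
print(edges)
```

The matrix needs one cell per pair of nodes, while the table needs one row per edge, so the table is usually the more compact choice for sparse graphs.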


What you should know about data formats

•  Why do we have different data formats and why do we wish to transform between different formats?
•  Motivation for using relational databases to manage information
•  Difference between a (standard) relational database and a NoSQL database
•  What is a CSV, what is a spreadsheet, what is the difference?
•  Be able to write regular expressions in Python format (operators .^$*+|[])
•  Difference between HTML and XML and when to use each
•  Motivation behind using XML and XML namespaces
•  Be able to read and write data in XML (elements, attributes, namespaces)
•  Be able to read and write data in JSON
•  Difference between XML and JSON. Applications where each can be used.
•  The purpose of using schemas for XML and JSON data.
•  The motivation behind Linked Data and the purpose of using JSON-LD or RDF to represent it.


Further reading

•  Relational databases
    –  Pages 403-409 of http://i.stanford.edu/~ullman/focs/ch08.pdf
•  XML
    –  http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html
•  JSON and JSON-LD
    –  http://json.org
    –  http://crypt.codemancers.com/posts/2014-02-11-An-introduction-to-json-schema/
    –  https://cloudant.com/blog/webizing-your-database-with-linked-data-in-json-ld/#.Vtp_UMfB_Gw
•  RDF
    –  https://www.w3.org/DesignIssues/LinkedData.html
    –  http://www-sop.inria.fr/acacia/cours/essi2006/Scientific%20American_%20Feature%20Article_%20The%20Semantic%20Web_%20May%202001.pdf
    –  http://www.dlib.org/dlib/may98/miller/05miller.html


COMP20008 Elements of Data Processing Data Pre-Processing and Cleaning


Why is pre-processing needed?

Name            Age         Date of Birth
“Henry”         20.2        20 years ago
Katherine       Forty-one   20/11/66
Michelle        37          5/20/79
Oscar@!!        “5”         13th Feb. 2011
-               42          -
Mike___Moore    669         -
巴拉克奥巴马    52          1961年8月4日


Why is pre-processing needed?

•  Measuring data quality
    –  Accuracy
        •  Correct or wrong, accurate or not
    –  Completeness
        •  Not recorded, unavailable
    –  Consistency
        •  E.g. discrepancies in representation
    –  Timeliness
        •  Updated in a timely way
    –  Believability
        •  Do I trust the data is correct?
    –  Interpretability
        •  How easily can I understand the data?


Major data preprocessing activities

Data mining concepts and techniques, Han et al 2012


Terminology

Height  Weight  Age  Gender
1.8     80      22   Male
1.53    82      23   Male
1.6     62      18   Female

•  The 4 columns (height, weight, age, gender) are features or attributes
•  The data items (3 rows) are called instances or objects
•  Height, Weight and Age are continuous features
•  Gender is a categorical or discrete feature


Data integration

•  Bringing data from multiple sources together
    –  Resolve conflicts
    –  Detect duplicates
•  Will cover in depth in weeks 8 and 9

[Figure: two data sources combined into one integrated data source]


Data reduction

•  Decrease the number of features (columns) or instances (rows)
    –  Sampling strategies
    –  Remove irrelevant features and reduce noise
    –  Easier to visualise, faster to analyse
•  Will cover during section on visualisation (weeks 5 and 6), and feature analysis (weeks 9 and 10)

http://bigdataexaminer.com/data-science/understanding-dimensionality-reduction-principal-component-analysis-and-singular-value-decomposition/


Data cleaning

•  Incomplete (missing data)
•  Noisy data
•  Inconsistent data
•  Intentionally disguised data


Data cleaning – The Process

•  Many tools exist (Google Refine, Kettle, Talend, …)
    –  Data scrubbing
    –  Data discrepancy detection
    –  Data auditing
    –  ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
•  Our emphasis will be to understand some of the methods employed by some of these tools


Missing or incomplete data

•  Lacking feature values
    –  Name=“”
    –  Age=null
•  Types of missing data (Rubin 1976)
    –  Missing completely at random: data are missing independently of observed and unobserved data.
        •  E.g. coin flipping to decide whether or not to answer an exam question.
    –  Missing not completely at random
        •  I create a dataset by surveying the class about how healthy they feel. What is the meaning of missing values for those who don’t respond?
        •  I set an exam and ask a question in hard-to-understand language. What is the meaning of missing values for those who don’t answer the question?


Example: USA Salary survey data

•  Is Person B’s salary missing at random?
•  Very difficult to determine reasons for missingness.
    –  In practice, report assumptions about missingness.

Name      Salary
Person C  $59k
Person D  $63k
Person H  $99k
Person E  $102k
Person G  $140k
Person F  $150k
Person A  $180k
Person B  -


Causes of missing data

•  Why does it occur?
    –  Malfunction of equipment (e.g. sensors)
    –  Not recorded due to misunderstanding
    –  May not be considered important at time of entry
    –  Deliberate
•  How to handle it?
    –  We will look at a number of strategies


Extreme Missing data

•  Movie Recommender systems

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  ….
James   3          2       -               -            -             1           -
John    -          -       1               2            -             -           -
Jill    1          -       -               3            2             1           -

Users and movies
Each user only rates a few movies (say 1%)
Netflix wants to predict the missing ratings for each user


Noisy data

•  Truncated fields (exceeded 80 character limit)
•  Text incorrectly split across cells (e.g. separator issues)
•  Salary=“-5”
•  Some causes
    –  Imprecise instruments
    –  Data entry issues
    –  Data transmission issues


Inconsistent data

•  Different naming representations (“Melbourne University” versus “University of Melbourne”) or (“three” versus “3”)
•  Different date formats (“3/4/2016” versus “3rd April 2016”)
•  Age=20, Birthdate=“1/1/2002”
•  Two students with the same student id
•  Outliers
    –  E.g. 62,72,75,75,78,80,82,84,86,87,87,89,89,90,999
        •  No good if it is a list of ages of hospital patients
        •  Might be OK, though, for a list of people’s numbers of contacts on LinkedIn
    –  Can use automated techniques, but also need domain knowledge
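As one example of an automated technique, the sketch below flags values that lie more than 1.5 interquartile ranges outside the quartiles. This rule (often attributed to Tukey) is an assumption chosen for illustration; the slides do not name a specific method, and whether a flagged value is really an error still depends on the domain.

```python
from statistics import quantiles

values = [62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999]

# Quartiles of the data (n=4 gives the three cut points Q1, Q2, Q3).
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Anything outside these fences is reported as a candidate outlier.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

print(outliers)
```

The code only produces candidates: 999 is impossible as a patient's age but plausible as a count of contacts, so the final decision needs domain knowledge.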


Disguised data

•  Everyone’s birthday is January 1st?
•  Email address is [email protected]
•  Adriaans and Zantinge
    –  “Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”
•  How to handle
    –  Look for “unusual” or suspicious values in the dataset, using knowledge about the domain


Dealing with missing data

•  What are the consequences of missing data?
    –  May break application programs not expecting it
    –  Less power for later analysis
    –  May bias later analysis
•  So, how to handle it?


Strategy 1: Delete all instances with a missing value

•  Sometimes called case deletion
•  Effects
    –  Easy to analyse the new (complete) data
    –  May produce bias in the analysis if the new sample size is small or structure exists in the missing data.


Case deletion

Before case deletion:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma
Mandy   1          2       1               3            3             2           3
James   3          2       -               -            -             1           -
John    -          -       1               2            -             -           -
Jill    1          -       -               3            2             1           -

After case deletion:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma
Mandy   1          2       1               3            3             2           3
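Case deletion on the ratings table can be sketched in a few lines of Python, using `None` for a missing rating. The dictionary layout is an assumption made for this example.

```python
movies = ["Star Wars", "Batman", "Jurassic World", "The Martian",
          "The Revenant", "Lego Movie", "Selma"]

# One list of ratings per person, in the order of `movies`; None = missing.
ratings = {
    "Mandy": [1, 2, 1, 3, 3, 2, 3],
    "James": [3, 2, None, None, None, 1, None],
    "John":  [None, None, 1, 2, None, None, None],
    "Jill":  [1, None, None, 3, 2, 1, None],
}

# Keep only the instances (rows) that have no missing value at all.
complete_cases = {
    person: row for person, row in ratings.items() if None not in row
}

print(complete_cases)
```

Three of the four instances are discarded here, which illustrates why case deletion is a poor fit for very sparse data such as movie ratings.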


Strategy 2: Manually correct

•  A human eyeballs the missing value and fills it in using their expert knowledge

https://en.wikipedia.org/wiki/Eye


Strategy 3: Imputation

•  Impute a value (replace the missing value with a substitute one)
•  After imputing all missing values, can use standard analysis techniques for complete datasets

With missing values:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  ….
James   3          2       -               -            -             1           -
John    -          -       1               2            -             -           -
Jill    1          -       -               3            2             1           -

After imputation:

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  ….
James   3          2       2               2            1             1           1
John    3          2       1               2            2             1           1
Jill    1          1       1               3            2             1           1


Imputation: Fill in with zeros (or similar)

Person  Star Wars  Batman  Jurassic World  The Martian  The Revenant  Lego Movie  Selma  ….
James   3          2       0               0            0             1           0
John    0          0       1               2            0             0           0
Jill    1          0       0               3            2             1           0

•  Simple
•  Won’t break application programs
•  Limited utility for analysis
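A minimal Python sketch of zero filling, again using `None` to mark a missing rating (the data layout is an assumption for this example):

```python
# Ratings in the column order of the table; None = missing.
ratings = {
    "James": [3, 2, None, None, None, 1, None],
    "John":  [None, None, 1, 2, None, None, None],
    "Jill":  [1, None, None, 3, 2, 1, None],
}

# Replace every missing rating with the constant 0.
zero_filled = {
    person: [0 if rating is None else rating for rating in row]
    for person, row in ratings.items()
}

print(zero_filled)
```

The result no longer contains missing values, but a 0 is now indistinguishable from a genuine rating of 0, which is one reason the utility for analysis is limited.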


Imputation: Fill in with mean value

•  Popular method
    –  Can be good for supervised classification
    –  Apply separately to each attribute

Name    Age
Daisy   10
Maisy   15
Harry   2
Jackie  -

Jackie’s age is imputed to be (10+15+2)/3=9
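The same calculation as a short Python sketch, with `None` standing for the missing age (the dictionary layout is an assumption for this example):

```python
from statistics import mean

ages = {"Daisy": 10, "Maisy": 15, "Harry": 2, "Jackie": None}

# The mean is computed from the observed values only.
observed = [age for age in ages.values() if age is not None]
fill_value = mean(observed)

# Every missing age is replaced by that mean.
imputed = {
    name: fill_value if age is None else age for name, age in ages.items()
}

print(imputed)
```

With several attributes, the same steps would be repeated for each column separately.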


Imputation: Fill in with mean value cont

•  Drawbacks
    –  Reduces the variance of the feature
    –  Incorrect view of the distribution of that attribute
    –  Relationships to other features change
•  Can also use median instead of mean (if distribution is skewed)
•  Use mode (most frequent value) imputation for categorical features
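The sketch below illustrates two of these points with Python's `statistics` module. The ages reuse the Daisy, Maisy and Harry example; the `incomes` and `genders` lists are made-up values used only for illustration.

```python
from statistics import mean, median, mode, pvariance

# 1. Mean imputation reduces the variance of the feature.
observed_ages = [10, 15, 2]
ages_after_imputation = observed_ages + [mean(observed_ages)]  # adds a 9
variance_before = pvariance(observed_ages)
variance_after = pvariance(ages_after_imputation)
print(variance_before, variance_after)

# 2. For a skewed feature the median is a more typical fill value.
incomes = [30, 32, 35, 38, 500]  # hypothetical, one very large value
print(mean(incomes), median(incomes))

# 3. For a categorical feature use the most frequent value.
genders = ["Female", "Female", "Male"]  # hypothetical observed values
print(mode(genders))
```

The imputed value sits exactly at the mean, so it adds an instance without adding any spread, and the variance falls.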


Fill in with category mean

•  Take categories/clusters and compute the mean ….

Name    Age  Gender
Daisy   10   Female
Maisy   15   Female
Harry   2    Male
Jackie  -    Female

Jackie’s age is imputed to be (10+15)/2=12.5 (considering the category “Female”)
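A Python sketch of category-mean imputation for this table, grouping on Gender. The record layout and variable names are assumptions for this example.

```python
from collections import defaultdict
from statistics import mean

people = [
    {"name": "Daisy", "age": 10, "gender": "Female"},
    {"name": "Maisy", "age": 15, "gender": "Female"},
    {"name": "Harry", "age": 2, "gender": "Male"},
    {"name": "Jackie", "age": None, "gender": "Female"},
]

# Collect the observed ages of each category.
ages_by_gender = defaultdict(list)
for person in people:
    if person["age"] is not None:
        ages_by_gender[person["gender"]].append(person["age"])

# One mean per category.
category_mean = {g: mean(ages) for g, ages in ages_by_gender.items()}

# Fill each missing age with the mean of that person's own category.
imputed = [
    dict(person, age=category_mean[person["gender"]])
    if person["age"] is None else person
    for person in people
]

print(imputed[3])
```

This sketch assumes every category has at least one observed value; a category with none would need a fallback such as the overall mean.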


Time series: Last value carried forward

Day    Kilometres Walked
Day 1  8.9
Day 2  8.2
Day 3  9.6
Day 4  -
Day 5  11.6
Day 6  12.0

Kilometres walked on Day 4 = ?
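Under last value carried forward, a missing entry takes the most recent observed value before it. A minimal Python sketch, with `None` for the missing day and a function name `locf` chosen for this example:

```python
walked = [8.9, 8.2, 9.6, None, 11.6, 12.0]  # Day 1 to Day 6


def locf(series):
    """Replace each missing value with the last observed value before it."""
    filled = []
    last_seen = None  # stays None until the first observed value
    for value in series:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


filled = locf(walked)
print(filled[3])
```

With this method Day 4 is filled with the Day 3 value, 9.6. A missing value at the very start of the series has nothing to carry forward and is left as `None` in this sketch.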


Acknowledgements

–  Data Mining: Concepts and Techniques. Han, Kamber and Pei. 3rd edition (chapter 3). Available through library as ebook.
–  Data Analysis Using Regression and Multilevel/Hierarchical Models. Gelman and Hill (chapter 25), 2006.


Next Week

•  Second workshop is available on LMS
    –  Practice with JSON and XML and Web scraping
•  Project will be released
•  Continue data-preprocessing and cleaning
    –  Look at more complex techniques for value imputation (e.g. for the movie recommender system example)