phd viva - disambiguating identity web references using social data

Disambiguating Identity Web References using Social Data

Matthew Rowe

Organisations, Information and Knowledge Group
Department of Computer Science

University of Sheffield


Outline

• Problem Setting
• Research Questions
• Claims of the Thesis
• State of the Art
• Requirements for Disambiguation and Seed Data
• Disambiguating Identity Web References
  – Leveraging Seed Data from the Social Web
  – Generating Metadata Models
  – Disambiguation Techniques
• Evaluation
• Conclusions
• Dissemination and Impact


Personal Information on the Web

• Personal information on the Web is disseminated:
  – Voluntarily
  – Involuntarily
• Increase in personal information:
  – Identity Theft
  – Lateral Surveillance
• Web users must discover their identity web references
  – 2 stage process
    • Finding
    • Disambiguating
  – Disambiguation = reduction of web reference ambiguity
• My thesis addresses disambiguation


Ambiguity!


Matthew Rowe: Composer


Matthew Rowe: Cyclist


Matthew Rowe: Gardener


Matthew Rowe: Song Writer


Matthew Rowe: PhD Student


Problem Setting

• Performing disambiguation manually:
  – Time consuming
  – Laborious
    • Handle masses of information
  – Repeated often
    • The Web keeps changing
• Solution = automated techniques
  – Alleviate the need for humans
  – Need background knowledge
    • Who am I searching for?
    • What makes them unique?


Research Questions

How can identity web references be disambiguated automatically?

1. Alleviate human processing:
  • Can automated techniques replace humans?
2. Supervision:
  • Can automated techniques function independently?
3. Seed Data:
  • How can this be gathered inexpensively?
4. Interpretation:
  • How can automated techniques interpret information?


Claims of the Thesis

• Automated disambiguation techniques are able to replace human processing
  – Retrieve and process information at large-scale
  – With high accuracy
• Data found on Social Web platforms is representative of real identity information
  – Platforms allow users to build a digital identity
• Social data provides the background knowledge required by automated disambiguation techniques
  – Overcoming the burden of seed data generation


State of the Art

• Disambiguation techniques are divisible into 2 types:
  – Seeded techniques
    • E.g. [Bekkerman and McCallum, 2005], Commercial Services
    • Pros:
      – Disambiguate web references for a single person
    • Cons:
      – Require seed data
      – No explanation of how seed data is acquired
  – Unseeded techniques
    • E.g. [Song et al, 2007]
    • Pros:
      – Require no background knowledge
    • Cons:
      – Groups web references into clusters
      – Need to choose the correct cluster


Requirements

• Requirements for Seeded Disambiguation:
  – Bootstrap the disambiguation process with minimal supervision
  – Achieve disambiguation accuracy comparable to human processing
  – Cope with web resources not containing seed data features
  – Disambiguation must be effective for all individuals
• Requirements for Seed Data:
  – Produce seed data with minimal cost
  – Generate reliable seed data


Disambiguating Identity Web References


Harnessing the Social Web

• WWW has evolved into a web of participation
• Digital identity is important on the Social Web
• Digital identity is fragmented across the Social Web
• Data Portability from Social Web platforms is limited

http://www.economist.com/business/displaystory.cfm?story_id=10880936



Data found on Social Web platforms is representative of real identity information


User Study

• 50 participants from the University of Sheffield
• Consisted of 3 stages, each participant:
  1. List real world social network
  2. Extract digital social network
  3. Compare networks

M Rowe. The Credibility of Digital Identity Information on the Social Web: A User Study. In proceedings of 4th Workshop on Information Credibility on the Web, World Wide Web Conference 2010. Raleigh, USA. (2010)

Data found on Social Web platforms is representative of real identity information

Relevance: 0.23
Coverage: 0.77

Updates previous findings [Subrahmanyam et al, 2008]


Leveraging Seed Data from the Social Web




M Rowe and F Ciravegna. Getting to Me - Exporting Semantic Social Network Information from Facebook. In proceedings of Social Data on the Web Workshop, ISWC 2008, Karlsruhe, Germany. (2008)

Use Semantics!

http://www.dcs.shef.ac.uk/~mrowe/foafgenerator.html




Link things together!



1. Blocking Step
  • Only compare people with the same name
2. Compare values of Inverse Functional Properties
  • E.g. Homepage/Email
3. Compare Geo URIs
  • E.g. Matching locations
4. Compare Geo data
  • Using Linked Data sources
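
The first three interlinking steps can be sketched in a few lines of Python. This is an illustration only, not the implementation from the cited paper: the flat profile dictionaries, the key names (`name`, `homepage`, `email`, `geo_uri`) and the example.org values are all assumptions, and step 4 (looking up geo data in Linked Data sources) is left out.

```python
from collections import defaultdict

# Inverse functional properties: a shared value implies the same person.
IFP_KEYS = ("homepage", "email")

def interlink(profiles):
    """Return pairs of profile ids judged to describe the same person."""
    # Step 1 (blocking): only compare profiles that share a name.
    blocks = defaultdict(list)
    for p in profiles:
        blocks[p["name"].strip().lower()].append(p)

    matches = []
    for block in blocks.values():
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                # Step 2: any shared inverse functional value is a match.
                same_ifp = any(a.get(k) and a.get(k) == b.get(k) for k in IFP_KEYS)
                # Step 3: otherwise fall back to matching location URIs.
                same_geo = a.get("geo_uri") and a.get("geo_uri") == b.get("geo_uri")
                if same_ifp or same_geo:
                    matches.append((a["id"], b["id"]))
    return matches

# Hypothetical profiles exported from three platforms.
profiles = [
    {"id": "facebook:1", "name": "Matthew Rowe", "homepage": "http://example.org/~mrowe"},
    {"id": "twitter:7", "name": "Matthew Rowe", "homepage": "http://example.org/~mrowe"},
    {"id": "myspace:3", "name": "Matthew Rowe", "geo_uri": "http://example.org/geo/cardiff"},
]
links = interlink(profiles)
```

Blocking keeps the number of pairwise comparisons small, since only profiles with the same name are ever compared.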

M Rowe. Interlinking Distributed Social Graphs. In proceedings of Linked Data on the Web Workshop, World Wide Web Conference, Madrid, Spain. (2009)



• Allows remote resource information to change
• Automated techniques:
  – Follow the links
  – Retrieve the instance information


Generating Metadata Models

• Input to disambiguation techniques is a set of web resources
• Web resources come in many flavours:
  – Data models
  – XHTML documents containing embedded semantics
  – HTML documents

4. Interpretation: How can automated techniques interpret information?

• Solution = Semantic Web technologies!
  – Convert web resources to RDF
  – Metadata descriptions = ontology concepts
• Information is
  – Consistent
  – Interpretable


Generating RDF Models from XHTML Documents

http://events.linkeddata.org/ldow2009/





Generating RDF Models from HTML Documents

• Rise in use of lowercase semantics!
  – However only 2.6% of web documents contain semantics [Mika et al, 2009]
• Majority of the web is HTML
  – Bad for machines
• Must extract person information
  – Then build an RDF model
• Person information is structured
  – for legibility
  – for segmentation
    • i.e. logical distinction between elements



• HTML is often poorly structured
  – Need a Document Object Model
  – Therefore Tidy it!



• Identify document segments for extraction
  – 1 window = Info about 1 person
  – Get XPath expression to the window



• Extract information using a Hidden Markov Model
  – E.g. name, email, www, location
  – Train model parameters: Transition probs, emission probs, start probs
  – Use Viterbi algorithm to label tokens with states
    – Returns most likely state sequence
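
The Viterbi labelling step can be sketched in Python as below. This is the textbook algorithm in log space, not the thesis code. The two states, every probability and the `floor` smoothing value for unseen tokens are made-up assumptions; a real model would estimate them from labelled training windows.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for tokens (log-space Viterbi)."""
    def emit(state, token):
        # Unseen tokens get a small floor probability instead of zero.
        return math.log(emit_p[state].get(token, floor))

    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{s: (math.log(start_p[s]) + emit(s, tokens[0]), None) for s in states}]
    for token in tokens[1:]:
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            row[s] = (score + emit(s, token), prev)
        best.append(row)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]

# Toy model with hypothetical parameters.
states = ("name", "email")
start_p = {"name": 0.8, "email": 0.2}
trans_p = {"name": {"name": 0.6, "email": 0.4}, "email": {"name": 0.3, "email": 0.7}}
emit_p = {"name": {"matthew": 0.5, "rowe": 0.5}, "email": {"m.rowe@example.org": 1.0}}

labels = viterbi(["matthew", "rowe", "m.rowe@example.org"], states, start_p, trans_p, emit_p)
```

Working with log-probabilities avoids numeric underflow when a window contains many tokens.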



M Rowe. Data.dcs: Converting Legacy Data into Linked Data. In proceedings of Linked Data on the Web Workshop, World Wide Web Conference 2010. Raleigh, USA. (2010)


Disambiguation 1: Inference Rules

1. Extract instances from Seed Data
2. For each instance, build a rule:
  • Build a skeleton rule
  • Add triples to the rule
  • Create a new rule if a triple’s predicate is Inverse Functional
3. Apply the rules to the web resources



PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
WHERE {
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
  ?url foaf:topic ?p .
  ?p foaf:name ?n .
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
  ?q foaf:name ?m .
  ?url foaf:topic ?r .
  ?r foaf:name ?m
}
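
To make the graph pattern concrete, the sketch below evaluates the same conditions in plain Python over triples held as tuples. It is not a SPARQL engine and not the thesis implementation. The prefixed string predicates, the `ME` identifier, the blank-node labels and the example.org pages are assumptions chosen for illustration.

```python
def names_of_topics(page_triples, url):
    """Names of all people that `url` is about (foaf:topic, then foaf:name)."""
    topics = {o for s, p, o in page_triples if s == url and p == "foaf:topic"}
    return {o for s, p, o in page_triples if s in topics and p == "foaf:name"}

def infer_pages(me, seed_triples, page_triples):
    """Infer (me, foaf:page, url) when a page mentions my name and the name
    of someone I know, mirroring the CONSTRUCT rule above."""
    my_names = {o for s, p, o in seed_triples if s == me and p == "foaf:name"}
    friends = {o for s, p, o in seed_triples if s == me and p == "foaf:knows"}
    friend_names = {o for s, p, o in seed_triples if s in friends and p == "foaf:name"}

    inferred = set()
    for url in {s for s, p, o in page_triples if p == "foaf:topic"}:
        names = names_of_topics(page_triples, url)
        if names & my_names and names & friend_names:
            inferred.add((me, "foaf:page", url))
    return inferred

ME = "ex:me"  # hypothetical identifier for the seed person
seed = [
    (ME, "foaf:name", "Matthew Rowe"),
    (ME, "foaf:knows", "_:f1"),
    ("_:f1", "foaf:name", "Alice Example"),
]
pages = [
    ("http://example.org/a", "foaf:topic", "_:a1"),
    ("_:a1", "foaf:name", "Matthew Rowe"),
    ("http://example.org/a", "foaf:topic", "_:a2"),
    ("_:a2", "foaf:name", "Alice Example"),
    ("http://example.org/b", "foaf:topic", "_:b1"),
    ("_:b1", "foaf:name", "Matthew Rowe"),
]
inferred = infer_pages(ME, seed, pages)
```

Page `b` shares only the ambiguous name and is not inferred; page `a` also mentions a known contact, so the rule fires.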








  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
  ?q foaf:homepage ?h .
  ?url foaf:topic ?r .
  ?r foaf:homepage ?h
}








Advantages:
• Highly precise
• Applies graph patterns

Disadvantages:
• Does not learn from past decisions (supervised)
• Strict matching: lack of generalisation

M Rowe. Inferring Web Citations using Social Data and SPARQL Rules. In proceedings of Linking of User Profiles and Applications in the Social Semantic Web, Extended Semantic Web Conference 2010. Heraklion, Crete. (2010)



Disambiguation 2: Random Walks

• Seed data and web resources are RDF
  – RDF has a graph structure:
    <subject, predicate, object>
    <source_node, edge, target_node>
• Graph-based disambiguation techniques:
  – E.g. [Jiang et al, 2009]
  – Build a graph-space
  – Partition data points in the graph-space
• Requires methods to:
  – Compile a graph-space
  – Compare nodes
  – Cluster nodes



• Link the social graph with the web resources
  • Via common resources/literals





• Graph space may contain islands of nodes
  • Inhibit transitions through the graph space
• Get the component containing the social graph
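
Extracting the component that contains the social graph amounts to a graph traversal. The sketch below uses breadth-first search over triples treated as undirected edges. The node labels and predicate names are hypothetical; linking through a shared name literal follows the "common resources/literals" idea.

```python
from collections import defaultdict, deque

def component_of(triples, start):
    """Return every node reachable from `start`, ignoring edge direction."""
    adjacent = defaultdict(set)
    for subject, _predicate, obj in triples:
        adjacent[subject].add(obj)
        adjacent[obj].add(subject)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacent[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen

# Hypothetical graph space: page2 shares nothing with the social graph.
triples = [
    ("me", "name", "Matthew Rowe"),
    ("page1", "mentions", "Matthew Rowe"),
    ("page2", "mentions", "Someone Else"),
]
component = component_of(triples, "me")
```

Here `page2` forms an island and is dropped before any random walk is attempted.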



• Perform Random Walks through the graph
  1. Derive Adjacency Matrix
  2. Derive Diagonal Degree Matrix
  3. Compute Transition Probability Matrix
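
A minimal sketch of the three matrix steps, assuming an undirected, unweighted graph and the usual row-normalised form P = D^-1 A. The slides do not state the exact weighting used, so both choices are assumptions. Every node is assumed to have at least one edge, which holds after the component has been extracted.

```python
def transition_matrix(nodes, edges):
    """Build the random-walk transition matrix P = D^-1 A as nested lists."""
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)

    # 1. Adjacency matrix A (undirected, unweighted).
    A = [[0.0] * size for _ in range(size)]
    for a, b in edges:
        A[index[a]][index[b]] = 1.0
        A[index[b]][index[a]] = 1.0

    # 2. Diagonal degree matrix D, stored as its diagonal only.
    degree = [sum(row) for row in A]

    # 3. Transition probabilities: divide each row of A by the node's degree.
    return [[A[i][j] / degree[i] for j in range(size)] for i in range(size)]

# Path graph a - b - c.
P = transition_matrix(["a", "b", "c"], [("a", "b"), ("b", "c")])
```

Each row of `P` sums to 1, so `P[i][j]` is the probability of stepping from node i to node j.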



• Measure Distances:
  • Commute Time distance
    • Leave node i : reach node j : return to node i
  • Optimum Transitions
    • Move through the graph until probability peaks
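
Commute time can be illustrated as the sum of two expected hitting times. The sketch below solves the hitting-time equations by simple iteration; this is one possible way to compute it, chosen because it needs no linear algebra library, and it is not necessarily how the thesis does it. The sweep limit and tolerance are arbitrary assumptions, and Optimum Transitions is not covered.

```python
def hitting_time(P, target, sweeps=10_000, tol=1e-12):
    """Expected number of steps to first reach `target` from each node.

    Iterates h[i] = 1 + sum_k P[i][k] * h[k] with h[target] fixed at 0.
    """
    n = len(P)
    h = [0.0] * n
    for _ in range(sweeps):
        delta = 0.0
        for i in range(n):
            if i == target:
                continue
            new = 1.0 + sum(P[i][k] * h[k] for k in range(n) if k != target)
            delta = max(delta, abs(new - h[i]))
            h[i] = new
        if delta < tol:
            break
    return h

def commute_time(P, i, j):
    """Expected steps to leave node i, reach node j, and return to node i."""
    return hitting_time(P, j)[i] + hitting_time(P, i)[j]

# Transition matrix of the path graph 0 - 1 - 2.
P = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
near = commute_time(P, 0, 1)
far = commute_time(P, 0, 2)
```

Nodes joined by short or numerous paths get a small commute time, which is what makes it usable as a distance between a web resource and the social graph.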





• Group web resources with social graph
  • Via agglomerative clustering
  • Every point is in a cluster
  • Merge clusters until none can be merged
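
The clustering step can be sketched as threshold-based agglomerative clustering. Single linkage is an assumption here, since the slides do not name the linkage criterion, and the one-dimensional positions stand in for real random-walk distances.

```python
def agglomerate(points, distance, threshold):
    """Merge the closest clusters until no pair is within `threshold`."""
    clusters = [{p} for p in points]  # every point starts in its own cluster
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters  # nothing left that can be merged
        _, i, j = best
        clusters[i] |= clusters.pop(j)

# Hypothetical positions standing in for pairwise graph distances.
position = {"social_graph": 0.0, "page_a": 1.0, "page_b": 1.5, "page_c": 10.0}
clusters = agglomerate(position, lambda a, b: abs(position[a] - position[b]), threshold=2.0)
mine = next(c for c in clusters if "social_graph" in c)
```

Web resources that end up in the same cluster as the social graph are taken as references to the person; the result depends directly on the chosen threshold.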



Advantages:
• Semi-supervised
• Exploits the graph structure of RDF

Disadvantages:
• Computationally heavy (Matrix powers!)
• Relies on tuning clustering threshold

M Rowe. Applying Semantic Social Graphs to Disambiguate Identity References. In proceedings of European Semantic Web Conference 2009, Heraklion, Crete. (2009)


Disambiguation 3: Self-training

• Classic ML scenario:
  – Lots of unlabelled data
  – Limited labelled data
• Disambiguating identity web references is just the same!
  – Possible web citations = large
  – Social data = small
• Semi-supervised learning is a solution
  – Train a classifier
  – Using labelled and unlabelled data!
• Classification task is binary
  – Does this web resource refer to person X or not?


• Positive training data = seed data
• Generate negative training data:
  – Via Rocchio classification:
    1. Build centroid vectors: positive set and negative set
      • Negative set = unlabelled data
    2. Compare possible web citations with vectors
    3. Choose strongest negatives
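
A rough sketch of the negative-selection steps, assuming bag-of-words vectors and cosine similarity. The slides do not state the vector representation or the comparison measure, so both are assumptions, as are the toy documents and the cut-off `k`.

```python
import math
from collections import Counter

def centroid(vectors):
    """Average a list of sparse term-weight dictionaries."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def strongest_negatives(seed, unlabelled, k):
    """Pick up to k unlabelled items that sit closest to the negative centroid."""
    # 1. Centroids: positives from seed data, negatives from all unlabelled data.
    positive = centroid(seed)
    negative = centroid(list(unlabelled.values()))
    # 2. Compare each candidate with both centroids.
    margin = {name: cosine(v, negative) - cosine(v, positive)
              for name, v in unlabelled.items()}
    # 3. Keep the candidates that lean most strongly towards the negative side.
    ranked = sorted(margin, key=margin.get, reverse=True)
    return [name for name in ranked[:k] if margin[name] > 0]

seed = [{"sheffield": 1, "semantic": 1, "web": 1}]
unlabelled = {
    "page_phd": {"sheffield": 1, "semantic": 1},
    "page_cyclist": {"cycling": 1, "race": 1},
    "page_gardener": {"garden": 1, "plants": 1},
}
negatives = strongest_negatives(seed, unlabelled, k=2)
```

Treating all unlabelled data as negative is noisy, which is why only the strongest negatives are kept for training.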



• Begin Self-training:
  1. Train the Classifier
  2. Classify the web resources
  3. Rank classifications
  4. Enlarge training sets
  5. Repeat steps 1-4
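
The loop can be sketched generically, with the classifier passed in as a function. Everything below is illustrative: `fit_overlap` is a toy stand-in for a real classifier, the feature sets are invented, and stopping when the unlabelled pool is empty is an assumption, since the slides do not give the stopping criterion.

```python
def self_train(fit, positives, negatives, unlabelled, per_round=1):
    """Self-training loop. `fit(pos, neg)` returns a scoring function where
    a score above 0 means 'refers to the person'."""
    positives, negatives = list(positives), list(negatives)
    pool = list(unlabelled)
    while pool:  # 5. repeat until nothing is left to label
        score = fit(positives, negatives)                # 1. train
        ranked = sorted(pool, key=lambda x: abs(score(x)),
                        reverse=True)                    # 2-3. classify and rank
        confident, pool = ranked[:per_round], ranked[per_round:]
        for item in confident:                           # 4. enlarge training sets
            (positives if score(item) > 0 else negatives).append(item)
    return positives, negatives

def fit_overlap(pos, neg):
    """Toy classifier: features seen in positives minus features seen in negatives."""
    pos_features = set().union(*pos)
    neg_features = set().union(*neg)
    return lambda x: len(x & pos_features) - len(x & neg_features)

u1 = frozenset({"sheffield", "foaf"})
u2 = frozenset({"foaf", "linked data"})
u3 = frozenset({"race", "garden"})
positives, negatives = self_train(
    fit_overlap,
    positives=[frozenset({"sheffield", "semantic web"})],
    negatives=[frozenset({"cycling", "race"})],
    unlabelled=[u1, u2, u3],
)
```

In this toy run `u2` shares nothing with the seed data and only becomes classifiable after `u1` has been added to the positives, which is the point of retraining on each round.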



• Training/Testing data is RDF
• Convert to a machine learning dataset
  – Features = RDF instances
• Vary the feature similarity measure:
  – Jaccard Similarity
  – Inverse Functional Property Matching
  – RDF Entailment
• Tested three different classifiers:
  – Perceptron
  – Support Vector Machine
  – Naïve Bayes
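
As an illustration of the Jaccard option, the sketch below treats each RDF instance as a set of (predicate, object) pairs and builds one feature per seed instance. That representation and the max-over-instances aggregation are assumptions, since the slides do not spell out how the feature vector is built.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def feature_vector(seed_instances, resource_instances):
    """One feature per seed instance: its best Jaccard match in the resource."""
    return [
        max((jaccard(seed, inst) for inst in resource_instances), default=0.0)
        for seed in seed_instances
    ]

# Hypothetical instances as sets of (predicate, object) pairs.
me = {("name", "Matthew Rowe"), ("homepage", "http://example.org/~mrowe")}
page_person = {("name", "Matthew Rowe"), ("location", "Sheffield")}
features = feature_vector([me], [page_person])
```

Unlike Inverse Functional Property matching, Jaccard gives partial credit for partial overlap, so a shared name alone still yields a non-zero feature value.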



• Advantages
  – Directly learn from disambiguation decisions
  – Utilise abundance of unlabelled data
• Disadvantages
  – Requires reliable negatives
  – Mistakes can reinforce themselves

M Rowe and F Ciravegna. Harnessing the Social Web: The Science of Identity Disambiguation. In proceedings of Web Science Conference 2010. Raleigh, USA. (2010)



Evaluation

• Measures:
  – Precision, Recall, F-Measure
• Dataset
  – 50 participants from the Semantic Web and Web 2.0 communities
  – ~17300 web resources: 346 web resources for each participant
• Baselines
  – Baseline 1: Person name as positive classification
  – Baseline 2: Hierarchical Clustering using Person Names
    • [Malin, 2005]
  – Baseline 3: Human Processing
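
For reference, the three measures for a single participant can be computed as below, using the standard definitions with F as the balanced harmonic mean. The page identifiers are made up. Note that the F-measures reported in the result tables are not the harmonic mean of the reported precision and recall (0.955 and 0.436 would give about 0.599 rather than 0.553), which would be consistent with scores being averaged per participant; that reading is an assumption.

```python
def precision_recall_f(predicted, relevant):
    """Precision, recall and balanced F-measure for one participant."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical output: 4 pages labelled as citations, 3 pages truly are.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
```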


Evaluation: Inference Rules

• High precision
  – Better than humans
  – Precise graph pattern matching
• Low recall
  – Rules are strict
    • No room for variability
  – Hard to generalise
    • No learning from disambiguation decisions

                    Precision  Recall  F-Measure
Inference Rules     0.955      0.436   0.553
Baseline 1          0.191      0.998   0.294
Baseline 2          0.648      0.592   0.556
Baseline 3          0.765      0.725   0.719


Evaluation: Random Walks

• High recall
  – Higher than humans
  – Incorporates unlabelled data into random walks
    • Uses features not in the seed data
• Precision
  – Lower than humans and rules
  – Ambiguous name literals lead to false positives

                     Precision  Recall  F-Measure
Commute Time         0.707      0.798   0.705
Optimum Transitions  0.659      0.805   0.684
Baseline 1           0.191      0.998   0.294
Baseline 2           0.648      0.592   0.556
Baseline 3           0.765      0.725   0.719


Evaluation: Self-training

                         Precision  Recall  F-Measure
Perceptron + Entailment  0.629      0.905   0.728
Perceptron + IFP         0.630      0.878   0.715
Perceptron + Jaccard     0.651      0.820   0.700
SVM + Entailment         0.613      0.910   0.731
SVM + IFP                0.628      0.864   0.711
SVM + Jaccard            0.755      0.695   0.691
Baseline 1               0.191      0.998   0.294
Baseline 2               0.648      0.592   0.556
Baseline 3               0.765      0.725   0.719

• High Recall
  – SVM + Entailment classifies 91% of references
• High F-Measure
  – Higher than humans
    • Perceptron + Entailment and SVM + Entailment


Conclusions: Research Questions

1. Alleviate human processing:
  • Can automated techniques replace humans?
    – Performance is comparable to humans
    – Suited to low web presence
2. Supervision:
  • Can automated techniques function independently?
    – Inference Rules: Induce rules from seed data
    – Random Walks: Graph space built from models
    – Self-training: Learn + retrain a classifier
3. Seed Data:
  • How can this be gathered inexpensively?
    – Utilise Social Web platforms
    – Digital identities are similar to real world identities
4. Interpretation:
  • How can automated techniques interpret information?
    – Solution = Semantic Web technologies
    – Convert web resources into metadata models


Conclusions: Claims

• Automated disambiguation techniques are able to replace human processing
  – Techniques are comparable to humans
  – Overcome manual processing
• Data found on Social Web platforms is representative of real identity information
  – 77% of a real world social network is covered online
• Social data provides the background knowledge required by automated disambiguation techniques
  – Techniques function using social data
  – Biographical and social network information enables disambiguation


Dissemination and Impact

• Published 21 peer-reviewed publications
  – Paper in the Journal of Web Semantics (impact: 3.5)
  – Presented work at many international conferences
• Program committee member for 5 international workshops
• Invited Expert for the World Wide Web Consortium’s Social Web Incubator Group
• Listed as one of top 100 visionaries “discussing the future of the web”
  http://www.semanticweb.com/semanticweb100/
• Linked Data service for the DCS
  – Best Poster at the Extended Semantic Web Conference 2010
  http://data.dcs.shef.ac.uk
• Tools widely used by the Semantic Web community
  – FOAF Generator
  – Social Identity Schema Mapping (SISM) Vocabulary


Questions?

Twitter: @mattroweshow
Web: http://www.dcs.shef.ac.uk/~mrowe
Email: [email protected]

For a condensed version of my thesis:

M Rowe and F Ciravegna. Disambiguating Identity Web References using Web 2.0 Data and Semantics. In Press for special issue on "Web 2.0" in the Journal of Web Semantics. (2010)
