machine learning and the semantic web hendrik blockeel katholieke universiteit leuven department of...

Machine Learning and the Semantic Web

Hendrik Blockeel

Katholieke Universiteit Leuven

Department of Computer Science

Thanks: Raymond Kosala, Nico Jacobs

Overview

Machine learning and data mining
Relationship with semantic web
  Synergy between both
Some concrete examples
  Document classification
  Information integration
Conclusions

Machine Learning & Data Mining

Related technology, different focus
Machine learning:
  Programs that improve their performance on certain tasks
  Focus on adaptive behaviour
Data mining:
  Discovering implicit knowledge (regularities) in large amounts of data
  Focus on handling large amounts of data
Very useful technology in the context of the Web

Learning Agents

Programs that
  Learn the user’s preferences
    Make life for the user as simple as possible
    E.g., intelligent mail reader
    E.g., adaptive web pages
      Move links, create “direct” links, ...
      Index page synthesis (Perkowitz & Etzioni, IJCAI 1999)
  Learn how to find reliable information
    E.g., learn which other people have similar preferences to this user, use their opinions to make suggestions
(Other applications: learning to play games, ...)

Mining the Web

Analyze data that are available on the Web
Distinguish 3 types:
  Web content mining
    Look in contents of documents (text, ...)
  Web structure mining
    Look at links between documents
  Web usage mining
    Look at user logs (e.g., who accessed a web page, which links are often used, ...)

Web Content Mining

Relies on information extraction
  E.g., in a text: find keywords, ...
  Techniques from machine learning, statistics, ... used to guess from context
    what a word means
    what its function in the text is
    ...
  Fill a schema with specific slots, based on analysis of text
Even more complicated: recognise objects in pictures, ...
Information extraction (IE) is a complex matter

Mining for Genes

Jenssen et al. (2001), Nature Genetics 28, “A literature network of human genes”

  Mining MEDLINE database of abstracts
  Find names of genes occurring together
  Construct similarity graph
  Construct a database with this information
Database contains knowledge no single individual has, or could obtain without data mining
Similar techniques could be used on the web
  One extra problem: uncertainty about reliability
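
The co-occurrence step above can be sketched in a few lines of Python. This is only an illustration, not the method of Jenssen et al.: the gene lexicon and the three "abstracts" are invented toy data, and gene mentions are found by naive whitespace tokenisation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: a tiny gene lexicon and three made-up "abstracts".
GENES = {"BRCA1", "TP53", "EGFR"}
ABSTRACTS = [
    "BRCA1 and TP53 are both altered in this tumour sample",
    "TP53 interacts with BRCA1 under stress",
    "EGFR signalling was studied in isolation",
]

def cooccurrence_graph(abstracts, genes):
    """Count, for every pair of genes, the abstracts mentioning both."""
    edges = Counter()
    for text in abstracts:
        # Sorting gives each pair one canonical orientation.
        mentioned = sorted(genes.intersection(text.split()))
        for pair in combinations(mentioned, 2):
            edges[pair] += 1
    return edges

graph = cooccurrence_graph(ABSTRACTS, GENES)
print(graph)
```

The edge weights play the role of the similarity graph: the more abstracts mention two genes together, the stronger the link between them.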

Web Structure Mining

Analyse structure of the web
  Which sites have many incoming / outgoing links?
    Identify “hubs”
  Find clusters of sites that are strongly interconnected
    Web communities
  ...
E.g., Google
  Identifies important pages based on the links that point to them (rather than the contents of the page itself)
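
Link-based ranking of this kind can be illustrated with a simplified PageRank-style power iteration. The four-page link graph, the damping factor of 0.85 and the fixed number of iterations are assumptions chosen for illustration; this is a sketch of the idea, not a description of Google's actual system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Every page in the toy graph has at least one outgoing link,
    so pages without outgoing links are not handled here.
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            # A page passes its rank on, split evenly over its links.
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages all point to "popular".
LINKS = {
    "a": ["popular"],
    "b": ["popular"],
    "c": ["popular", "a"],
    "popular": ["c"],
}
ranks = pagerank(LINKS)
best = max(ranks, key=ranks.get)
print(best, ranks)
```

Note that the ranking never looks at page contents: only the link structure decides which page comes out on top.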

Web Usage Mining

Log user behaviour
  Which links are often followed, in which order, how long is a page looked at, ...
Possible at several levels:
  General usage statistics
  User-specific statistics
    Relating behaviour to properties of user, insofar available
E.g., adaptive web sites
  Adaplix project: automatic index page creation
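
As a minimal sketch of general usage statistics, the code below counts how often each link is followed in a click log and proposes "direct" links to pages that users often reach two clicks after a given start page. The session data, the function names and the `min_count` threshold are all hypothetical, and this is not the Adaplix algorithm.

```python
from collections import Counter

# Hypothetical click log: one list of visited pages per user session.
SESSIONS = [
    ["home", "courses", "cs101"],
    ["home", "staff"],
    ["home", "courses", "cs101"],
    ["home", "courses", "cs202"],
]

def transition_counts(sessions):
    """Count how often each link (from_page, to_page) was followed."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts

def suggest_direct_links(sessions, start, min_count=2):
    """Pages reached at least min_count times two clicks after start."""
    two_step = Counter(
        pages[i + 2]
        for pages in sessions
        for i in range(len(pages) - 2)
        if pages[i] == start
    )
    return [page for page, n in two_step.most_common() if n >= min_count]

counts = transition_counts(SESSIONS)
shortcuts = suggest_direct_links(SESSIONS, "home")
print(counts.most_common(1), shortcuts)
```

A user-specific variant would keep one such counter per user instead of one for the whole site.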

Web Mining As It Currently Is

Machine learning / data mining strongly rely on
  Data quantity
  Data quality
Quantity is usually not a problem on the Web
Quality is!
  Much data not in easily processable format
    E.g., inside text documents: need information extraction
    Unstructured, poorly structured, heterogeneously structured
  Lots of noise
  ...

How Is All This Related to the Semantic Web?

There can be a synergy:
  Machine learning can help with building the Semantic Web
  The Semantic Web will help mining the Web, making Web interfaces and agents more intelligent

What Machine Learning Can Do for the Semantic Web

Upgrading the current web to a semantic web involves a lot of work
  Can partially be automated!
Examples:
  Learning ontologies
  Automatic document classification
  Information integration
  ...

Learning Ontologies

Maedche & Staab (2001), “Ontology learning for the semantic web”

View:
  Manually creating ontologies is very labour-intensive
  Fully automating the creation of ontologies is not feasible
  Hence: develop a tool that helps with building ontologies
Basic components:
  Good graphical interface (man-machine interaction)
  Powerful underlying machine learning techniques

Text-To-Onto

Framework:
  Import / reuse existing ontologies
  Extract ontology from documents
    Identify new terms, map onto existing concepts or define new ones
    Identify relationships between concepts
    ...
    Many opportunities for general machine learning techniques
  Prune ontology
  Refine ontology

Some Useful Techniques for Learning Ontologies

Term extraction from texts
Identification of concepts
Hierarchical clustering
  Clustering: finding groups of “similar” things
  Hierarchical clustering: clusters of clusters
  Taxonomy can be constructed through hierarchical clustering of concepts
Association rules
  Find sets of terms that often occur together
  May indicate important relations
    E.g., events in texts often co-occur with locations
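
To make the hierarchical clustering idea concrete, here is a small agglomerative (bottom-up) sketch. It assumes each term is described by the set of words it co-occurs with, compares terms by Jaccard similarity, and merges clusters by single link; the data and these choices are illustrative assumptions, not the specific techniques used in Text-To-Onto.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the overlap over size of the union."""
    return len(a & b) / len(a | b)

def agglomerate(contexts):
    """Single-link agglomerative clustering.

    contexts maps each term to the set of words it co-occurs with.
    Returns the merge history, from which a taxonomy can be read off.
    """
    clusters = [(term,) for term in sorted(contexts)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(
                    jaccard(contexts[x], contexts[y])
                    for x in clusters[i]
                    for y in clusters[j]
                )
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical co-occurrence contexts for four terms.
CONTEXTS = {
    "hotel": {"room", "book", "night", "stay"},
    "hostel": {"room", "book", "night", "cheap"},
    "train": {"ticket", "station", "book"},
    "bus": {"ticket", "station", "stop"},
}
history = agglomerate(CONTEXTS)
print(history)
```

Each merge corresponds to a candidate parent concept: the first two merges group accommodation terms and transport terms, and the final merge forms the root of the taxonomy.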

Information Integration

Doan, Domingos, Halevy: “Reconciling Schemas of Disparate Data Sources”, ACM SIGMOD 2001

Context:
  Given databases with different schemas:
    Find similarities in schemas, guess how concepts map onto each other
    Integrate the schemas
  Essentially the same as mapping ontologies onto each other
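
A very rough sketch of "guessing how concepts map onto each other" is given below. It uses only string similarity between attribute names, computed with Python's standard `difflib`, which is a much cruder signal than a learned matcher and is not the approach of Doan et al. The schemas and the 0.5 threshold are hypothetical.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """String similarity of two attribute names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_schemas(source, target, threshold=0.5):
    """Guess, for each source attribute, the most similar target attribute."""
    mapping = {}
    for attr in source:
        best = max(target, key=lambda t: name_similarity(attr, t))
        # Leave the attribute unmapped when even the best guess is weak.
        if name_similarity(attr, best) >= threshold:
            mapping[attr] = best
    return mapping

# Hypothetical attribute names from two databases.
SCHEMA_A = ["phone_number", "home_address", "full_name"]
SCHEMA_B = ["name", "address", "phone"]
mapping = match_schemas(SCHEMA_A, SCHEMA_B)
print(mapping)
```

Name similarity fails as soon as two sources use unrelated names for the same concept, which is where learning from the data values themselves becomes useful.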

Automated Document Classification

Mitchell et al.
  Based on examples of web pages + what kind of page they are (course page, student page, ...), learn to classify new pages
  Can be based on contents of page, links pointing to page, typical structure of certain kinds of web sites (e.g., universities), ...
Note: helps to relate objects to ontology
Problem: how to get labelled examples
  Unlimited amount of unlabelled pages available
  But labelling them manually is labour-intensive!

Exploiting Unlabelled Data

A solution: co-training (Blum & Mitchell 1998)
  Learn separate (imperfect) classifiers from disjoint sets of sufficient information
    E.g., learn to classify pages from
      Content of page (“Home page of CS 101”)
      Links pointing to page (“CS 101”)
  Take the classifications that classifier A is most certain of, add these labels to the training set for B (and vice versa)
  Repeat multiple times (a kind of bootstrapping process)
Co-training makes it possible to exploit large amounts of unlabelled data!
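
The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact algorithm of Blum & Mitchell: each page has two views (words on the page, words in links pointing to it), each view gets a tiny naive-Bayes-style word-count classifier, and in every round each classifier hands its single most confident label to the other. All data and function names are hypothetical.

```python
import math
from collections import Counter

def train(examples):
    """Per-class word counts for a tiny naive Bayes style classifier."""
    counts = {}
    for words, label in examples:
        counts.setdefault(label, Counter()).update(words)
    return counts

def predict(counts, words):
    """Return (label, margin); a larger margin means a more confident guess."""
    vocab = {w for c in counts.values() for w in c}
    scores = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(vocab)  # add-one smoothing
        scores[label] = sum(math.log((c[w] + 1) / total) for w in words)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], scores[ranked[0]] - scores[ranked[1]]

def co_train(labelled, unlabelled, rounds=2):
    """labelled: ((content, links), label) pairs; unlabelled: (content, links)."""
    train_a = [(content, y) for (content, _), y in labelled]
    train_b = [(links, y) for (_, links), y in labelled]
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        clf_a, clf_b = train(train_a), train(train_b)
        # A labels the page it is most certain of; that label goes to B.
        pick = max(pool, key=lambda p: predict(clf_a, p[0])[1])
        train_b.append((pick[1], predict(clf_a, pick[0])[0]))
        pool.remove(pick)
        if not pool:
            break
        # And vice versa: B's most confident label goes to A.
        pick = max(pool, key=lambda p: predict(clf_b, p[1])[1])
        train_a.append((pick[0], predict(clf_b, pick[1])[0]))
        pool.remove(pick)
    return train(train_a), train(train_b)

# Hypothetical data: one labelled page per class, two unlabelled pages.
LABELLED = [
    ((["home", "page", "cs101", "syllabus"], ["cs101", "course"]), "course"),
    ((["my", "hobbies", "student", "cv"], ["student", "alice"]), "student"),
]
UNLABELLED = [
    (["syllabus", "exam", "lecture"], ["course", "cs202"]),
    (["cv", "hobbies", "thesis"], ["student", "bob"]),
]
clf_a, clf_b = co_train(LABELLED, UNLABELLED)
print(predict(clf_a, ["exam", "lecture"])[0], predict(clf_b, ["bob"])[0])
```

After co-training, the content classifier has picked up words such as "exam" and "lecture" that never appeared in the manually labelled pages.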

What the Semantic Web Can Do for Machine Learning

Will make mining the web much easier
Reason 1: removal of ambiguity
  More precise knowledge of what is meant by certain terms
Reason 2: structured vs. unstructured data
  Learning from structured data is much easier than from unstructured data
Reason 3: availability of background knowledge
  Can be used to make better decisions when learning

Removal of Ambiguity

Example: text document classification
  E.g., given a text, tell in which newsgroups it belongs
Typical approaches: “bag of words”
  Look only at which words occur in the text, and how often
  Each time a word occurs that occurs mainly in one particular class, increase the probability for that class
But words are ambiguous!
  Increased classification accuracy can be expected by removing ambiguity
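
A minimal sketch of the bag-of-words idea, and of how an ambiguous word hurts it, is given below. Each word simply votes for the class in which it was seen most often, which is a deliberately crude stand-in for a proper probabilistic classifier. The newsgroup names are used only as labels and the training texts are invented.

```python
from collections import Counter

# Hypothetical training texts for two newsgroups.
TRAINING = {
    "rec.autos": ["the jaguar has a fast engine", "new engine and wheels"],
    "sci.bio": ["the jaguar hunts at night", "cats hunt prey at night"],
}

def bag_of_words(text):
    """Word order is discarded; only word counts are kept."""
    return Counter(text.lower().split())

def class_word_counts(training):
    """Total word counts per class."""
    return {
        label: sum((bag_of_words(t) for t in texts), Counter())
        for label, texts in training.items()
    }

def classify(text, counts):
    """Each word votes for the class in which it was seen most often."""
    votes = Counter()
    for word, n in bag_of_words(text).items():
        seen = {label: c[word] for label, c in counts.items()}
        top = max(seen, key=seen.get)
        # Words seen equally often in several classes carry no evidence.
        if list(seen.values()).count(seen[top]) == 1:
            votes[top] += n
    return votes

counts = class_word_counts(TRAINING)
print(classify("jaguar engine", counts))
print(classify("jaguar", counts))
```

The word "jaguar" occurs equally often in both classes, so on its own it yields no vote at all. If each occurrence were annotated with its meaning (car or animal), the same word would become informative.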

Mining From (Un)structured Data

Mining data = intensively querying data
Answering a query is
  Easy in structured data
    Relational database, XML, ...
  Harder in semi-structured data (e.g., HTML)
  Hard in unstructured data
    Information extraction needed
    Could do this by learning a “wrapper”
    This involves one extra layer of learning
Relating this to our text example: taking into account function of words in text

Availability of Background Knowledge

Learning = finding relevant patterns in behaviour
Important to have the right context to describe these patterns
Example:
  Making interesting offers to clients
  “People who bought this book also bought ...”
  = “Instance-based” learning
    Estimate profile of user
    Find users with similar profile
    Look at behaviour of those users to help current user

Availability of Background Knowledge

Can work better if more background knowledge is available, e.g., type of book, author, ...
For instance, for books:
  “similar profile” = users that up till now bought same books as this user
    May not be many people
  “similar” = often bought books by same author
    Probably many more people, allows for more reasonable guess
  “similar” = often bought books of same genre (fiction, ...)
    May work even better
Ontologies (among others) provide such background knowledge
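
The effect of background knowledge on "similar profile" can be sketched as follows. The purchase histories are invented, the `AUTHOR_OF` table stands in for knowledge that an ontology might supply, and `similar_users` and `describe` are hypothetical names; similarity is reduced to "shares at least one feature" to keep the example short.

```python
# Hypothetical purchase histories.
PURCHASES = {
    "ann": {"Emma", "Persuasion"},
    "bob": {"Sense and Sensibility"},
    "cat": {"Dracula"},
}
# Stand-in for background knowledge, e.g. taken from an ontology.
AUTHOR_OF = {
    "Emma": "Austen",
    "Persuasion": "Austen",
    "Sense and Sensibility": "Austen",
    "Dracula": "Stoker",
}

def similar_users(user, purchases, describe=lambda book: book):
    """Users sharing at least one (described) purchase with `user`.

    describe maps a book to the feature used for comparison: the book
    itself by default, or e.g. its author via background knowledge.
    """
    mine = {describe(b) for b in purchases[user]}
    return sorted(
        other
        for other, books in purchases.items()
        if other != user and mine & {describe(b) for b in books}
    )

by_book = similar_users("bob", PURCHASES)
by_author = similar_users("bob", PURCHASES, describe=AUTHOR_OF.get)
print(by_book, by_author)
```

Comparing exact books finds nobody similar to "bob", while comparing authors finds "ann", whose purchases can then be used to make suggestions.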

Web Mining Revisited

Semantic Web will change
  Content mining
    Clearer view on contents and meaning of documents
  Structure mining
    More relevant structure
  Usage mining
    More relevant information on actions of user
Will in general improve intelligence of systems
  E.g., mail filter gets a better view of contents of mails

Promising Learning Techniques

Many different learning techniques exist
  Neural networks, support vector machines, instance-based learning, Bayesian learning, association rules, ...
Not all equally suitable for every task
  E.g., SVM for document classification works well
  E.g., instance-based learning: find other users with same profile as this user to make predictions
Intelligent agents will use a mix of them
Relational learners seem interesting
  Can handle explicit information on objects and relations between them
  Classic example: inductive logic programming

Inductive Logic Programming

Induces rules in first-order logic from examples or other rules
Such rules can be used to reason with
The reasoning can be explained
  Cf. example of mail program
Can use existing background knowledge
  “Knowledge-intensive learning”
  Currently: good background knowledge has to be engineered manually
  Will become more easily available with semantic web
  Example: mining in chemical domains

Mining in chemical domains

Example problem: relate activity of molecule to its properties
  Useful for, e.g., drug development
Which properties are important?
  Chemically relevant properties: functional groups, 3D structure, ... ?
  Has to be encoded manually
Ideally: get relevant information from some trustworthy data source as and when needed
Intelligent agents will exploit (“tap”) the common intelligence of the Web

Conclusions

Machine learning is a promising tool for the Semantic Web
  For building it
  For exploiting it
Clear synergy between Semantic Web efforts and Machine Learning efforts

Some References

Maedche, “A Machine Learning Perspective for the Semantic Web”, position paper www.semanticweb.org/SWWS/program/position/soi-maedche.pdf

Maedche & Staab (2001): Ontology Learning for the Semantic Web, IEEE Intelligent Systems 16(2)

Jenssen et al. (2001), Nature Genetics 28

Doan et al. (2001), ACM SIGMOD conf.

Kosala & Blockeel (2000), SIGKDD Explorations 2(1)

Mitchell (1996), Machine Learning
