hacking dbpedia


Upload: forewer2000

Post on 26-Jan-2016


DESCRIPTION

Notes about sparql queries with dbpedia

TRANSCRIPT

HACKING DBPEDIA WITH SPARQL

BASICS

The general form of a SPARQL query is:

PREFIX ...
SELECT ...
FROM ...
WHERE {
  ... triple patterns ...
  ... filters ...
}
GROUP BY ...
HAVING ...
ORDER BY ...
LIMIT ...

An RDF database consists of a set of triples that are connected to each other. A triple has three elements, in this specific order:

SUBJECT -> PROPERTY -> OBJECT

In other words, a triple states that a subject has a property whose value is the object.

Examples: CAT IS BLACK, TOM (IS AN) ACTOR, PETER (IS A) NAME.

An RDF URL identifies a resource through which we can access every property of a subject. In DBpedia each subject and each property is an actual URL. Objects can be URLs or strings.

Example URL of a subject: http://dbpedia.org/resource/JavaScript

Opening this URL in a browser shows an HTML page listing the properties and values of this subject.

Counting all the triples in a specific RDF graph database can be done with:

SELECT (COUNT(*) AS ?triples) WHERE { ?subject ?property ?object . }

Here ?subject, ?property and ?object are variables, each matching any possible value in the RDF graph.

So there are four possible ways to select from an RDF graph:

With three variables: ?subject ?property ?object (filtering the result as needed)

With two variables

With one variable

With no variables, when we know every value. (In this case we probably just want to know whether the triple exists in the graph.)
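The four kinds of selection can be tried out on a toy graph. The following is a minimal Python sketch, not DBpedia itself: the triples and the `match` helper are made up for illustration, and `None` plays the role of a SPARQL variable.

```python
# Toy in-memory graph: each triple is (subject, property, object).
TRIPLES = {
    ("cat", "color", "black"),
    ("tom", "occupation", "actor"),
    ("tom", "name", "Tom"),
    ("peter", "name", "Peter"),
}


def match(subject=None, prop=None, obj=None):
    """Return the triples matching a pattern; None acts like a ?variable."""
    return sorted(
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (prop is None or t[1] == prop)
        and (obj is None or t[2] == obj)
    )


all_triples = match()                           # three variables
about_tom = match(subject="tom")                # two variables
tom_name = match(subject="tom", prop="name")    # one variable
exists = bool(match("cat", "color", "black"))   # no variables: existence check
```

The fewer variables a pattern has, the narrower the result; with no variables at all the only remaining question is whether the triple is present.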

DEFINING AND USING PREFIXES

For convenience, SPARQL lets us create an alias for a URL. This alias is called a prefix. DBpedia has a lot of predefined prefixes, listed here:

http://dbpedia.org/sparql?nsdecl

In SPARQL a prefix is created with the following syntax:

PREFIX alias: <url>

For example, the previous JavaScript subject URL can be given a prefix with the following definition:

PREFIX javascript: <http://dbpedia.org/resource/JavaScript>

Querying the JavaScript resource is normally done with this query:

SELECT ?js_label WHERE { <http://dbpedia.org/resource/JavaScript> rdfs:label ?js_label . }

Using the prefix, the same query can be written like this:

PREFIX javascript: <http://dbpedia.org/resource/JavaScript>
SELECT ?js_label WHERE { javascript: rdfs:label ?js_label . }
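A prefix is plain text substitution: the alias is replaced by its URL and the part after the colon is appended. A small Python sketch of that expansion, assuming a hypothetical `expand` helper and a two-entry prefix table (the `rdfs` namespace is the standard RDF Schema one):

```python
# Prefix table: alias -> URL. Only two entries, for illustration.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "javascript": "http://dbpedia.org/resource/JavaScript",
}


def expand(prefixed_name):
    """Turn 'alias:local' into a full URL using the prefix table."""
    alias, _, local = prefixed_name.partition(":")
    return PREFIXES[alias] + local


subject_url = expand("javascript:")   # empty local part, as in the query above
property_url = expand("rdfs:label")
```

This is why `javascript:` with nothing after the colon is a complete term: the alias already holds the whole subject URL.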

This query returns the label in every available language. If we want to see only the English version, we can add a filter:

PREFIX javascript: <http://dbpedia.org/resource/JavaScript>
SELECT ?js_label WHERE {
  javascript: rdfs:label ?js_label .
  FILTER (LANG(?js_label) = 'en')
}

The result is now only the English label. Notice the @en suffix attached to it, which indicates the language. If we don't want to display it, we can use the STR function to cast the result to a plain string.

PREFIX javascript: <http://dbpedia.org/resource/JavaScript>
SELECT (STR(?js_label) AS ?label) WHERE {
  javascript: rdfs:label ?js_label .
  FILTER (LANG(?js_label) = 'en')
}

I used the AS keyword to give the result column a custom name.
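To make the roles of LANG and STR concrete, here is a hedged Python sketch that models a label as a (text, language tag) pair. The sample labels are made up for illustration and do not come from DBpedia.

```python
# Toy stand-in for the rdfs:label values of one subject: (text, language tag).
LABELS = [
    ("JavaScript", "en"),
    ("JavaScript", "de"),
    ("JavaScript", "fr"),
]

# FILTER (LANG(?js_label) = 'en') keeps only the English pair;
# printed as a literal, it still carries its language tag.
tagged = ['"%s"@%s' % (text, lang) for text, lang in LABELS if lang == "en"]

# STR(?js_label) drops the tag and keeps only the text.
plain = [text for text, lang in LABELS if lang == "en"]
```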

QUERIES FOR EXACT TERMS

Finding subjects by label for a specific language

finding types and type labels for subjects

casting labels into strings

searching case insensitive labels

exclude some dbpedia paths like yago

Most of the time we don't know the subject, but we do know its label. In this case the subject can be found with the query:

SELECT ?subject WHERE { ?subject rdfs:label "JavaScript"@en . }

This finds every subject that has the English label JavaScript. Note that the language tag must follow the string, otherwise we won't get any result.
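To run such a query from a script instead of the web form, it can be sent as a `query` parameter in a GET request. The sketch below only builds the request URL and decodes it again; it does not contact the server. The endpoint address is taken from the prefix list URL above, while the `format` parameter and the `build_request_url` helper are assumptions made for this example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://dbpedia.org/sparql"  # same address as the ?nsdecl page

QUERY = """
SELECT ?subject WHERE {
  ?subject rdfs:label "JavaScript"@en .
}
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the query (hypothetical helper)."""
    params = {
        "query": query.strip(),
        # Assumption: the server accepts this result-format parameter.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


url = build_request_url(QUERY)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL could then be fetched with `urllib.request.urlopen`, which is left out here because it needs network access.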

Most subjects have one or more TYPE properties. For a given label, these types can be retrieved with a multi-line query. Multiple triple patterns are separated by a . (dot).

Listing all the types of the JavaScript subject can be done with:

SELECT ?subject ?types WHERE {
  ?subject rdfs:label "JavaScript"@en .
  ?subject rdf:type ?types .
}

When the same subject is used in two or more consecutive triple patterns, the query can be written in this short form:

SELECT ?subject ?types WHERE {
  ?subject rdfs:label "JavaScript"@en ;
           rdf:type ?types .
}

Note that the first pattern ends with a ; (semicolon) and the next line has only two terms. Its subject is the same as on the first line.

This query shows the links to each type. If we want to see the LABEL of these types, we must add two more lines to the query:

SELECT ?subject ?types (STR(?type_label) AS ?type_name) WHERE {
  ?subject rdfs:label "JavaScript"@en ;
           rdf:type ?types .
  ?types rdfs:label ?type_label .
  FILTER (LANG(?type_label) = 'en')
}

Most of the time we don't know the exact letter case of the term we want to find in DBpedia. We can convert the labels to lowercase and filter on that to find the subjects.

SELECT ?subject WHERE {
  ?subject rdfs:label ?subject_label .
  FILTER (LANG(?subject_label) = 'en')
  FILTER (LCASE(STR(?subject_label)) = 'javascript')
}

Sometimes we select some terms but want to exclude specific URLs from the result. The following query shows the types of the term Barcode. The result contains a lot of YAGO classes.

SELECT DISTINCT ?type ?label {
  <http://dbpedia.org/resource/Barcode> rdf:type ?type .
  <http://dbpedia.org/resource/Barcode> rdfs:label ?label .
  FILTER (LANG(?label) = 'en')
}

To exclude these YAGO classes, a filter should be added.

SELECT DISTINCT ?type ?label {
  <http://dbpedia.org/resource/Barcode> rdf:type ?type .
  <http://dbpedia.org/resource/Barcode> rdfs:label ?label .
  FILTER (LANG(?label) = 'en')
  FILTER (!STRSTARTS(STR(?type), "http://dbpedia.org/class/yago/"))
}
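The STRSTARTS filter is an ordinary string prefix test. A minimal Python sketch of the same idea, using a made-up list of type URLs (the YAGO entry is a hypothetical class name, not a real result):

```python
YAGO_PREFIX = "http://dbpedia.org/class/yago/"

# Made-up sample of type URLs, standing in for the ?type column.
type_urls = [
    "http://www.w3.org/2002/07/owl#Thing",
    "http://dbpedia.org/class/yago/ExampleClass",  # hypothetical YAGO class
    "http://dbpedia.org/ontology/Work",
]


def without_yago(urls):
    """Keep only the URLs that do not start with the YAGO prefix."""
    return [u for u in urls if not u.startswith(YAGO_PREFIX)]


kept = without_yago(type_urls)
```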

FINDING TERMS IN CATEGORIES

use of categories

use of recursive search on categories

DBpedia has a lot of categories, and sometimes we want to get the subjects in a specific category. Let's say we want to show every subject in the Occupations category. We can achieve that with:

SELECT ?subject { ?subject dcterms:subject category:Occupations . }

and, showing the labels instead of the URLs:

SELECT (STR(?subject_label) AS ?label) {
  ?subject dcterms:subject category:Occupations ;
           rdfs:label ?subject_label .
}

We can observe that this query returns a very small list of occupations. This is because DBpedia has a lot of subcategories for occupations, and not every subject is linked directly to category:Occupations; most of them reach it through a path of other nodes. If we want more occupations listed, we can specify a recursive selection of the subcategories:

SELECT (STR(?subject_label) AS ?subject_name) (STR(?category_label) AS ?category_name) {
  ?subject dcterms:subject ?category ;
           rdfs:label ?subject_label .
  ?category skos:broader{,1} category:Occupations ;
            rdfs:label ?category_label .
  FILTER (LANG(?category_label) = 'en')
}

This lists every term related to category:Occupations directly or through one level of depth, as specified by skos:broader{,1}. Unlimited depth can be selected with skos:broader+, but that will most probably be very slow.
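The effect of the depth limit can be illustrated with a breadth-first walk over a toy category tree. This is only a sketch of the idea: the category names and the `narrower_categories` helper are made up, and nothing here claims to show how the server evaluates the path internally.

```python
from collections import deque

# Toy category tree: child -> parent ("broader") links. Names are made up.
BROADER = {
    "Occupations_by_type": "Occupations",
    "Legal_professions": "Occupations",
    "Judges": "Legal_professions",
    "Chief_justices": "Judges",
}


def narrower_categories(root, max_depth=None):
    """Map each category under root to its depth; depth 0 is root itself.

    max_depth=1 mirrors skos:broader{,1}; max_depth=None walks the whole tree.
    """
    found = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        depth = found[current]
        if max_depth is not None and depth >= max_depth:
            continue  # do not descend below the depth limit
        for child, parent in BROADER.items():
            if parent == current and child not in found:
                found[child] = depth + 1
                queue.append(child)
    return found


one_level = narrower_categories("Occupations", max_depth=1)
all_levels = narrower_categories("Occupations")
```

With the limit, only the category itself and its direct subcategories are visited; without it, the walk touches every node below the root, which is why the unlimited form gets slow on a large category graph.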

SEARCHING TERMS IN ONTOLOGIES

limiting search results

searching in ontologies

using regexp filter on results

using fulltext search

What about listing the first 1000 universities from DBpedia?

PREFIX univ_ontology: <http://dbpedia.org/ontology/University>
SELECT (STR(?university_label) AS ?university) {
  ?univ_subject rdf:type univ_ontology: ;
                rdfs:label ?university_label .
  FILTER (LANG(?university_label) = 'en')
}
LIMIT 1000

Most of the time we know only part of the label, and we want to check whether it belongs to a term of some kind. For instance, let's say we have the term Petru Maior and we want to check whether it is part of the name of a university. One solution is to use a regexp filter with this string on the result.

PREFIX univ_ontology: <http://dbpedia.org/ontology/University>
SELECT (STR(?university_label) AS ?university) {
  ?univ_subject rdf:type univ_ontology: ;
                rdfs:label ?university_label .
  FILTER (LANG(?university_label) = 'en')
  FILTER (REGEX(?university_label, "Petru Maior", "i"))
}

Notice the i parameter of REGEX, which means case insensitive. The result is Petru Maior University of Târgu Mureș, which is good, but notice that the search was very slow. For better performance we can use a fulltext search.
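The regexp filter tests every label in turn, which is one reason it is slow on a large set. A small Python sketch of the same case-insensitive substring test, run over a made-up list of labels:

```python
import re

# Made-up sample of university labels, standing in for ?university_label.
LABELS = [
    "Petru Maior University of Târgu Mureș",
    "University of Bucharest",
    "Babeș-Bolyai University",
]

# re.IGNORECASE plays the role of the "i" flag of REGEX.
pattern = re.compile("petru maior", re.IGNORECASE)

# search() matches anywhere in the string, like REGEX without anchors.
matches = [label for label in LABELS if pattern.search(label)]
```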

PREFIX univ_ontology: <http://dbpedia.org/ontology/University>
SELECT (STR(?university_label) AS ?university) {
  ?univ_subject rdf:type univ_ontology: ;
                rdfs:label ?university_label .
  ?university_label bif:contains "'Petru Maior'" .
  FILTER (LANG(?university_label) = 'en')
}

Instead of applying a regexp to the result, we used ?university_label bif:contains "'Petru Maior'", which does a fulltext search on each university label. Notice the dramatic speed increase, with the same result.

DBPEDIA DISAMBIGUATES

getting the types of subjects (kind of tagging)

working with disambiguated terms.

Using UNION

Sometimes a subject has more meanings than the one we get as a first result from DBpedia. For example, let's query the types of the label Apache.

SELECT (STR(?type_label) AS ?type_name) WHERE {
  ?subject rdfs:label "Apache"@en ;
           rdf:type ?types .
  ?types rdfs:label ?type_label .
  FILTER (LANG(?type_label) = 'en')
}

As a result we get concept and ethnic group. But we know that Apache can also be an HTTP server. How can we include that in the result as well? This is where we can use the DBpedia disambiguation feature. First we need to find the disambiguation URL for this term. This is done with:

SELECT DISTINCT ?subject_disambiguation_url {
  ?subject rdfs:label "Apache"@en ;
           rdf:type ?subject_type .
  ?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject .
}

Now that we have this URL, we can use it to select the subjects that share the same disambiguation URL.

SELECT DISTINCT ?disamb_subjects {
  ?subject rdfs:label "Apache"@en ;
           rdf:type ?subject_type .
  ?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject .
  ?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?disamb_subjects .
}

We see that there are several, and the one we need is among them:

http://dbpedia.org/resource/Apache_HTTP_Server
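The query above is a two-step join: first find the disambiguation pages that point at the subject, then collect everything those pages point at. A Python sketch of that join over made-up data (the page and resource names are illustrative, and `siblings` is a hypothetical helper):

```python
# Toy (disambiguation page, target) pairs, standing in for
# dbpedia-owl:wikiPageDisambiguates triples. Made up for illustration.
DISAMBIGUATES = [
    ("Apache_(disambiguation)", "Apache"),
    ("Apache_(disambiguation)", "Apache_HTTP_Server"),
    ("Apache_(disambiguation)", "Boeing_AH-64_Apache"),
    ("Java_(disambiguation)", "Java"),
]


def siblings(subject):
    """Return every target listed on a disambiguation page that lists subject."""
    # Step 1: the pages that disambiguate the subject.
    pages = {page for page, target in DISAMBIGUATES if target == subject}
    # Step 2: everything those same pages disambiguate.
    return sorted({target for page, target in DISAMBIGUATES if page in pages})


apache_meanings = siblings("Apache")
```

As in the SPARQL version, the starting subject is part of the result, because its own disambiguation page lists it too.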

Now we can list every type of each of these disambiguated subjects related to Apache.

SELECT DISTINCT ?disamb_subjects_types {
  ?subject rdfs:label "Apache"@en ;
           rdf:type ?subject_type .
  ?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject .
  ?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?disamb_subjects .
  ?disamb_subjects rdf:type ?disamb_subjects_types .
}

And of course we can show only the labels of these types, in English.

SELECT DISTINCT (STR(?disamb_subjects_labels) AS ?type_name) {
  ?subject rdfs:label "Apache"@en ;
           rdf:type ?subject_type .
  ?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject .
  ?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?disamb_subjects .
  ?disamb_subjects rdf:type ?disamb_subjects_types .
  ?disamb_subjects_types rdfs:label ?disamb_subjects_labels .
  FILTER (LANG(?disamb_subjects_labels) = 'en')
}

What if we have multiple terms and want to show the disambiguations for each of them in one query? For this we can use UNION in this way:

SELECT DISTINCT ?subject (STR(?disamb_subjects_labels) AS ?type_name) {
  { ?subject rdfs:label "Apache"@en . } UNION { ?subject rdfs:label "Java"@en . }
  ?subject rdf:type ?subject_type .
  ?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?subject .
  ?subject_disambiguation_url dbpedia-owl:wikiPageDisambiguates ?disamb_subjects .
  ?disamb_subjects rdf:type ?disamb_subjects_types .
  ?disamb_subjects_types rdfs:label ?disamb_subjects_labels .
  FILTER (LANG(?disamb_subjects_labels) = 'en')
}

We used curly braces to group the two terms and combine them.

{ ?subject rdfs:label "Apache"@en . } UNION { ?subject rdfs:label "Java"@en . }

The result shows all the disambiguations for Apache and then for Java. Of course, the subject URLs can be turned into labels by adding another line, ?subject rdfs:label ?subject_label, and selecting that instead of the subject.

MULTIPLE PURPOSE CHECKING

how to check to which specific domains a term is related to

This example shows how we can use SELECT and UNION to check whether a term is related to a PROFESSION, CITY, TOWN, COUNTRY or EDUCATIONAL INSTITUTION.

SELECT (COUNT(?profession) AS ?is_profession)
       (COUNT(?city) AS ?is_city)
       (COUNT(?town) AS ?is_town)
       (COUNT(?country) AS ?is_country)
       (COUNT(?edu_inst) AS ?is_edu_inst)
WHERE {
  { ?profession rdfs:label "Sovata"@en .
    FILTER (EXISTS { ?profession dbpprop:type dbpedia:Profession }) }
  UNION
  { ?city rdfs:label "Sovata"@en .
    FILTER (EXISTS { ?city rdf:type dbpedia-owl:City }) }
  UNION
  { ?country rdfs:label "Sovata"@en .
    FILTER (EXISTS { ?country rdf:type dbpedia-owl:Country }) }
  UNION
  { ?town rdfs:label "Sovata"@en .
    FILTER (EXISTS { ?town rdf:type dbpedia-owl:Town }) }
  UNION
  { ?edu_inst rdfs:label ?lab .
    FILTER (LANG(?lab) = 'en')
    FILTER (REGEX(?lab, "Petru Maior", 'i'))
    FILTER (EXISTS { ?edu_inst rdf:type dbpedia-owl:EducationalInstitution }) }
}
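The result of this query is one row with a count per domain, where a non-zero count means the term belongs to that domain. The same shape of answer can be sketched in Python over made-up data. The type sets below are illustrative assumptions, not DBpedia results, and the sketch returns 0 or 1 per domain rather than a count of matching subjects.

```python
# Toy data: label -> set of type names. Made up for illustration.
TYPES_BY_LABEL = {
    "Sovata": {"Town", "Settlement"},
    "Actor": {"Profession"},
}

DOMAINS = ["Profession", "City", "Town", "Country", "EducationalInstitution"]


def classify(label):
    """Return a 0/1 flag per domain for the given label."""
    types = TYPES_BY_LABEL.get(label, set())
    return {domain: int(domain in types) for domain in DOMAINS}


sovata_flags = classify("Sovata")
```

Each branch of the UNION corresponds to one key of the dictionary: a branch that matches nothing contributes a zero to its column.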