μRaptor: A DOM-based system with appetite for hCard elements
Winner system of the Linked Data for Information Extraction Challenge 2014, LD4IE at ISWC
μRaptor is hungry
Training Phase
Clean the HTML
DOM sub-trees
CSS class co-occurrence
author
Value Constraints
CSS Selectors
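The slides only name the training steps, so the following is a minimal sketch of what the "DOM sub-trees" and "CSS class co-occurrence" steps could look like, not the authors' implementation. It assumes the HTML has already been cleaned (tags are balanced), that an hCard sub-tree is rooted at an element carrying the class `vcard`, and that co-occurrence means "two classes appear in the same sub-tree". The names `ClassCollector` and `cooccurrence` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from itertools import combinations

# HTML void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ClassCollector(HTMLParser):
    """Collects the set of CSS classes found in each sub-tree rooted at `root_class`."""

    def __init__(self, root_class="vcard"):
        super().__init__()
        self.root_class = root_class
        self.stack = []          # currently open (non-void) tags
        self.root_index = None   # stack depth at which the open sub-tree started
        self.subtrees = []       # one set of class names per sub-tree

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if self.root_index is None and self.root_class in classes:
            self.root_index = len(self.stack)
            self.subtrees.append(set())
        if self.root_index is not None:
            self.subtrees[-1].update(classes)
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        self.stack.pop()
        # Leaving the element that opened the sub-tree closes the sub-tree.
        if self.root_index is not None and len(self.stack) <= self.root_index:
            self.root_index = None


def cooccurrence(html_docs):
    """Counts how often each pair of CSS classes shares a sub-tree (assumed definition)."""
    pairs = Counter()
    for html in html_docs:
        parser = ClassCollector()
        parser.feed(html)
        parser.close()
        for classes in parser.subtrees:
            pairs.update(combinations(sorted(classes), 2))
    return pairs
```

Calling `cooccurrence(pages)` on a list of HTML strings returns a `Counter` keyed by alphabetically ordered class pairs such as `("author", "fn")`; frequent pairs would hint at classes (like `author` on the slide) that tend to accompany hCard markup.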
We could determine patterns for emails, for example:
vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . ALL_LOWERCASE
vcard:email mailto : ALPHA PUNCTUATION ALL_LOWERCASE . com
vcard:email mailto : ALPHA @ ALPHANUMERIC . ALL_LOWERCASE
vcard:email mailto : ALPHA @ ALPHANUMERIC . com
vcard:email mailto : ALL_UPPERCASE ****@ ALL_LOWERCASE . ALL_LOWERCASE
… or even for birthdays:
vcard:bday NUMBER - SMALL_NUMBER - SMALL_NUMBER
vcard:bday MEDIUM_NUMBER - SMALL_NUMBER - SMALL_NUMBER
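To make the value patterns concrete, here is a hedged sketch of how a property value could be abstracted into token classes like the ones on the slide. The class names (ALPHA, ALL_LOWERCASE, ALL_UPPERCASE, ALPHANUMERIC, NUMBER, SMALL_NUMBER, MEDIUM_NUMBER) come from the slide; their exact definitions do not, so the numeric thresholds (below 100 and below 1000) and the rule that punctuation stays literal are assumptions. The slide also shows a PUNCTUATION class, which suggests several abstraction levels; this sketch produces only one. `abstract`, `learn_constraints` and `keep_literals` are hypothetical names.

```python
import re
from collections import Counter, defaultdict

# A token is either a run of ASCII letters/digits or a single other non-space character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def classify(token):
    """Maps one token to a class name; thresholds are assumptions, not from the slides."""
    if token.isdigit():
        n = int(token)
        if n < 100:
            return "SMALL_NUMBER"
        if n < 1000:
            return "MEDIUM_NUMBER"
        return "NUMBER"
    if token.isalpha():
        if token.islower():
            return "ALL_LOWERCASE"
        if token.isupper():
            return "ALL_UPPERCASE"
        return "ALPHA"
    if token.isalnum():
        return "ALPHANUMERIC"
    return token  # punctuation such as ":", "@", "." or "-" is kept verbatim


def abstract(value, keep_literals=frozenset()):
    """Turns a raw value into a space-separated pattern of token classes."""
    tokens = TOKEN_RE.findall(value)
    return " ".join(t if t in keep_literals else classify(t) for t in tokens)


def learn_constraints(samples, keep_literals=frozenset()):
    """Counts patterns per property from (property, value) training pairs."""
    constraints = defaultdict(Counter)
    for prop, value in samples:
        constraints[prop][abstract(value, keep_literals)] += 1
    return constraints
```

Under these assumptions, `abstract("mailto:John@example42.com", keep_literals={"mailto", "com"})` yields `mailto : ALPHA @ ALPHANUMERIC . com`, and `abstract("1985-03-12")` yields `NUMBER - SMALL_NUMBER - SMALL_NUMBER`, which mirror two of the patterns listed on the slide.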
Extraction Phase
Clean the HTML
DOM sub-trees
CSS class co-occurrence
Value Constraints
Pattern Detection
Elements Qualification
Models Validation
CSS Selectors
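As a rough illustration of how the extraction steps might fit together, the sketch below walks a cleaned page, picks elements whose CSS class maps to a vCard property, and keeps only the values that pass a qualification check. Everything here is an assumption for illustration: learned selectors are reduced to plain class names, `href_properties` (properties read from a link's `href` rather than its text) is a hypothetical parameter, and `accept` stands in for whatever test "Elements Qualification" really applies.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PropertyExtractor(HTMLParser):
    """Emits (property, value) pairs for elements whose class maps to a property."""

    def __init__(self, class_to_property, href_properties=("vcard:email", "vcard:url")):
        super().__init__()
        self.class_to_property = class_to_property
        self.href_properties = set(href_properties)
        self.open = []    # stack of (property or None, collected text chunks)
        self.found = []   # extracted (property, value) candidates

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        prop = next((self.class_to_property[c] for c in classes
                     if c in self.class_to_property), None)
        if prop in self.href_properties and tag == "a" and attrs.get("href"):
            self.found.append((prop, attrs["href"]))
            prop = None  # value already taken from the link target
        if tag not in VOID:
            self.open.append((prop, []))

    def handle_data(self, data):
        # Text belongs to every open element that is collecting a property value.
        for prop, chunks in self.open:
            if prop:
                chunks.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self.open:
            return
        prop, chunks = self.open.pop()
        if prop:
            value = " ".join("".join(chunks).split())  # normalise whitespace
            if value:
                self.found.append((prop, value))


def extract(html, class_to_property, accept=lambda prop, value: True):
    """Returns the candidates that pass the (assumed) qualification callback."""
    parser = PropertyExtractor(class_to_property)
    parser.feed(html)
    parser.close()
    return [(p, v) for p, v in parser.found if accept(p, v)]
```

A qualification callback could, for instance, reject a `vcard:email` candidate whose value does not start with `mailto:`, or check the value against patterns learned during training.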
Extraction Phase
RDF Model From μRaptor compared with RDF Model Test set
[Figure: three scores, 0.94, 0.7 and 0.8; the metric labels were graphical and are not in the transcript]
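One plausible way to compare the two RDF models is to treat each as a set of triples and score the overlap. The sketch below does that with precision, recall and F1. Reading the three values on the slide (0.94, 0.7, 0.8) as precision, recall and F1 is an assumption, since the labels did not survive; the arithmetic is at least consistent with it, because 2 · 0.94 · 0.7 / (0.94 + 0.7) ≈ 0.80. The function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(extracted, gold):
    """Scores extracted triples against gold triples, both given as iterables of tuples."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)
```

Each triple can be a plain `(subject, predicate, object)` tuple of strings, for example `("page1#card", "vcard:bday", "1985-03-12")`; blank-node alignment, which a real RDF comparison would need, is left out of this sketch.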
μRaptor
https://github.com/emir-munoz/uraptor
We made the discovery of the new μRaptor species, and I am very pleased that some researchers helped us understand its feeding habits.
Godzilla is a doll compared to μRaptor! I am currently working on a script for an upcoming movie.
As a kid I always wanted to see an actual dinosaur. Today my dream comes true.
Damn, he is better than me!