Semantic SearchMonkey
Monkey with the Semantic Web
SearchMonkey
Presentation by:
Paul Tarjan, Chief Technical Monkey
Online at:
http://www.slideshare.net/ptarjan/semantic-searchmonkey
The web was / is fragmented
University event page
Friend’s website
Cool bookmarks
Super secret military site
Funny pictures
So we added search to find stuff
University event page
Friend’s website
Cool bookmarks
Super secret military site
Funny pictures
Google Yahoo
But there are many similar sites
Facebook Events Evite Events Upcoming Events
Youtube Metacafe Vimeo
Digg Reddit Technorati
Let’s treat these as “views” onto “objects”
Wouldn’t it be cool if you could do:
• object:video creator:"Paul Tarjan" length<=60s
• object:video creator:http://paulisageek.com/ length<=60s
• object:game name:"Desktop Tower Defense" version:1.5 publishdate:"May 2 2005"
• object:video author:"The Escapist" game:"Left 4 Dead"
It gets even cooler
Aggregation:
• object:review type:camera make:canon model:D40
• object:event date:"May 16, 2008" type:party price<$5
• object:photo person:"Paul Tarjan"
• object:photo person:http://paulisageek.com
The Semantic What?
• Web pages are views of data for people to read
• Search engines are a hack
  – They treat pages as a bucket of words
• Let’s turn the web into a database
• APIs are good, but there is no “web” of APIs
  – If you figure out a good way of doing that, let me know
Ok, I want to do it. Now what?
Recommendation: µF
• If there is a microformat for your data, use it
  – hcard
  – hreview
  – hresume
  – hcalendar
  – rel-tag
  – rel-license
  – xfn
  – hatom
  – geo
µF in a nutshell
• Change your @class to something that is known
• <div>
  – <span class="name">Paul Tarjan</span>
  – <span class="email">[email protected]</span>
• </div>
• BECOMES
• <div class="vcard">
  – <span class="fn">Paul Tarjan</span>
  – <span class="email">[email protected]</span>
• </div>
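Why the known class names matter: a generic parser can now find the data without knowing anything about your page. A minimal sketch, assuming only the two hCard properties shown above (fn and email); the HCardParser class and the paul@example.org address are made up for illustration, and this is nowhere near a full microformats parser.

```python
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    """Collect the text of fn/email elements found inside class="vcard"."""

    PROPERTIES = {"fn", "email"}  # assumption: only the two properties on the slide

    def __init__(self):
        super().__init__()
        self.cards = []   # one dict per vcard found
        self._stack = []  # per open element: "vcard", a property name, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "vcard" in classes:
            self.cards.append({})
            self._stack.append("vcard")
            return
        inside_card = "vcard" in self._stack
        prop = next((c for c in classes if c in self.PROPERTIES), None)
        self._stack.append(prop if inside_card else None)

    def handle_endtag(self, tag):
        # Sketch only: assumes every start tag has an end tag (no <br>, <img>).
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack and self._stack[-1] in self.PROPERTIES:
            card = self.cards[-1]
            prop = self._stack[-1]
            card[prop] = card.get(prop, "") + text


parser = HCardParser()
parser.feed(
    '<div class="vcard">'
    '<span class="fn">Paul Tarjan</span> '
    '<span class="email">paul@example.org</span>'
    "</div>"
)
print(parser.cards)
```

Feed it the "before" markup (class="name", no vcard) and it finds nothing, which is the whole point of agreeing on class names.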
Recommendation: RDFa
• If you have data that doesn’t really fit in a µF
• Examples:
  – Markup for APIs (YUI, Javadoc, etc.)
  – Media (audio, video, games, presentations)
  – Job postings
RDFa in a nutshell
• Make a namespace
• Use @property, @rel and @resource
• For DATA: @property makes the node contents into the value
• For URLs: @rel makes the @resource into the value
Normal HTML
• <html> …
<div class="private"> private static String <strong>_createCookieHash</strong> (hash) …
RDFa: example
• <html xmlns:yui="http://yuilibrary.com/rdf/1.0/yui.rdf#"> …
<div class="private" rel="yui:method" resource="#method__createCookieHash"> private static String <strong property="yui:name">_createCookieHash</strong> (hash) …
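To see what a crawler could pull out of the markup above, here is a rough sketch of the two rules from the nutshell slide: @rel plus @resource gives a URL-valued triple, and @property turns the node contents into a string value. TinyRDFa is a hypothetical name, "self" stands in for the page URL, and this handles only these three attributes; it is not a conforming RDFa processor.

```python
from html.parser import HTMLParser


class TinyRDFa(HTMLParser):
    """Extract (subject, predicate, object) triples from @rel/@resource/@property."""

    def __init__(self, base="self"):
        super().__init__()
        self.triples = []
        self._subjects = [base]  # subject in scope, one entry per open element
        self._pending = []       # @property waiting for text, one entry per open element

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        subject = self._subjects[-1]
        if "rel" in a and "resource" in a:
            # For URLs: @rel makes the @resource into the value.
            self.triples.append((subject, a["rel"], a["resource"]))
            subject = a["resource"]  # nested nodes now describe the resource
        self._subjects.append(subject)
        self._pending.append(a.get("property"))

    def handle_endtag(self, tag):
        # Sketch only: assumes well-nested markup with no void elements.
        if len(self._subjects) > 1:
            self._subjects.pop()
            self._pending.pop()

    def handle_data(self, data):
        prop = self._pending[-1] if self._pending else None
        text = data.strip()
        if prop and text:
            # For DATA: @property makes the node contents into the value.
            self.triples.append((self._subjects[-1], prop, text))


parser = TinyRDFa()
parser.feed(
    '<div class="private" rel="yui:method" resource="#method__createCookieHash">'
    'private static String <strong property="yui:name"> _createCookieHash </strong>'
    " (hash)</div>"
)
for triple in parser.triples:
    print(triple)
```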
That’s it!
• Automatically picked up by semantic parsers / crawlers
• Can build a SearchMonkey app on it
• Can make a mashup way easier than screen scraping
• Can get the data from Yahoo! BOSS
What is SearchMonkey?
An open platform for using structured data to build more useful and relevant search results
Before After
Enhanced Result (example: Zagat): image, links, key/value pairs or abstract
Infobar (example: Wikipedia preview): summary, blob
Part of the puzzle
SearchMonkey
Semantic markup on web pages
Semantic vocabularies
Vocabularies
• Need to speak the same language
• I like to see girls of that... caliber.
• English, French, Spanish, Esperanto?
• URLs to the rescue
  – Dublin Core (http://purl.org/dc/elements/1.1/)
  – Friend of a Friend (http://xmlns.com/foaf/0.1/)
  – XHTML Friends Network (http://gmpg.org/xfn/11/)
  – … (many more)
Syntax
• Nouns, Verbs, and Adjectives, oh my!
• All phrases become lots of triples
• (Subject, Verb / Adj. / Prep. / etc., Object)
• Key / Value pairs ++
  – Everything is a URL or String
  – Subject doesn’t have to be the document
Syntax 2
• Key / Value pair
  – Title = Awesome SearchMonkey Presentation
  – Homepage = http://search.yahoo.com/searchmonkey
• Triples
  – (self, http://purl.org/dc#title, “Awesome SearchMonkey Presentation”)
  – (self, http://vcard#url, http://search.yahoo.com/searchmonkey)
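The step from key/value pairs to triples is small enough to sketch: add a subject, and swap each key for a URL from a shared vocabulary. The to_triples helper is hypothetical, and the two predicate URLs are the abbreviated ones from the slide, not the full vocabulary URLs.

```python
def to_triples(subject, pairs, vocabulary):
    """Turn {key: value} pairs about one subject into (subject, predicate, object) triples."""
    return [(subject, vocabulary[key], value) for key, value in pairs.items()]


# Abbreviated predicate URLs copied from the slide (assumption: placeholders only).
VOCABULARY = {
    "Title": "http://purl.org/dc#title",
    "Homepage": "http://vcard#url",
}

pairs = {
    "Title": "Awesome SearchMonkey Presentation",
    "Homepage": "http://search.yahoo.com/searchmonkey",
}

for triple in to_triples("self", pairs, VOCABULARY):
    print(triple)
```

The "++" is that the subject is explicit, so the same list can hold facts about things other than the document.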
Decompose to triples
• My friend “Bob” is an idiot.
  – (self, http://xmlns.com/foaf/0.1/knows, genid:Ui__152310312_366)
  – (genid:Ui__152310312_366, http://www.w3.org/2001/vcard-rdf/3.0#fn, “Bob”)
  – (genid:Ui__152310312_366, http://example.org/ptarjan/isInstanceOf, http://example.org/ptarjan/idiot)
• Unnamed nodes are O.K.
Writing URLs takes a lot of work!
• xmlns:foaf=http://xmlns.com/foaf/0.1/
• xmlns:vcard=http://www.w3.org/2001/vcard-rdf/3.0#
• xmlns:junk=http://example.org/ptarjan/
• My friend “Bob” is an idiot.
  – (self, foaf:knows, genid:Ui__152310312_366)
  – (genid:Ui__152310312_366, vcard:fn, “Bob”)
  – (genid:Ui__152310312_366, junk:isInstanceOf, junk:idiot)
• Unnamed nodes are O.K.
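Going from the short form back to full URLs is plain string concatenation, which is why each namespace URL has to end in "/" or "#". A small sketch, assuming the three prefixes declared above; expand is a hypothetical helper, and anything with an unknown prefix (such as the genid: node) is left untouched.

```python
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "vcard": "http://www.w3.org/2001/vcard-rdf/3.0#",
    "junk": "http://example.org/ptarjan/",
}


def expand(term, prefixes=PREFIXES):
    """Expand prefix:local to a full URL; return the term unchanged otherwise."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return term


short = [
    ("self", "foaf:knows", "genid:Ui__152310312_366"),
    ("genid:Ui__152310312_366", "vcard:fn", "Bob"),
    ("genid:Ui__152310312_366", "junk:isInstanceOf", "junk:idiot"),
]

for subject, predicate, obj in short:
    # Sketch only: a real processor would not try to expand string literals.
    print((expand(subject), expand(predicate), expand(obj)))
```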
RDFa
• <html xmlns:foaf="http://xmlns.com/foaf/0.1/"
        xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#"
        xmlns:junk="http://example.org/ptarjan/">
    <div rel="foaf:knows">
      <span property="vcard:fn">Bob</span>
      <span rel="junk:isInstanceOf" resource="junk:idiot" />
    </div>
  </html>
• </SemanticWeb>
• Questions?
Innards of SearchMonkey
• You build a web service inside our framework
• When a search page renders
  – We check which SM apps are enabled
  – We call them
    • 50ms for in-page
    • Long time for AJAX
  – They return data in our template
  – We render them (and cache)
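The flow above, as a toy sketch: check which apps are enabled, call them, render what they return, cache the rendering. Everything here (SearchPage, render_result, demo_app, the string "template") is made up for illustration and says nothing about how the real framework is built; the 50ms and AJAX time budgets are not modelled.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical app signature: result URL in, key/value data out.
App = Callable[[str], Dict[str, str]]


class SearchPage:
    def __init__(self, apps: Dict[str, App], enabled: List[str]):
        self.apps = apps
        self.enabled = enabled
        self.cache: Dict[Tuple[str, str], str] = {}
        self.calls = 0  # how many times an app was actually invoked

    def render_result(self, url: str) -> List[str]:
        rendered = []
        for name in self.enabled:      # check which apps are enabled
            app = self.apps.get(name)
            if app is None:
                continue
            key = (name, url)
            if key not in self.cache:  # call the app only on a cache miss
                self.calls += 1
                data = app(url)
                # Stand-in for the template: a flat "key: value" line.
                self.cache[key] = ", ".join(f"{k}: {v}" for k, v in data.items())
            rendered.append(self.cache[key])
        return rendered


def demo_app(url: str) -> Dict[str, str]:
    return {"title": "Example", "link": url}


page = SearchPage({"demo": demo_app}, enabled=["demo"])
print(page.render_result("http://example.org/"))
print(page.render_result("http://example.org/"))  # served from the cache
```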
Prototyping with XSLT
• What if I don’t have structured data?
  – I don’t own the site
  – I do own the site, but I want to prototype first
• Build an XSLT custom data service first
  – Write some XSLT to extract the data and transform it into DataRSS
  – Mostly about finding the right XPath (use Firebug or XPather)
  – Quick to implement, but brittle
  – Can’t do a good Enhanced Result
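To get a feel for "find the right XPath" and for why it is brittle, here is the same idea in Python instead of XSLT. The page snippet, the XPATHS table and extract are all invented for this sketch, and the output is a plain dict, not DataRSS.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed page snippet (ElementTree needs valid XML).
PAGE = """<html><body>
<div id="listing">
  <h1 class="title">Desktop Tower Defense</h1>
  <a class="home" href="http://example.org/dtd">Play</a>
</div>
</body></html>"""

# The whole extractor hangs on these two expressions.
XPATHS = {
    "title": ".//h1[@class='title']",
    "link": ".//a[@class='home']",
}


def extract(xml_text):
    """Pull a title and a link out of the page, skipping whatever is not found."""
    root = ET.fromstring(xml_text)
    data = {}
    title = root.find(XPATHS["title"])
    if title is not None:
        data["title"] = (title.text or "").strip()
    link = root.find(XPATHS["link"])
    if link is not None:
        data["link"] = link.get("href")
    return data


print(extract(PAGE))
# Rename one class on the site and the data silently disappears:
print(extract(PAGE.replace('class="title"', 'class="heading"')))
```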
Do it for real
• Demo
Examples
• Rubik’s Cube
• VTA Bus
• API Monkey
• BugMeNot
• RetailMeNot
• Amazon
Questions?