semweb install-fest presentation
Zemanta Getting Personal
Building upon the Zemanta API
Andraz Tori, [email protected]: andraz
Overview
General purpose
Functionality
Examples, demos & use-cases
What does it do?
A Stargate
Computer-processable data = human-understandable text
Initial design
Input: a chunk of text
Domain agnostic!
Avoid proprietary entity identifiers or taxonomies
Standard response formats: JSON, XML, RDF/XML
Cross-domain: we didn't start with the financial or health domain and then expand our algorithms; we started with cross-domain capabilities from day one.
What gives?
(slide labels: most used, most interesting, most obvious)
Tags
Categories
Concepts and entities
Related articles
Related images
Tags
Words, phrases
Interesting tags
Explicitly mentioned
What the text is about as a whole
What concepts were not mentioned, but could be relevant (for SEO)
Tags have no background meaning: they are not tied to any database and are not normalized in any way. They are what you would expect a human who does not care about standardization or normalization to choose. For example, a text mentioning Apple, Android and Google might get iPhone as a tag, and mobile web as a tag, even when neither was mentioned anywhere.
Categories
Deep hierarchy (100k categories)
Customized smaller taxonomies
Good for content organization, ad-targeting, etc
Categories example
Branded "unfilmable", Watchmen - the cult graphic novel about a group of retired, flawed superheroes - has finally made it to the big screen. From the second the opening credits roll, it is clear Watchmen is not your typical superhero movie.
An ageing vigilante, The Comedian, is attacked in his high-rise apartment before being hurled 10 storeys to his death... in graphic slow motion. What follows is a two-and-three-quarter hour epic that centres on an outlawed group of deeply flawed former heroes as a Cold War Doomsday clock inches ever closer to midnight and nuclear apocalypse.
First published in 12 parts by DC Comics in 1986, Watchmen was written by the British team of Alan Moore and illustrator Dave Gibbons.
Categories
Top/Society/History/By_Time_Period/Twentieth_Century/Cold_War (0.11)
Top/Arts/Comics/Reviews (0.10)
Top/Society/History/By_Time_Period (0.08)
Top/Arts/Comics (0.08)
Top/Society/History/By_Time_Period/Twentieth_Century (0.08)
Top/Society/History (0.08)
Top/Shopping/Publications/Books (0.08)
Top/Shopping/Publications/Books/Fiction (0.08)
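The scored category paths above lend themselves to simple post-processing. As a minimal sketch (the helper names `parse_categories` and `best_per_branch` are made up for illustration and are not part of the Zemanta API), the following parses the "path (score)" lines and keeps the best-scoring category under each second-level branch:

```python
# Category lines copied from the example above, in "path (score)" form.
RAW = """\
Top/Society/History/By_Time_Period/Twentieth_Century/Cold_War (0.11)
Top/Arts/Comics/Reviews (0.10)
Top/Society/History/By_Time_Period (0.08)
Top/Arts/Comics (0.08)
Top/Society/History/By_Time_Period/Twentieth_Century (0.08)
Top/Society/History (0.08)
Top/Shopping/Publications/Books (0.08)
Top/Shopping/Publications/Books/Fiction (0.08)
"""


def parse_categories(raw):
    """Turn 'path (score)' lines into (path_parts, score) tuples."""
    parsed = []
    for line in raw.splitlines():
        path, _, score = line.rpartition(" (")
        parsed.append((path.split("/"), float(score.rstrip(")"))))
    return parsed


def best_per_branch(categories, depth=2):
    """Keep the highest-scoring category under each branch of the given depth.

    On ties, the category listed first wins.
    """
    best = {}
    for parts, score in categories:
        branch = "/".join(parts[:depth])
        if branch not in best or score > best[branch][1]:
            best[branch] = ("/".join(parts), score)
    return best


categories = parse_categories(RAW)
best = best_per_branch(categories)
for branch, (path, score) in best.items():
    print(branch, "->", path, score)
```

Rolling scores up per branch like this is one way to map the deep hierarchy onto a smaller, customized taxonomy.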
Concepts and entities
Identify relevant concepts and entities
All disambiguated!
At least one URL for each concept, possibly more
Disambiguation is done using background knowledge. For example, we distinguish between London the city in the UK, London in Ohio or Texas, and Jack London the writer.
How we disambiguate
Use knowledge from Wikipedia, Freebase, Dmoz, third party databases...
Mine the web
Use knowledge from choices of our users
Use both semantic data and statistics based methods
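To make the idea of disambiguation with background knowledge concrete, here is a toy sketch. It is not Zemanta's algorithm: the candidate list, the context words and the scoring by simple word overlap are all assumptions made up for illustration.

```python
# Hypothetical background knowledge: candidate meanings of "London",
# each with a few context words that tend to appear near it.
CANDIDATES = {
    "London, UK": {"england", "thames", "uk", "city", "british"},
    "London, Ohio": {"ohio", "madison", "county", "city"},
    "Jack London": {"writer", "novel", "author", "klondike"},
}


def disambiguate(text, candidates):
    """Pick the candidate whose context words overlap most with the text."""
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    scores = {name: len(words & context) for name, context in candidates.items()}
    return max(scores, key=scores.get), scores


writer, writer_scores = disambiguate(
    "London wrote the novel White Fang after his Klondike trip.", CANDIDATES
)
city, city_scores = disambiguate(
    "The Thames flows through London, a British city.", CANDIDATES
)
print(writer, writer_scores)
print(city, city_scores)
```

A real system would combine far richer signals (link structure, statistics mined from the web, user choices), but the principle of scoring candidates against context is the same.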
Linking to...
Traditional
Semantic
How to build upon this
Step 1: We give you exact identifiers
Step 2: Then you look up the information about them (connections, images, ...) in your own or third-party databases
Step 3: ?
Step 4: Profit!
We are big fans of Freebase and the Linking Open Data project
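Step 2 could look roughly like the sketch below, which builds a lookup URL for a DBpedia identifier. It assumes the public DBpedia SPARQL endpoint at https://dbpedia.org/sparql and its `query` and `format` parameters; the helper name `build_lookup_url` is hypothetical. The code only constructs the URL and makes no network request.

```python
import urllib.parse

# Assumption: the public DBpedia SPARQL endpoint is reachable at this address.
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"


def build_lookup_url(resource_uri, limit=20):
    """Build a URL that asks for the properties and values of one resource."""
    query = "SELECT ?property ?value WHERE { <%s> ?property ?value } LIMIT %d" % (
        resource_uri,
        limit,
    )
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_SPARQL + "?" + urllib.parse.urlencode(params)


url = build_lookup_url("http://dbpedia.org/resource/LaGuardia_Airport")
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the stored facts about the entity, which is the raw material for whatever you build in steps 3 and 4.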
Discovery example
A US Airways Airbus A320 passenger plane carrying 135 people has crashed into the Hudson River in New York, the Federal Aviation Administration says.
Rescue boats and ferries are alongside the plane attempting to pick up people standing on both of the plane's wings.
The plane, which the FAA said was flight 1549 from LaGuardia Airport to Charlotte, is partially submerged.
It is not known how the plane came to land in the river, but the FAA said it might have been due to a bird strike.
You get
(the same text, with entities and concepts marked)
Or more precisely...
LaGuardia Airport
http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000018f654
http://dbpedia.org/resource/LaGuardia_Airport
Federal Aviation Administration
http://rdf.freebase.com/ns/guid/9202a8c04000641f8000000000017df0
http://dbpedia.org/resource/Federal_Aviation_Administration
Hudson River
http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000005ebb5
http://dbpedia.org/resource/Hudson_River
Airbus A320 family
http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000012f918
http://dbpedia.org/resource/Airbus_A320_family
Bird strike
http://rdf.freebase.com/ns/guid/9202a8c04000641f80000000004744df
http://dbpedia.org/resource/Bird_strike
US Airways
http://rdf.freebase.com/ns/guid/9202a8c04000641f80000000001b4dc5
http://dbpedia.org/resource/US_Airways
New York
http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000054dd5d
http://dbpedia.org/resource/New_York
Charlotte, North Carolina
http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000006e148
http://dbpedia.org/resource/Charlotte%2C_North_Carolina
Ferry
http://rdf.freebase.com/ns/guid/9202a8c04000641f8000000000063292
http://dbpedia.org/resource/Ferry
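Because each entity comes back with one identifier per knowledge base, a natural first step is to group the links by entity and source. The sketch below does that for two of the entities listed above; the `(name, url)` pair layout and the helper name `group_by_entity` are assumptions for illustration, not the API's response format.

```python
from urllib.parse import urlparse

# A few (entity name, identifier URL) pairs taken from the list above.
LINKS = [
    ("LaGuardia Airport",
     "http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000018f654"),
    ("LaGuardia Airport", "http://dbpedia.org/resource/LaGuardia_Airport"),
    ("Hudson River",
     "http://rdf.freebase.com/ns/guid/9202a8c04000641f800000000005ebb5"),
    ("Hudson River", "http://dbpedia.org/resource/Hudson_River"),
]


def group_by_entity(links):
    """Map each entity name to a {hostname: identifier URL} dictionary."""
    grouped = {}
    for name, url in links:
        source = urlparse(url).hostname
        grouped.setdefault(name, {})[source] = url
    return grouped


grouped = group_by_entity(LINKS)
for name, sources in grouped.items():
    print(name, sorted(sources))
```

With this shape, an application can pick whichever knowledge base it already integrates with, for example `grouped["Hudson River"]["dbpedia.org"]`.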
You can query relationships
http://test.infoblow.zemanta.com/infoblow/galaxy/
Or more complex ones...
Concepts and entities use cases
Quick 'overviews' of topics
Discovery-supporting user interfaces
Automatic deep information delivery (hovers, widgets)
Balloons example
Deliver deep information on exact concepts and entities
Fantastic public graph
Information about concepts/entities
Types: human, building, location...
Relationships with other entities
Hard data: dates, places, amounts
Connected Dream?
September 2008
Connected Dream?
July 2009
Opportunities in leveraging linked data
There are internal and external benefits of linking into a larger pool of exact data
Pulling together custom data becomes orders of magnitude easier
However, we still lack strong success stories
Related articles
20k blogs and media sites
You can provide your own list of feeds to recommend from
Or use our 'global whitelisted pool'
Related articles use cases
Better experience for the readers
Information discovery (for authors)
Creating interlinked mini-communities (example: bloggers using our tool to discover others in their niche)
Related images
From Wikipedia, Flickr, Daylife, Amazon, Last.fm, Snooth, social networks
We filter out totally unacceptable licenses and keep the rest
Each image has its license spelled out; the developer/author chooses
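Since every image carries its license, the developer can apply a stricter policy on top. A minimal sketch follows; the field names (`url`, `license`), the license strings and the helper `allowed_images` are hypothetical and do not reflect the actual response format.

```python
# Hypothetical image suggestions; field names and license strings are illustrative.
images = [
    {"url": "http://example.com/a.jpg", "license": "CC BY 2.0"},
    {"url": "http://example.com/b.jpg", "license": "All rights reserved"},
    {"url": "http://example.com/c.jpg", "license": "Public domain"},
]

# The policy is the developer's choice; exact string matching keeps it strict.
ACCEPTED_LICENSES = {"CC BY 2.0", "Public domain"}


def allowed_images(candidates, accepted):
    """Return only the images whose license is in the accepted set."""
    return [img for img in candidates if img["license"] in accepted]


usable = allowed_images(images, ACCEPTED_LICENSES)
for img in usable:
    print(img["url"], img["license"])
```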
Zemanta API
http://developer.zemanta.com
Examples in Java, Javascript, Python, Ruby, PHP, Perl, C#...
JavaScript SDK for quick custom CMS integration
Up to 10,000 requests/day free!
Ease of API use
# Python 3 version of the example
import json
import pprint
import urllib.parse
import urllib.request

# Request parameters for the zemanta.suggest method
args = {'format': 'json',
        'method': 'zemanta.suggest',
        'api_key': 'np9cbnby9x8tsc47recwuhqm',
        'return_categories': 'dmoz',
        'return_rdf_links': 1,
        'text': ''' Branded "unfilmable", Watchmen - the cult graphic novel about a group of retired, flawed superheroes - has finally made it to the big screen. From the second the opening credit An ageing vigilante, The Comedian, is attacked ...
'''}
# Form-encode the arguments; passing a body makes urlopen send a POST
args_enc = urllib.parse.urlencode(args).encode('utf-8')
response_raw = urllib.request.urlopen('http://api.zemanta.com/services/rest/0.0/', args_enc).read()
# Decode the JSON response and pretty-print it
response = json.loads(response_raw)
pprint.pprint(response)
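Once the response is decoded, the RDF links can be pulled out with a few lines. The sketch below works on a hand-written sample dictionary; the layout it assumes (a `markup` object holding `links`, each with an `anchor` and a list of `target` entries carrying `type` and `url`) is an assumption about the response shape and should be checked against the API documentation. The helper `rdf_links` is hypothetical.

```python
# Hand-written sample that imitates an ASSUMED response layout; not real API output.
sample_response = {
    "markup": {
        "links": [
            {
                "anchor": "LaGuardia Airport",
                "target": [
                    {"type": "rdf",
                     "url": "http://dbpedia.org/resource/LaGuardia_Airport"},
                    {"type": "wikipedia",
                     "url": "http://en.wikipedia.org/wiki/LaGuardia_Airport"},
                ],
            },
            {"anchor": "bird strike", "target": []},
        ]
    }
}


def rdf_links(response):
    """Collect, per anchor text, the target URLs whose type is 'rdf'."""
    result = {}
    for link in response.get("markup", {}).get("links", []):
        urls = [t["url"] for t in link.get("target", []) if t.get("type") == "rdf"]
        if urls:
            result[link["anchor"]] = urls
    return result


links = rdf_links(sample_response)
pprint_ready = sorted(links.items())
print(pprint_ready)
```

Using `.get` with defaults keeps the function tolerant of responses where a section is missing, which matters if the real layout differs from the one assumed here.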
Works for
All kinds of texts (not just financial or journalistic articles)
Tweets!
Wherever you need to go from text documents to something structured to put into your algorithm/data store
Some API users
How is the API used?
Place extraction and disambiguation used by Outside.in
Analysis of tweets used by Klout.net
Custom categorization used by Slideshare
Semantic tagging used by Faviki
CommonTag
Initiative by AdaptiveBlue, DERI (NUI Galway), Faviki, Freebase,
Yahoo!, Zemanta, and Zigtag
Exact tagging
RDFa as a transport layer
Freebase & LOD as vocabularies
Full-circle ecosystem from day one (publishers, services, better search, better browsing)
Zigtag, Faviki, AdaptiveBlue, Zemanta, Yahoo, Freebase
The next web
"... the next web will be like a great party host, introducing us to each other and bringing us together into meaningful conversation."
Marta Strickland, Organic
The future?
Zemify me up, Scotty!
Andraz [email protected]: andraz
Image attributions
http://www.flickr.com/photos/constanzavolare/2475833775/in/photostream/
CC by Constanza Volare