geotagging web content a where - ibm...
TRANSCRIPT
WebWeb--aa--Where:Where:Geotagging Web ContentGeotagging Web Content
Einat AmitayNadav Har’El
Ron SivanAya Soffer
Feb 11, 2004
IBM ResearchIBM Research
Ronny LempelWebWeb--aa--WhereWhere
11 Feb 2004 Web-A-Where - Geotagging Web Content 2
What is the geography of a page?What is the geography of a page?
11 Feb 2004 Web-A-Where - Geotagging Web Content 3
What is the geography of a page?What is the geography of a page?
11 Feb 2004 Web-A-Where - Geotagging Web Content 4
What is the geography of a page?What is the geography of a page?
11 Feb 2004 Web-A-Where - Geotagging Web Content 5
What is the geography of a page?What is the geography of a page?
� Environment: �URL, domain�Owner�Server locale, IP address
11 Feb 2004 Web-A-Where - Geotagging Web Content 6
What is the geography of a page?What is the geography of a page?
11 Feb 2004 Web-A-Where - Geotagging Web Content 7
What is the geography of a page?What is the geography of a page?
<html> <head> <title>iGuide - Israeli Internet Guide</title> <meta name=keywords content="iGuide, israel, guide, classified, hebrew, news,
il, tel-aviv, jerusalem, haifa, internet, www, Neystadt, map, physical, logical, line, router, bandwidth, site, category">
<meta name=description content="Israeli Internet Guide and Search engine in English and Hebrew. Provides a updated complete and categorical lists of Israeli sites, maps of ISPs, news.">
</head> <body>
<html> <head> <title>iGuide - Israeli Internet Guide</title> <meta name=keywords content="iGuide, israel, guide, classified, hebrew, news,
il, tel-aviv, jerusalem, haifa, internet, www, Neystadt, map, physical, logical, line, router, bandwidth, site, category">
<meta name=description content="Israeli Internet Guide and Search engine in English and Hebrew. Provides a updated complete and categorical lists of Israeli sites, maps of ISPs, news.">
</head> <body>
11 Feb 2004 Web-A-Where - Geotagging Web Content 8
What is the geography of a page?What is the geography of a page?
� Environment: external to page�URL, domain�Owner�Server locale, IP address
� Content: text and metatext�Place names�Adjectives (Israeli), language (Hebrew), currency (NIS)�Link analysis
11 Feb 2004 Web-A-Where - Geotagging Web Content 9
What is the geography of a page?What is the geography of a page?
� Environment: external to page�URL, domain�Owner�Server locale, IP address
� Content: text and metatext�Place names�Adjectives (Israeli), language (Hebrew), currency (NIS)�Link analysis
11 Feb 2004 Web-A-Where - Geotagging Web Content 10
The processThe process
� Spotting�Takes names to spot from the gazetteer
� Disambiguate individual instances�Takes taxonomies from the gazetteer
� Find page foci
11 Feb 2004 Web-A-Where - Geotagging Web Content 11
The problemThe problem
� Find names of places�Must distinguish from non-geo
�Is “London” a place or a person? (e.g., Jack London)� Determine the meaning
�Must arbitrate between competing senses�Is “London” in England or Ontario, Canada?
� Finding the focus�must combine geotags meaningfully
11 Feb 2004 Web-A-Where - Geotagging Web Content 12
The gazetteerThe gazetteer
� Contents�Name, synonyms, abbreviations�Population, coordinates, is-capital
� Sources we used�GNIS – 2M features in the US�World-gazetteer – 50K cities world wide�UNSB, ISO-3166, USPS
� XML format�Imposes a hierarchical structure
<geographic-location level="city" label="San Jose"><population>816884</population><coordinates>
<latitude>37.33944</latitude><longitude>-121.89389</longitude>
</coordinates><names-to-spot>
<name>San Jose</name></names-to-spot><geographic-locations></geographic-locations>
</geographic-location>
11 Feb 2004 Web-A-Where - Geotagging Web Content 13
DisambiguationDisambiguation
� A bag of tricks�Qualified names�Default name sense�Single sense per discourse�Names with strong non-geo sense�Unifying locality
11 Feb 2004 Web-A-Where - Geotagging Web Content 14
Qualified namesQualified names
� Names qualified by their next higher taxonomy level�Haifa, Israel�San Jose, California, USA
�Definitely not the San Jose of Costa Rica, Guatemala, Bolivia, Peru or the Philippines
�Sacramento Calif.�“Rumsfeld and Wolfowitz lobbied Clinton in 1998 to start Iraq war and topple Saddam.” (Online Journal, Feb 20, 2003)
11 Feb 2004 Web-A-Where - Geotagging Web Content 15
Default senseDefault sense
� The place having the highest population�Portland, IN: 6,384�Portland, ME: 61,982�Portland, OR: 450,777�Portland, TN: 5,708�Portland, TX: 13,642�Portland, Victoria, Australia: 10,455_________________________________�Portland, OR
11 Feb 2004 Web-A-Where - Geotagging Web Content 16
Single sense per discourseSingle sense per discourse
� Single sense per discourse �A qualified instance induces its meaning on other,
unqualified instances in the same context (page)
11 Feb 2004 Web-A-Where - Geotagging Web Content 17
Single sense per discourseSingle sense per discourse
� Single sense per discourse �A qualified instance induces its meaning on other,
unqualified instances in the same context (page)� Can lead to difficulties
�As we shall soon see
11 Feb 2004 Web-A-Where - Geotagging Web Content 18
Names w/strong nonNames w/strong non--geo sensegeo sense
� Some place names are also common words�Mobile, Al or Reading, GB�Are ignored unless qualified
� Some names are also very very common words�To, Myanmar, Of, Turkey or As, Belgium
11 Feb 2004 Web-A-Where - Geotagging Web Content 19
As, Belgium
11 Feb 2004 Web-A-Where - Geotagging Web Content 20
Names w/strong nonNames w/strong non--geo sensegeo sense
� Some place names are also common words�Mobile, AL or Reading, GB�Are ignored unless qualified
� Some names are also very very common words�To, Myanmar, Of, Turkey or As, Belgium�Excluded altogether
11 Feb 2004 Web-A-Where - Geotagging Web Content 21
Names w/strong nonNames w/strong non--geo sensegeo sense
� Can be found from analyzing well-edited text�Count capitalized and non-capitalized occurrences�For most unambiguous place names –
�Frequency is related to importance (population)�Cap to non-cap ratio is high
�Names with strong non-geo sense stick out
11 Feb 2004 Web-A-Where - Geotagging Web Content 22
Names w/strong nonNames w/strong non--geo sensegeo sense
� Currently, a manual pass is required, however�False positive: Aspen, CO
�Appears frequently�Has only 5,000�Therefore included in the list�But most occurrences actually refer to it…
�False negative: Metro, Indonesia�Appears frequently�Has 100,000 inhabitants�Therefore excluded from the list�But most occurrences of “Metro” refer to Metro Area…
11 Feb 2004 Web-A-Where - Geotagging Web Content 23
Unifying localityUnifying locality
� Unifying locality�A taxonomy level which fits many of the remaining
low-confidence disambiguations�Within this locality, those names are unique
Wien/Austria/EuropeAlexandria/Egypt/AfricaAlexandria/Italy/EuropeAlexandria/Romania/EuropeAlexandria/Ukraine/EuropeAlexandria/Indiana/US/North AmericaAlexandria/Kentucky/US/North AmericaAlexandria/Louisiana/US/North AmericaAlexandria/Minnesota/US/North AmericaAlexandria/Virginia/US/North AmericaAlexandria/Brazil/South America
Vienna/Virginia/US/North AmericaVienna/West Virginia/US/North America
Alexandria/Egypt/AfricaAlexandria/Egypt/AfricaAlexandria/Italy/EuropeAlexandria/Romania/EuropeAlexandria/Ukraine/EuropeAlexandria/Indiana/US/North AmericaAlexandria/Kentucky/US/North AmericaAlexandria/Louisiana/US/North AmericaAlexandria/Minnesota/US/North AmericaAlexandria/Virginia/US/North AmericaAlexandria/Brazil/South America
Vienna/Virginia/US/North AmericaVienna/West Virginia/US/North America
11 Feb 2004 Web-A-Where - Geotagging Web Content 24
Individual spot disambiguationIndividual spot disambiguation
� Three collections of Web pages�Gov: pages from .gov (1.2M)�ODP: the Regional tab of the ODP directory (600K)�Arb: What Google returned for “+the +and” (1.2K)
� Randomly selected 200 pages from each� Ran the GeoMiner
�Medium-sized gazetteer� Tags given to spots were checked manually
11 Feb 2004 Web-A-Where - Geotagging Web Content 25
Results of individual spotsResults of individual spots
Gov ODP Arb
correct73%
geo/non-geo20%
geo/geo7%
correct63%
geo/non-geo21%
geo/geo16%
correct82%
geo/non-geo
11%
geo/geo7%
11 Feb 2004 Web-A-Where - Geotagging Web Content 26
Page focusPage focus
� Contributions from each disambiguated mention�Depends on confidence�Contribution to each hierarchy level
� Highest scoring levels selected as foci�If they exceed a threshold�Are not part of or include another focus
11 Feb 2004 Web-A-Where - Geotagging Web Content 27
Results of focus accuracyResults of focus accuracy
� Focus given to 55% of the pages�75% of the pages having at least 1 geotag
� Of the focused pages, 80% had but one focus
11 Feb 2004 Web-A-Where - Geotagging Web Content 28
too narrow
4%
too broad45%
precise39%
wrong12%
Results of focus accuracyResults of focus accuracy
0% 20% 40% 60%
1 2 13
0% 2% 4% 6% 8% 10% 12%
1 2 1 03
11 Feb 2004 Web-A-Where - Geotagging Web Content 29
Precision vs. gazetteer sizePrecision vs. gazetteer size
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
1000100001000001000000
Size of smallest cities in gazetteer
Num
ber o
f pag
es
pages w/spotspages w/focuscorrectmissing 1
11 Feb 2004 Web-A-Where - Geotagging Web Content 30
Search results Search results (Reuters corpus 1996)(Reuters corpus 1996)
All Hitech
11 Feb 2004 Web-A-Where - Geotagging Web Content 31
Search results Search results (pharmaceutical corpus)(pharmaceutical corpus)
AIDS SARS
11 Feb 2004 Web-A-Where - Geotagging Web Content 32
Future workFuture work
� Reduce geo/non-geo error rate�Ignore person names
� Use coordinate data�Newark in the context of Manhattan is in NJ, not NY
� Expand the gazetteer�Non-incorporated places�Counties, shires, cantons, etc.�Even smaller places
� Grammatical text analysis
11 Feb 2004 Web-A-Where - Geotagging Web Content 33
ConclusionsConclusions
� Web-a-where does a reasonable job of identifying place names in noisy Web pages�Some novel heuristics�Could do better…�…and run faster
� Performance tests �Not found in most publications (including MetaCarta)�Can help in designing other disambig algorithms
11 Feb 2004 Web-A-Where - Geotagging Web Content 34
My gourmet friend went to India because he heard there was a new deli there.
11 Feb 2004 Web-A-Where - Geotagging Web Content 35
What is the geography of a page?What is the geography of a page?
ronny@asgard=> whois -h whois.isoc.org.il
iguide.co.il
person: John Neystadtaddress: HaRav-Mimon 18/5address: Bat-Yamaddress: 59626address: Israelphone: +972 52 568-542fax-no: +972 3 752-6012e-mail: [email protected]
% Rights to the data above are restricted by copyright.
ronny@asgard=>
11 Feb 2004 Web-A-Where - Geotagging Web Content 36
What is the geography of a page?What is the geography of a page?
asgard=> whois -h whois.isoc.org.il iguide.co.il
person: John Neystadtaddress: HaRav-Mimon 18/5address: Bat-Yamaddress: 59626address: Israelphone: +972 52 568-542fax-no: +972 3 752-6012e-mail: [email protected]
% Rights to the data above are restricted by copyright.
asgard=>
ronny@asgard> nslookup www.iguide.co.ilName: stalker.iguide.co.ilAddress: 194.90.246.9Aliases: www.iguide.co.il
ronny@asgard>ronny@asgard> whois -h whois.ripe.net 194.90.246.9descr: NetVisioncountry: ILroute: 194.90.0.0/16descr: Netvisiondescr: Omega Bldg.descr: MATAM industrial parkdescr: Haifa 31905descr: Israelphone: +972 4 8560 600fax-no: +972 4 8551 132
ronny@asgard>
11 Feb 2004 Web-A-Where - Geotagging Web Content 37
What is the geography of a page?What is the geography of a page?
11 Feb 2004 Web-A-Where - Geotagging Web Content 38
What is the geography of a page?What is the geography of a page?
11 Feb 2004 Web-A-Where - Geotagging Web Content 39
What is the geography of a page?What is the geography of a page?