Web-a-Where: Geotagging Web Content

Einat Amitay, Nadav Har’El, Ron Sivan and Aya Soffer

IBM Haifa Research Lab, Haifa, Israel (SIGIR 2004, pp. 273-280)

2/25

Abstract

Web-a-Where is a system for associating geography with Web pages.

It locates mentions of places and determines the place each name refers to.

It assigns to each page a geographic focus: a locality that the page discusses as a whole.

The tagger achieves a precision of up to 82% on individual geotags.

91% of the foci reported are correct up to the country level.

3/25

Introduction (1/3)

A page may have two types of geography associated with it: a source and a target. For example, a news article about Northern Ireland (the target) may appear on the CNN site, which is based in the USA (the source).

Two types of ambiguity arise:

geo/non-geo: a place name that is also a person's name (e.g., Berlin) or a common word (e.g., Turkey).

geo/geo: distinct places that share the same name, e.g., London, England vs. London, Ontario.

4/25

Introduction (2/3)

In the USA, there are 18 cities named Jerusalem, 24 named Paris and 63 Springfields in 34 states.

On Web pages, 37% of place-name mentions have several possible geographic meanings, and the average number of possible meanings per mention is roughly 2.

5/25

Introduction (3/3)

There are two approaches to geographic name disambiguation:

The NLP approach employs machine learning to recognize names from their structure and context, but it is too complex to run fast on the tremendous amount of data on the Web.

The gazetteer approach recognizes names by looking them up in a glossary of place names.

Web-a-Where employs the gazetteer approach. Its goal is to identify all geographic mentions in Web pages, assign a geographic location and confidence level to each, and derive a focus (or foci) for the entire page.

6/25

Tagging Individual Place Names

The geotagger finds and disambiguates geographic names, currently those of cities, states and countries. Each name is resolved to a taxonomy node, e.g., Paris/France/Europe.

The list of geographic names, their canonical taxonomy nodes and other pertinent information is kept in a database known as a gazetteer.

A page is processed in three phases: spotting, disambiguation, and focus determination.

7/25

The Gazetteer (1/3)

The gazetteer contains a hierarchical view of the world, divided into continents, countries, states (for some countries) and cities.

Each place is associated with a canonical taxonomy node, a number of names and/or abbreviations, world coordinates and a population estimate.

Nearly 40,000 places are listed, together with alternative spellings and abbreviations, for a total of about 75,000 names: most in English and some in the vernacular.
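A gazetteer entry of the kind described above could be modelled roughly as follows. This is only a sketch: the `Place` class, its field names, and the sample values are hypothetical and are not the schema used by the authors.

```python
from dataclasses import dataclass, field


@dataclass
class Place:
    """One gazetteer entry (hypothetical layout, for illustration only)."""
    taxonomy: str                               # canonical taxonomy node
    names: list = field(default_factory=list)   # names and abbreviations
    latitude: float = 0.0                       # world coordinates
    longitude: float = 0.0
    population: int = 0                         # population estimate

    def enclosing_regions(self):
        """Return the enclosing taxonomy nodes, nearest region first."""
        parts = self.taxonomy.split("/")
        return ["/".join(parts[i:]) for i in range(1, len(parts))]


# Coordinates and population are rough placeholders, not data from the paper.
paris = Place("Paris/France/Europe", ["Paris"], 48.86, 2.35, 2_000_000)
print(paris.enclosing_regions())  # ['France/Europe', 'Europe']
```

Storing the taxonomy node as a slash-separated path keeps the enclosing regions (France/Europe, then Europe) derivable from the entry itself.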

8/25

The Gazetteer (2/3)

The gazetteer is created automatically from a number of freely available data sources.

The gazetteer also contains a non-geo section listing place names that are very commonly used words as well, e.g., “To” (Myanmar) or “Of” (Turkey).

9/25

The Gazetteer (3/3)

Two tests decide which names go into the non-geo section:

Names that appeared more than 100 times, but in most cases were not capitalized as a name should be, were included. E.g., “Asbestos” (Quebec) and “Humble” (Texas).

Names mentioned much more frequently than their population would warrant were also included. E.g., “Grove” (Spain) (10,976) and “Atlantic” (Iowa) (7,474). Their high frequency did not match their small populations.

A manual pass is still required, e.g., to remove “Aspen” from the list and to add “Metro”.
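The two tests can be sketched as simple predicates. The function names are hypothetical, and the `max_per_capita` cutoff in the second test is an assumption: the slides only say "much more frequently than their population would warrant" and give no number. "In most cases" is read here as "more than half".

```python
def fails_capitalization_test(occurrences, capitalized, min_occurrences=100):
    """Test 1: seen more than 100 times, but mostly not capitalized."""
    return occurrences > min_occurrences and capitalized < occurrences / 2


def fails_frequency_test(occurrences, population, max_per_capita=0.01):
    """Test 2: far more occurrences than the population would warrant.

    max_per_capita is an assumed cutoff; the source gives no threshold.
    """
    return occurrences > population * max_per_capita


def is_non_geo(occurrences, capitalized, population):
    """A name goes to the non-geo section if it fails either test."""
    return (fails_capitalization_test(occurrences, capitalized)
            or fails_frequency_test(occurrences, population))


# Invented counts, for illustration only.
print(is_non_geo(occurrences=500, capitalized=40, population=50_000))  # True
```

The output of such tests would still go through the manual pass mentioned above.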

10/25

Spotting Place Name Candidates

The goal is to find (or spot) all the possible geographic names in each page.

Names are matched case-insensitively.

The list of words to spot is the list of all names in the gazetteer.

Short abbreviations are not spotted, since they are often too ambiguous, such as IN (Indiana or India, but also a common English preposition) or AT (for Austria).
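A minimal spotter along these lines might look as follows. The `spot` function and the `min_length` cutoff of 3 characters are assumptions made for illustration; the slides do not say how short an abbreviation must be to be skipped, and the real system is unlikely to use one large regular expression.

```python
import re


def spot(text, gazetteer_names, min_length=3):
    """Return (start, end, matched_text) for each gazetteer name found in text."""
    # Skip short abbreviations such as "IN" or "AT": too ambiguous to spot.
    names = [n for n in gazetteer_names if len(n) >= min_length]
    if not names:
        return []
    # Try longer names first so "New York" wins over "York".
    names.sort(key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,  # case-insensitive matching
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


spots = spot("She flew from london to New York in May.",
             ["London", "New York", "York", "IN"])
print([s[2] for s in spots])  # ['london', 'New York']
```

Every spot is only a candidate at this stage; deciding whether it is a place at all, and which place, is left to disambiguation.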

11/25

Disambiguating Spots (1/2)

If the tokens in the vicinity of a spot can uniquely qualify it, as in “IL” immediately following a spot of “Chicago”, the geotagger assigns this meaning to that spot with a confidence in the range of 0.95 to 1. Otherwise, the spot is left unassigned for the moment.

Each unresolved spot is then assigned its default meaning, the candidate with the largest population, with a confidence of 0.5.

If the page has multiple spots of the same name and only one of them is qualified, the meaning of the qualified spot is delegated to the others, with a confidence in the range of 0.8 to 0.9.
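The three steps above can be sketched as follows. The `disambiguate` function, the input layout, and the exact confidence values (0.95, 0.5 and 0.8, picked from within the stated ranges) are assumptions for illustration. Qualifiers are written as full region names rather than abbreviations to keep the sketch short, and the populations are rough placeholders.

```python
def disambiguate(spots, candidates):
    """spots: list of (name, qualifier or None).
    candidates: name -> list of (taxonomy_node, population)."""
    results = []
    qualified = {}  # name -> set of meanings fixed by a qualified spot
    for name, qualifier in spots:
        options = candidates[name]
        # Step 1: a nearby qualifier that selects exactly one candidate.
        matches = [node for node, _ in options
                   if qualifier is not None and qualifier in node.split("/")[1:]]
        if len(matches) == 1:
            results.append([name, matches[0], 0.95])
            qualified.setdefault(name, set()).add(matches[0])
        else:
            # Step 2: default meaning is the most populous candidate.
            default = max(options, key=lambda option: option[1])[0]
            results.append([name, default, 0.5])
    # Step 3: delegate a single qualified meaning to same-name spots.
    for entry in results:
        meanings = qualified.get(entry[0], set())
        if entry[2] < 0.8 and len(meanings) == 1:
            entry[1], entry[2] = next(iter(meanings)), 0.8
    return [tuple(entry) for entry in results]


candidates = {"London": [("London/England/UK", 7_000_000),
                         ("London/Ontario/Canada", 350_000)]}
print(disambiguate([("London", "Ontario"), ("London", None)], candidates))
```

With no qualified spot on the page, the same call falls back to London/England/UK at confidence 0.5.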

12/25

Disambiguating Spots (2/2)

A disambiguating context is now sought for the spots that are still unresolved (confidence < 0.7).

Suppose “London” and “Hamilton” appear in the same page without further qualifications. The candidates for “London” include “England, UK” and “Ontario, Canada”, while “Hamilton” exists in “Ohio, USA”, in “Ontario, Canada” and in “New Zealand”. The smallest disambiguating context is “Ontario, Canada”.

The resulting confidence is between 0.65 and 0.75, depending on whether the assigned meaning matches the spot’s default meaning.
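One way to find such a context is to intersect the enclosing regions of all unresolved names and keep the most specific one. This is a simplified reading of the heuristic, not the authors' implementation: the function names are hypothetical, and it requires a single region shared by every unresolved name.

```python
def enclosing(taxonomy_node):
    """Enclosing regions of a node, e.g. 'A/B/C' -> ['B/C', 'C']."""
    parts = taxonomy_node.split("/")
    return ["/".join(parts[i:]) for i in range(1, len(parts))]


def smallest_common_context(candidates):
    """candidates: name -> list of taxonomy nodes.
    Returns the most specific region shared by all names, or None."""
    region_sets = []
    for nodes in candidates.values():
        regions = set()
        for node in nodes:
            regions.update(enclosing(node))
        region_sets.append(regions)
    if not region_sets:
        return None
    common = set.intersection(*region_sets)
    if not common:
        return None
    # More path components means a smaller region; sort first for determinism.
    return max(sorted(common), key=lambda region: region.count("/"))


unresolved = {
    "London": ["London/England/UK", "London/Ontario/Canada"],
    "Hamilton": ["Hamilton/Ohio/USA", "Hamilton/Ontario/Canada",
                 "Hamilton/New Zealand"],
}
print(smallest_common_context(unresolved))  # Ontario/Canada
```

Both Canada and Ontario/Canada are shared by the two names; choosing the deeper node gives the smallest context.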

13/25

Determining Page Focus

Once the correct meaning of every geographic name mentioned in the page has been determined, we would also like a way to decide which geographic mentions are incidental and which constitute the actual focus of the page.

14/25

Rationale of Focus Algorithm

The basic idea is that if several cities from the same region are mentioned, this might mean that the region is the focus.

E.g., San Francisco (Calif.), Los Angeles (Calif.) and San Diego (Calif.) => California.

E.g., San Jose (Calif.), Chicago (Ill.) => US.

Sometimes a page has many foci, so the algorithm tries to coalesce many places into one region before declaring foci.

If a small region is the real focus, a larger region should not be reported.

15/25

Outline of Focus Algorithm (1/3)

For each mention, add a score to its taxonomy node (e.g., Paris/France/Europe), and add lower scores to the enclosing hierarchy levels (France/Europe and Europe).

The region that the focus-finding algorithm reports is a taxonomy node from the gazetteer.

The focus is limited to being a city, state, country or continent.

16/25

Outline of Focus Algorithm (2/3)

Suppose a page contains 4 mentions of Orlando/Florida (confidence .5), 3 of Texas (.75), 8 of Fort Worth/Texas (.75), 3 of Dallas/Texas (.75), 1 of Garland/Texas (.75), and 1 of Iraq (.5).

A human reader would say the page is about Texas and perhaps also Orlando.

Indeed, that page comes from the “Orlando Weekly” site, in a forum titled “Just a look at the Texas Local Music Scene”.

17/25

Outline of Focus Algorithm (3/3)

Texas gets the top score because several separate cities (Fort Worth, Dallas and Garland) contribute to it, so it is chosen as a focus.

The US already covers Texas, so it is dropped.

Fort Worth is covered by Texas, so it is dropped.

Orlando/Florida is taken as a second focus.

Iraq/Asia is below the importance threshold (.9), so it is ignored.

18/25

The Focus Scoring Algorithm

For a place with taxonomy node A/B/C and confidence $p \in [0, 1]$:

Add $p^2$ to A/B/C.

Add $p^2 d$ to the enclosing region B/C.

Add $p^2 d^2$ to C, where $0 < d < 1$ (0.7 is used).

Sort the taxonomy nodes by score and loop over them from highest to lowest.

Stop at the low threshold (.9), or if sufficiently many foci were already found (4).

Skip taxonomy levels that cover or are covered by one already selected as a focus.

Otherwise, add this level to the list of foci.
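The scoring and selection loop can be sketched as below and checked against the Orlando Weekly example. Several things here are assumptions: the function names, the generalization of the A/B/C rule to paths of any depth (the k-th enclosing level receives p²dᵏ), and the full taxonomy paths ending in "US/North America", which the slides abbreviate.

```python
from collections import defaultdict


def score_nodes(mentions, d=0.7):
    """mentions: list of (taxonomy_node, confidence). Returns node -> score."""
    scores = defaultdict(float)
    for node, p in mentions:
        parts = node.split("/")
        for level in range(len(parts)):
            # The node itself gets p^2; its k-th enclosing region gets p^2 * d^k.
            scores["/".join(parts[level:])] += p * p * d ** level
    return scores


def covers(a, b):
    """True if region a encloses region b in the taxonomy."""
    return b.endswith("/" + a)


def find_foci(mentions, d=0.7, threshold=0.9, max_foci=4):
    """Pick foci from highest to lowest score, skipping overlapping regions."""
    scores = score_nodes(mentions, d)
    foci = []
    for node, score in sorted(scores.items(), key=lambda item: -item[1]):
        if score < threshold or len(foci) >= max_foci:
            break
        if any(covers(node, f) or covers(f, node) for f in foci):
            continue
        foci.append(node)
    return foci


US = "US/North America"  # assumed full path for the abbreviated slide nodes
mentions = (
    [("Orlando/Florida/" + US, 0.5)] * 4
    + [("Texas/" + US, 0.75)] * 3
    + [("Fort Worth/Texas/" + US, 0.75)] * 8
    + [("Dallas/Texas/" + US, 0.75)] * 3
    + [("Garland/Texas/" + US, 0.75)]
    + [("Iraq/Asia", 0.5)]
)
print(find_foci(mentions))
```

Under these assumptions Texas scores about 6.41, ahead of the US (about 4.98) and Fort Worth (4.5), both of which are skipped because they overlap Texas. Orlando scores 1.0 and becomes the second focus, while Iraq (0.25) falls below the threshold.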

19/25

Implementation

The Web-a-Where geotagger, implemented as a WebFountain miner, outputs the meaning for each place name in the text, together with a confidence figure for that meaning.

It also produces a set of up to four foci per page.

20/25

Evaluating the Geotagger (1/3)

Evaluation corpus: 200 pages per collection.

Arbitrary collection: query Google with +the, +and, +in, and collect the top 1000 results for each query.

.GOV collection: pages from the .gov domain, used as the standard test collection for TREC03.

Open Directory Project (ODP) collection: the Regional sub-directory.

The chosen pages are larger than 3k.

21/25

Evaluating the Geotagger (2/3)

22/25

Evaluating the Geotagger (3/3)

23/25

Testing Page Focus (1/2)

A random sample of 20,000 Web pages larger than 3k was taken from the ODP’s Regional section.

24/25

Testing Page Focus (2/2)

Will enlarging the gazetteer by adding smaller and smaller towns improve the page-focus determination, or hurt it?

=> Web-a-Where is helped, not hurt, by an improved gazetteer.

25/25

Conclusion

Experiments show that Web-a-Where correctly tags individual place-name occurrences 80% of the time and recognizes the correct focus of a page 91% of the time.

The main source of errors is geo/non-geo ambiguity, which a part-of-speech tagger could help to reduce.

Geo/geo accuracy can be improved by improving the “disambiguating context” heuristics, and by devising additional ones.
