Web-a-Where: Geotagging Web Content

Einat Amitay, Nadav Har’El, Ron Sivan and Aya Soffer, IBM Haifa Research Lab, Haifa, Israel (SIGIR 2004, pp. 273-280)

Upload: joel-holden

Post on 04-Jan-2016

Page 1: Web-a-Where: Geotagging Web Content

Web-a-Where: Geotagging Web Content

Einat Amitay, Nadav Har’El, Ron Sivan and Aya Soffer

IBM Haifa Research Lab, Haifa, Israel (SIGIR 2004, pp. 273-280)

Page 2: Web-a-Where: Geotagging Web Content

Abstract

Web-a-Where: a system for associating geography with Web pages.

It locates mentions of places and determines the place each name refers to.

It assigns to each page a geographic focus: a locality that the page discusses as a whole.

The tagger achieves a precision of up to 82% on individual geotags.

91% of the foci reported are correct up to the country level.

Page 3: Web-a-Where: Geotagging Web Content

Introduction (1/3)

A page may have two types of geography associated with it: a source and a target.

E.g., a news article about Northern Ireland (the target) appearing on the CNN site (the source, USA).

Two types of ambiguity:

geo/non-geo: a place name that is also a person's name (e.g., Berlin) or a common word (e.g., Turkey).

geo/geo: distinct places have the same name, e.g., London, England vs. London, Ontario.

Page 4: Web-a-Where: Geotagging Web Content

Introduction (2/3)

In the USA, there are 18 cities named Jerusalem, 24 named Paris, and 63 Springfields in 34 states.

On Web pages, 37% of place-name mentions have several possible geographic meanings, and the average number of possible meanings per mention is roughly 2.

Page 5: Web-a-Where: Geotagging Web Content

Introduction (3/3)

Geographic name disambiguation:

The NLP approach employs machine learning to recognize names from their structure and context => too complex to run fast on the tremendous amount of data on the Web.

The gazetteer approach recognizes names by looking them up in a glossary of places.

Web-a-Where employs the gazetteer approach.

Its goal is to identify all geographic mentions in Web pages, assign a geographic location and confidence level to each, and derive a focus (or foci) for the entire page.

Page 6: Web-a-Where: Geotagging Web Content

Tagging Individual Place Names

Geotagger: finds and disambiguates geographic names, currently those of cities, states and countries.

Each place is identified by a canonical path called a taxonomy node, e.g., Paris/France/Europe.

The list of geographic names, their canonical taxonomy nodes and other pertinent information is kept in a database known as a gazetteer.

The processing of a page is done in three phases: spotting, disambiguation, and focus determination.

Page 7: Web-a-Where: Geotagging Web Content

The Gazetteer (1/3)

The gazetteer contains a hierarchical view of the world, divided into continents, countries, states (for some countries) and cities.

A place is associated with a canonical taxonomy node, a number of names and/or abbreviations, world coordinates and a population estimate.

Nearly 40,000 places are listed, together with alternative spellings and abbreviations, for a total of about 75,000 names: most in English and some in the vernacular.
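
A gazetteer entry as described above could be modeled roughly as follows. This is only a sketch: the `Place` class, the `build_name_index` helper and the sample entries (including their coordinates and populations, which are rough placeholders) are assumptions for illustration, not the system's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Place:
    taxonomy_node: str   # canonical node, e.g. "Paris/France/Europe"
    names: list          # all names, alternative spellings and abbreviations
    latitude: float      # world coordinates
    longitude: float
    population: int      # population estimate


def build_name_index(places):
    """Map each lower-cased name to every place that carries it."""
    index = {}
    for place in places:
        for name in place.names:
            index.setdefault(name.lower(), []).append(place)
    return index


# Illustrative entries only; the numbers are placeholders, not gazetteer data.
places = [
    Place("Paris/France/Europe", ["Paris"], 48.86, 2.35, 2_100_000),
    Place("Paris/Texas/United States/North America", ["Paris"], 33.66, -95.56, 25_000),
]
index = build_name_index(places)
print([p.taxonomy_node for p in index["paris"]])
```

Indexing by name rather than by place makes geo/geo ambiguity explicit: a name that maps to more than one `Place` needs disambiguation.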

Page 8: Web-a-Where: Geotagging Web Content

The Gazetteer (2/3)

The gazetteer is automatically created from a number of freely available data sources.

The gazetteer contains a non-geo section listing place names that are also very commonly used words, e.g., “To” (Myanmar) or “Of” (Turkey).

Page 9: Web-a-Where: Geotagging Web Content

The Gazetteer (3/3)

Two tests were used to build the non-geo section:

Names that appeared more than 100 times, but in most cases were not capitalized as a name should be, were included. E.g., “Asbestos” (Quebec) and “Humble” (Texas).

Names mentioned much more frequently than their population would warrant were also included. E.g., “Grove” (Spain, population 10,976) and “Atlantic” (Iowa, population 7,474): their high frequency did not match their small populations.

The resulting list requires a manual pass, e.g., to remove “Aspen” and add “Metro”.
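
The two tests might be expressed along these lines. This is a hedged sketch: only the 100-occurrence cutoff comes from the slide. Reading "in most cases" as "more than half", the `max_per_capita` ratio, the function name and the sample counts are all assumptions made for illustration.

```python
def belongs_in_non_geo(count, capitalized, population,
                       min_count=100, max_per_capita=0.01):
    """Decide whether a gazetteer name should be listed as non-geo.

    count        -- how often the name appears in a reference corpus
    capitalized  -- how many of those appearances are capitalized
    population   -- population estimate of the place
    """
    # Test 1: frequent, but mostly written in lower case (assumed: > half).
    mostly_lowercase = count > min_count and capitalized < count / 2
    # Test 2: far more mentions than the population would warrant
    # (the per-capita ratio is a hypothetical stand-in for that judgment).
    too_frequent = count > max_per_capita * population
    return mostly_lowercase or too_frequent


# Made-up counts, for illustration only.
print(belongs_in_non_geo(count=5_000, capitalized=300, population=6_000))
print(belongs_in_non_geo(count=20_000, capitalized=19_900, population=8_000_000))
```

Neither test is exact, which is why the output still needs the manual pass mentioned above.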

Page 10: Web-a-Where: Geotagging Web Content

Spotting Place Name Candidates

Goal: find (or spot) all the possible geographic names in each page.

Names are matched case-insensitively.

The list of words to spot is the list of all names in the gazetteer.

Short abbreviations are not spotted since they are often too ambiguous, such as IN (Indiana or India, but also a common English preposition) or AT (for Austria).
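
A minimal sketch of the spotting phase, assuming the gazetteer names are available as a plain list. The function name, the regular-expression approach and the rule used to recognize a "short abbreviation" (all upper case, at most two letters) are assumptions, not the system's actual implementation.

```python
import re


def spot_candidates(text, gazetteer_names, max_abbrev_len=2):
    """Return (offset, matched_text) for every gazetteer name found in text."""
    # Skip short abbreviations such as "IN" or "AT": too ambiguous to spot.
    names = [n for n in gazetteer_names
             if not (n.isupper() and len(n) <= max_abbrev_len)]
    if not names:
        return []
    # Try longer names first so that multi-word names win over their prefixes.
    names.sort(key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,  # case-insensitive appearance
    )
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


text = "Flights from london to Paris, then on to Springfield in IL."
spots = spot_candidates(text, ["London", "Paris", "Springfield", "IN", "IL"])
print(spots)
```

The word "in" and the abbreviation "IL" are not reported as spots here; abbreviations only come into play later, as qualifiers next to a spotted name.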

Page 11: Web-a-Where: Geotagging Web Content

Disambiguating Spots (1/2)

If the tokens in the vicinity of a spot can uniquely qualify it, as in “IL” immediately following a spot of “Chicago”, the geotagger assigns this meaning to that spot with a confidence in the range 0.95-1. Otherwise, the spot is left unassigned for the moment.

Each unresolved spot is then assigned its default meaning, the place with the largest population, with a confidence of 0.5.

If the page has multiple spots of the same name and only one is qualified, the meaning of the qualified spot is propagated to the others, with a confidence in the range 0.8-0.9.
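
These three steps could be sketched as below. The slide gives confidence ranges only, so the concrete values 0.95 and 0.8 are picked from the low end of each range as an assumption. The function name, the input shapes and the sample populations are likewise hypothetical, and the qualified spots are assumed to have been found already.

```python
def disambiguate(spots, candidates, qualified):
    """spots: list of spotted names, in page order.
    candidates: name -> list of (taxonomy_node, population).
    qualified: spot index -> taxonomy_node, for spots resolved by nearby tokens.
    Returns one (taxonomy_node, confidence) pair per spot."""
    result = [None] * len(spots)

    # Step 1: spots uniquely qualified by their vicinity (range 0.95-1).
    for i, node in qualified.items():
        result[i] = (node, 0.95)

    # Step 2: default meaning for the rest: the most populous candidate.
    for i, name in enumerate(spots):
        if result[i] is None:
            node, _population = max(candidates[name], key=lambda c: c[1])
            result[i] = (node, 0.5)

    # Step 3: propagate a qualified meaning to other spots of the same name,
    # assuming the qualified spots of that name agree on one meaning.
    meanings = {}
    for i, node in qualified.items():
        meanings.setdefault(spots[i], set()).add(node)
    for i, name in enumerate(spots):
        if i not in qualified and len(meanings.get(name, ())) == 1:
            result[i] = (next(iter(meanings[name])), 0.8)
    return result


# Populations are rough placeholders for illustration.
candidates = {
    "London": [("London/England/UK", 7_000_000), ("London/Ontario/Canada", 350_000)],
    "Paris": [("Paris/France/Europe", 2_100_000), ("Paris/Texas/US", 25_000)],
}
tags = disambiguate(["London", "London", "Paris"], candidates,
                    qualified={0: "London/Ontario/Canada"})
print(tags)
```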

Page 12: Web-a-Where: Geotagging Web Content

Disambiguating Spots (2/2)

A disambiguating context is now sought for the spots that are still unresolved (confidence < 0.7).

E.g., “London” and “Hamilton” appear in the same page without further qualification. Candidates for “London” include “England, UK” and “Ontario, Canada”, while “Hamilton” exists in “Ohio, USA”, “Ontario, Canada” and “New Zealand”. The smallest disambiguating context is “Ontario, Canada”.

The resulting confidence is between 0.65 and 0.75, depending on whether the assigned meaning matches the spot’s default meaning.
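
One way to search for the smallest disambiguating context is sketched below. It is an interpretation rather than the published heuristic: the function names are hypothetical, taxonomy nodes are assumed to be "/"-separated paths, and mapping "matches the default" to 0.75 and "differs" to 0.65 is an assumption about how the stated range is used.

```python
def enclosing_regions(node):
    """'London/Ontario/Canada' -> ['Ontario/Canada', 'Canada'], smallest first."""
    parts = node.split("/")
    return ["/".join(parts[k:]) for k in range(1, len(parts))]


def resolve_by_context(candidates, defaults):
    """candidates: name -> candidate taxonomy nodes, for unresolved names only.
    defaults: name -> default (most populous) node.
    Returns name -> (node, confidence) where a shared region picks one node."""
    resolved = {}
    for name, nodes in candidates.items():
        # Regions that enclose some candidate of another unresolved name.
        others = {region
                  for other, other_nodes in candidates.items() if other != name
                  for n in other_nodes
                  for region in enclosing_regions(n)}
        best_depth, best_nodes = -1, []
        for node in nodes:
            shared = [r for r in enclosing_regions(node) if r in others]
            if not shared:
                continue
            depth = shared[0].count("/")  # deeper path = smaller region
            if depth > best_depth:
                best_depth, best_nodes = depth, [node]
            elif depth == best_depth:
                best_nodes.append(node)
        if len(best_nodes) == 1:  # a tie means the context does not disambiguate
            node = best_nodes[0]
            resolved[name] = (node, 0.75 if node == defaults[name] else 0.65)
    return resolved


candidates = {
    "London": ["London/England/UK", "London/Ontario/Canada"],
    "Hamilton": ["Hamilton/Ohio/USA", "Hamilton/Ontario/Canada", "Hamilton/New Zealand"],
}
# Defaults are supplied as input here; they are assumed, not looked up.
defaults = {"London": "London/England/UK", "Hamilton": "Hamilton/Ontario/Canada"}
print(resolve_by_context(candidates, defaults))
```

For the London/Hamilton example, the only region shared by both names is “Ontario/Canada”, so both spots resolve to their Ontario meanings.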

Page 13: Web-a-Where: Geotagging Web Content

Determining Page Focus

Once we have determined the correct meaning of every geographic name mentioned in the page, we would also like a way to decide which geographic mentions are incidental and which constitute the actual focus of the page.

Page 14: Web-a-Where: Geotagging Web Content

Rationale of Focus Algorithm

The basic idea is that if several cities from the same region are mentioned, this might mean that this region is the focus.

E.g., San Francisco (Calif.), Los Angeles (Calif.) and San Diego (Calif.) => California.

E.g., San Jose (Calif.), Chicago (Ill.) => US.

Sometimes there are many foci in a page => try to coalesce many places into one region before declaring foci.

If a small region is the real focus, we should not report a larger region.

Page 15: Web-a-Where: Geotagging Web Content

Outline of Focus Algorithm (1/3)

Each mention adds a score to its taxonomy node (e.g., Paris/France/Europe), and lower scores to the enclosing hierarchies (France/Europe and Europe).

The region that the focus-finding algorithm reports is a taxonomy node from the gazetteer.

The focus is limited to being a city, state, country or continent.

Page 16: Web-a-Where: Geotagging Web Content

Outline of Focus Algorithm (2/3)

Example: a page contains 4 mentions of Orlando/Florida (confidence 0.5), 3 of Texas (0.75), 8 of Fort Worth/Texas (0.75), 3 of Dallas/Texas (0.75), 1 of Garland/Texas (0.75), and 1 of Iraq (0.5).

A human reader would say the page is about Texas and perhaps also Orlando.

Indeed, that page comes from the “Orlando Weekly” site, in a forum titled “Just a look at the Texas Local Music Scene”.

Page 17: Web-a-Where: Geotagging Web Content

Outline of Focus Algorithm (3/3)

Texas gets the top score because several separate cities (Fort Worth, Dallas and Garland) contribute to it, so it is chosen as a focus.

The US already covers Texas, so it is dropped.

Fort Worth is covered by Texas, so it is dropped.

Orlando/Florida is taken as a second focus.

Iraq/Asia is below the importance threshold (0.9), so it is ignored.

Page 18: Web-a-Where: Geotagging Web Content

The Focus Scoring Algorithm

For a place with taxonomy node A/B/C and confidence p ∈ [0, 1]:

Add p² to A/B/C.

Add p²d to the enclosing region B/C.

Add p²d² to C, where 0 < d < 1 (d = 0.7).

Sort the taxonomy nodes by score and loop over them from highest to lowest.

Stop at the low threshold (0.9), or if sufficiently many foci (4) were already found.

Skip taxonomy levels that cover or are covered by one already selected as a focus.

Otherwise, add this level to the list of foci.
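
The scoring and selection loop can be sketched as follows. This is a minimal sketch, assuming taxonomy nodes are "/"-separated paths and generalizing the rule to p²·d^k for the k-th enclosing level. The function name `find_foci` is hypothetical, and the sample input is the Orlando Weekly example from the earlier slide, with “US” assumed as the enclosing country.

```python
from collections import defaultdict


def find_foci(mentions, d=0.7, threshold=0.9, max_foci=4):
    """mentions: list of (taxonomy_node, confidence), one pair per spot.
    Returns (foci, scores)."""
    scores = defaultdict(float)
    for node, p in mentions:
        parts = node.split("/")
        # k = 0 is the node itself, k = 1 its enclosing region, and so on.
        for k in range(len(parts)):
            scores["/".join(parts[k:])] += (p ** 2) * (d ** k)

    def related(a, b):
        # True if the two nodes are equal or one encloses the other.
        return a == b or a.endswith("/" + b) or b.endswith("/" + a)

    foci = []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for node, score in ranked:
        if score < threshold or len(foci) >= max_foci:
            break
        if any(related(node, focus) for focus in foci):
            continue  # covers, or is covered by, an already selected focus
        foci.append(node)
    return foci, dict(scores)


mentions = (
    [("Orlando/Florida/US", 0.5)] * 4
    + [("Texas/US", 0.75)] * 3
    + [("Fort Worth/Texas/US", 0.75)] * 8
    + [("Dallas/Texas/US", 0.75)] * 3
    + [("Garland/Texas/US", 0.75)]
    + [("Iraq/Asia", 0.5)]
)
foci, scores = find_foci(mentions)
print(foci)
```

Under these assumptions Texas scores highest (about 6.41), US and the Texan cities are skipped because they cover or are covered by Texas, Orlando passes the 0.9 threshold with a score of 1.0, and Iraq (0.25) falls below it.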

Page 19: Web-a-Where: Geotagging Web Content

Implementation

The Web-a-Where geotagger, implemented as a WebFountain miner, outputs the meaning for each place name in the text, together with a confidence figure for that meaning.

It also produces a set of up to four foci per page.

Page 20: Web-a-Where: Geotagging Web Content

Evaluating the Geotagger (1/3)

Evaluation corpus: 200 pages per collection.

Arbitrary collection: query Google with +the, +and, +in, and collect the top 1000 results for each query.

.GOV collection: pages from the .gov domain, the standard test collection of TREC03.

Open Directory Project (ODP) collection: pages from the Regional sub-directory.

The chosen pages are larger than 3k.

Page 21: Web-a-Where: Geotagging Web Content

Evaluating the Geotagger (2/3)

Page 22: Web-a-Where: Geotagging Web Content

Evaluating the Geotagger (3/3)

Page 23: Web-a-Where: Geotagging Web Content

Testing Page Focus (1/2)

Test set: a random sample of 20,000 Web pages from the ODP’s Regional section that were larger than 3k.

Page 24: Web-a-Where: Geotagging Web Content

Testing Page Focus (2/2)

Will enlarging the gazetteer by adding smaller and smaller towns improve the page-focus determination, or hurt it? => Web-a-Where is helped, not hurt, by an improved gazetteer.

Page 25: Web-a-Where: Geotagging Web Content

Conclusion

Experiments show that Web-a-Where correctly tags individual place-name occurrences 80% of the time and recognizes the correct focus of a page 91% of the time.

The main source of errors is geo/non-geo ambiguity => use a part-of-speech tagger.

Geo/geo accuracy can be improved by improving the “disambiguating context” heuristics, and by devising additional ones.