John Fagan - The Black Art of Geocoding
DESCRIPTION
Mapping/LBS applications require 3 core engines, namely Mapping, Routing and Geocoding. The latter is often overlooked, but Geocoding is the fundamental component of all Mapping and LBS applications. If you don’t have a lat/lon, then how do you find a map, how do you get from A to B, how do you plot your data? This paper will give a whistlestop tour of the basics of mapping and routing engines and then do a deep dive on Geocoding. It will suggest that we have solved routing and mapping, but we have a lot of work to do with Geocoding.
TRANSCRIPT
The Black Art of Geocoding
Finding that elusive lat/lon
John Fagan, Microsoft
John Fagan
Program Manager
Microsoft Corporation
@johnbfagan
We have been making maps for 1000’s of years
Well known and established standards/principles
Lots of experience in building software to create bitmaps from vector and raster data
Mapping easy to scale
...and so is routing
1000’s of years experience in wayfinding
Over 50 years experience in routing algorithms
Dijkstra's shortest path algorithm (1959)
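As a reminder of how compact the routing core is, here is a minimal sketch of Dijkstra's shortest path algorithm. The graph layout (a dict of node to a list of `(neighbour, cost)` pairs), the function name `shortest_path` and the toy road network are assumptions made for illustration; they are not taken from any production routing engine.

```python
import heapq

def shortest_path(graph, source, target):
    """Dijkstra's algorithm over a dict: node -> [(neighbour, cost), ...].

    Costs must be non-negative. Returns (total_cost, path) or None
    when the target cannot be reached from the source.
    """
    dist = {source: 0}      # best known cost to each node
    prev = {}               # predecessor on the best known path
    queue = [(0, source)]   # min-heap of (cost so far, node)
    visited = set()

    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue        # stale queue entry, a cheaper one was processed
        visited.add(node)
        if node == target:
            break
        for neighbour, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour] = nd
                prev[neighbour] = node
                heapq.heappush(queue, (nd, neighbour))

    if target not in dist:
        return None

    # Walk the predecessor links back from target to source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path

# Hypothetical toy network: nodes are junctions, costs are minutes.
ROADS = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 3)],
    "c": [("b", 2), ("d", 7)],
    "d": [],
}
```

Note that the inputs are already nodes in a graph, which is to say the start and end points have already been geocoded.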
Routing, easy to scale
Geocoding not so easy
20 years experience
10 years of global Geocoding
5 years exposing geocoding to the mass consumer
No standard algorithms
Very few databases purpose built (maybe GNAF)
Very hard to scale
Geocoding is fundamental
Can't get a map without a geocode
Can't get a route without a geocode
Can't view your data without a geocode
80% of all information contains a geographic element.
It used to be easier
Now it's hard
User expectations change with unstructured input
67 hill veiw road, s61 2bn in the 1850's
1.5 hours from Nice
exact directions from Bangkok Patana School to Suvanapumi Airport in Bangkok.
10 mile radius from se20 7ua
how long would it take me to walk around cancun
how to get to m13 gb from g83 9le by car
do bearded dragons bite?
But... Geocoding is NOT about Search
52.19157,-1.70415
"The reason it's called 'I'm Feeling Lucky,' is of course that's a pretty damn ambitious goal. I mean to get the exact right one thing without even giving you a list of choices, and so you have to feel a little bit lucky if you're going to try that with one go," tried to explain Sergey Brin.
Why is it hard? (2 reasons)
Parsing: Hard to understand unstructured input
Finding Stratford-upon-Avon
stratford
stratford upon avon
Stratford upon haven
StratfordUponAvon
Stratford-Upon-Avon
stratford on avon
stratford-on-avon
stratford 0n avon
stratford - upon-avon
stratford on avaon
stratford apon avon
stratford upon aavon
stratford uppon avon
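One way to see why these variants are awkward is to try matching them against a single canonical name. The sketch below normalises case, punctuation and CamelCase, then scores each query with `difflib.SequenceMatcher` from the Python standard library. The function names, the canonical string and the idea of thresholding a similarity ratio are assumptions for illustration, not a description of how any real geocoder handles this.

```python
import difflib
import re

CANONICAL = "Stratford-upon-Avon"

def normalise(text):
    """Lower-case, split CamelCase, and collapse punctuation to single spaces."""
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)  # StratfordUpon -> Stratford Upon
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)           # hyphens, commas, spaces
    return " ".join(text.split())

def similarity(query, canonical=CANONICAL):
    """Similarity ratio in [0, 1] between the normalised query and canonical name."""
    return difflib.SequenceMatcher(
        None, normalise(query), normalise(canonical)
    ).ratio()

if __name__ == "__main__":
    for q in ["StratfordUponAvon", "stratford 0n avon",
              "stratford uppon avon", "stratford"]:
        print(f"{q!r}: {similarity(q):.2f}")
```

Normalisation alone absorbs the hyphen, spacing and case variants, and the typos still score highly. The bare query "stratford" scores much lower, which is reasonable: on its own it does not say which Stratford is meant.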
Parsing
In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens (for example, words), to determine its grammatical structure with respect to a given (more or less) formal grammar.
http://en.wikipedia.org/wiki/Parsing
Old way of Parsing – Rules based
A rules based approach (mainly done with regular expressions)
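A minimal sketch of what a rules based parser can look like: two regular expressions, one for a UK-style postcode and one for a leading house number. The patterns are deliberately simplified (the postcode pattern is not the full official format) and the function name and output keys are assumptions for illustration.

```python
import re

# Simplified UK postcode shape: outward code, optional space, inward code.
POSTCODE_RE = re.compile(
    r"\b([A-Z]{1,2}[0-9][A-Z0-9]?)\s*([0-9][A-Z]{2})\b", re.IGNORECASE
)
# A house number at the very start of the query, e.g. "165" or "12a".
HOUSE_NUMBER_RE = re.compile(r"^\s*(\d+[a-z]?)\b", re.IGNORECASE)

def parse_with_rules(query):
    """Pull out the parts the rules recognise; everything else is 'remainder'."""
    result = {}
    rest = query

    match = POSTCODE_RE.search(rest)
    if match:
        result["postcode"] = f"{match.group(1).upper()} {match.group(2).upper()}"
        rest = rest[:match.start()] + rest[match.end():]

    match = HOUSE_NUMBER_RE.match(rest)
    if match:
        result["street_number"] = match.group(1)
        rest = rest[match.end():]

    # Whatever is left cannot be split by these rules alone.
    rest = " ".join(rest.replace(",", " ").split())
    if rest:
        result["remainder"] = rest
    return result

if __name__ == "__main__":
    print(parse_with_rules("165 fleet street london EC4A 2DY"))
```

The rules find the number and the postcode, but "fleet street london" is left as one unsplit remainder: deciding where the street ends and the city begins needs either more rules or a different approach.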
Probabilistic approach
Machine learned
Requires you to “train” the engine
Requires truth sets of training data
http://en.wikipedia.org/wiki/Hidden_Markov_model
Probabilistic approach: Hidden Markov Model
input --> 165 fleet street london EC4A 2DY
output --> address {
  street number : 165
  street : fleet street
  city : london
  postcode : EC4A 2DY
}
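To make the idea concrete, here is a toy Hidden Markov Model tagger decoded with the Viterbi algorithm. Everything in it is an assumption for illustration: the four states, the four coarse token features, and above all the probabilities, which are hand-picked rather than learned from a truth set. A real engine would estimate them from training data.

```python
import math

STATES = ["NUMBER", "STREET", "CITY", "POSTCODE"]

# Hand-picked illustrative probabilities (NOT trained on real data).
START = {"NUMBER": 0.6, "STREET": 0.2, "CITY": 0.1, "POSTCODE": 0.1}
TRANS = {
    "NUMBER":   {"NUMBER": 0.05, "STREET": 0.85, "CITY": 0.05, "POSTCODE": 0.05},
    "STREET":   {"NUMBER": 0.02, "STREET": 0.50, "CITY": 0.40, "POSTCODE": 0.08},
    "CITY":     {"NUMBER": 0.02, "STREET": 0.03, "CITY": 0.45, "POSTCODE": 0.50},
    "POSTCODE": {"NUMBER": 0.05, "STREET": 0.05, "CITY": 0.10, "POSTCODE": 0.80},
}
EMIT = {
    "NUMBER":   {"digits": 0.90, "word": 0.03, "street_type": 0.02, "alnum": 0.05},
    "STREET":   {"digits": 0.05, "word": 0.50, "street_type": 0.40, "alnum": 0.05},
    "CITY":     {"digits": 0.02, "word": 0.90, "street_type": 0.05, "alnum": 0.03},
    "POSTCODE": {"digits": 0.05, "word": 0.03, "street_type": 0.02, "alnum": 0.90},
}
STREET_TYPES = {"street", "road", "lane", "avenue"}  # tiny hypothetical lexicon

def feature(token):
    """Map a raw token to the coarse observation symbol the model emits."""
    t = token.lower()
    if t.isdigit():
        return "digits"
    if t in STREET_TYPES:
        return "street_type"
    if any(c.isdigit() for c in t):
        return "alnum"          # mixed letters and digits, e.g. "EC4A"
    return "word"

def viterbi(tokens):
    """Return [(token, state), ...] for the most probable state sequence."""
    if not tokens:
        return []
    obs = [feature(t) for t in tokens]
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {s: (math.log(START[s]) + math.log(EMIT[s][obs[0]]), [s])
            for s in STATES}
    for o in obs[1:]:
        new_best = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p][0] + math.log(TRANS[p][s]))
            score = (best[prev][0] + math.log(TRANS[prev][s])
                     + math.log(EMIT[s][o]))
            new_best[s] = (score, best[prev][1] + [s])
        best = new_best
    final = max(STATES, key=lambda s: best[s][0])
    return list(zip(tokens, best[final][1]))

if __name__ == "__main__":
    for token, state in viterbi("165 fleet street london EC4A 2DY".split()):
        print(f"{token:8} {state}")
```

With these made-up numbers the tagger labels "fleet street" as STREET and "london" as CITY without any rule that says so explicitly; the split falls out of the transition and emission probabilities.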
Multimap stats
grammar                               % share
Postcode                              67.9
Locality                              14.8
Landmark                              3.3
Street name                           3
Street name, Locality                 2.4
street number street name             0.5
County                                0.5
street number street name locality    0.5
Locality county                       0.4
Parsing has its limitations
Parsing failures
Multimap/Bing Maps (st andrews scotland)
Google (uk near Boston, MA, USA)
All fail - House number plus postcode (165, EC4A 2DY)
Parsing using a Spatial Engine
http://research.microsoft.com/en-us/people/josephj/acm_gis_2007_robust_location_search.pdf
Why is it hard? (Data)
[OSM-talk] Baghdad maps
I am informed that any road may have up to 4 names (which may be the same or different):
1) The pre-Saddam name
2) The Saddam-era name.
3) The "public" name - What the people who live there call it.
4) The "Official" name - What the new Government calls it.
This situation is further complicated by language and social issues:
Language:
5) The roads are names in Arabic.
6) There is no fixed translation between the Arabic and Latin alphabets.
Social Issues:
1) Sunnis tend to use the Saddam-era names
7) Shia tend to rename streets and won't acknowledge Saddam-era names.
8) Ethnic cleansing is changing the neighbourhoods and hence the names.
9) Names (such as 14th July Bridge) will change later.
My translator's opinion is that street names are going to take at least 2-3 years to settle down.
http://lists.openstreetmap.org/pipermail/talk/2007-February/011273.html
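For a geocoder, the practical consequence is that one feature needs many searchable names. A minimal sketch of an alias index follows; the class name, the feature ids and the placeholder names are all hypothetical and are not real street names.

```python
from collections import defaultdict

class AliasIndex:
    """Maps every known name of a feature back to that feature's id."""

    def __init__(self):
        self._by_name = defaultdict(set)

    @staticmethod
    def _key(name):
        # Case-insensitive, whitespace-insensitive lookup key.
        return " ".join(name.casefold().split())

    def add(self, feature_id, names):
        """Register all names (old, official, colloquial) for one feature."""
        for name in names:
            self._by_name[self._key(name)].add(feature_id)

    def lookup(self, name):
        """Return the sorted ids of every feature known by this name."""
        return sorted(self._by_name.get(self._key(name), set()))

# Placeholder data: one road with four names, one of them shared.
index = AliasIndex()
index.add("road-1", ["Old Name Street", "Era Name Street",
                     "Local Name Street", "Official Name Street"])
index.add("road-2", ["Official Name Street"])
```

A lookup can legitimately return more than one feature, so the caller still has to rank or disambiguate the candidates.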
Don't throw away your data
Multimap have always kept old postcodes
10% of Multimap’s postcode database is made up of “dead” postcodes
This might not work for routing and mapping, but it is very valuable for Geocoding
EC4A 1HE – Postcode of vintage 2002
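A sketch of what "don't throw away your data" can mean in code: retired postcodes are flagged rather than deleted, so a query using an old postcode still resolves. The class names and fields are assumptions for illustration, and the coordinates in the usage note are placeholders, not surveyed positions.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PostcodeRecord:
    postcode: str
    lat: float
    lon: float
    retired: bool = False   # True once the postcode is no longer in use

def normalise_postcode(raw):
    """Strip all whitespace and upper-case, so 'ec4a 1he' == 'EC4A1HE'."""
    return "".join(raw.split()).upper()

class PostcodeGazetteer:
    def __init__(self):
        self._records = {}

    def add(self, record):
        self._records[normalise_postcode(record.postcode)] = record

    def retire(self, postcode):
        """Mark a postcode as dead but keep its location."""
        key = normalise_postcode(postcode)
        self._records[key] = replace(self._records[key], retired=True)

    def geocode(self, postcode, include_retired=True):
        """Return the record, or None if unknown (or retired and excluded)."""
        record = self._records.get(normalise_postcode(postcode))
        if record is None or (record.retired and not include_retired):
            return None
        return record
```

For example, after `add(PostcodeRecord("EC4A 1HE", 51.5, -0.1))` (placeholder coordinates) and `retire("EC4A 1HE")`, a geocoding call still returns the record, while a caller that only wants live postcodes can pass `include_retired=False`.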
Lash data and enrich
Stratford-upon-Avon
Future = Real time Geocoding?
Summary
Mapping and Routing – FIXED
Geocoding – Must Try Harder
Parsing
Data
thanks
john fagan
ubergeo.com
@johnbfagan