creating open data with open source (beta2)

47
Creating Open Data with Open Source Sammy Fung sammy.hk [ITFest.HK] Seminar of Free / Open Source in Hong Kong, April 2013.

Upload: sammy-fung

Post on 13-May-2015

761 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: Creating Open Data with Open Source (beta2)

Creating Open Data withOpen Source

Sammy Fung

sammy.hk

[ITFest.HK] Seminar of Free / Open Source in Hong Kong, April 2013.

Page 2: Creating Open Data with Open Source (beta2)

Agenda

● What is Open Data ?● Use of Open Source Software in web crawling.● Starting new Open Source projects to create

Open Data.

Page 3: Creating Open Data with Open Source (beta2)

Sammy Fung

● Software Developer using open source.– Perl → PHP → Python.– Data Mining / Web Crawling.– Also deploying OpenStack Cloud and Linux Solutions.

● Open Source Community Leader.– opensource.hk, HKLUG, GNOME Asia committee, Mozilla

Rep, and program committee member of the largest Taiwan open source conference - COSCUP.

● Blogger at sammy.hk.

Page 4: Creating Open Data with Open Source (beta2)

Open Data

Three Laws of Open Government Data by David Eaves.

1.If it can't be spidered or indexed, it doesn't exist.

2.If it isn't available in open and machine readable format, it can't engage.

3.If a legal framework doesn't allow it to be repurposed, it doesn't empower.

http://eaves.ca/2009/09/30/three-law-of-open-government-data/

Page 5: Creating Open Data with Open Source (beta2)

Open Data

● Tim Berners-Lee, the inventor of the Web.– 5stardata.info– 5 star deployment scheme of Open Data.

Page 6: Creating Open Data with Open Source (beta2)

* One Star - Open Data

1.make your stuff available on the Web (whatever format) under an open license.

2.make it available as structured data (e.g., Excel instead of image scan of a table)

3.use non-proprietary formats (e.g., CSV instead of Excel)

4.use URIs to denote things, so that people can point at your stuff.

5.link your data to other data to provide context.

5stardata.info by Tim Berners-Lee, the inventor of the Web.

Page 7: Creating Open Data with Open Source (beta2)

** Two Star - Open Data

1.make your stuff available on the Web (whatever format) under an open license.

2.make it available as structured data (e.g., Excel instead of image scan of a table)

3.use non-proprietary formats (e.g., CSV instead of Excel)

4.use URIs to denote things, so that people can point at your stuff.

5.link your data to other data to provide context.

5stardata.info by Tim Berners-Lee, the inventor of the Web.

Page 8: Creating Open Data with Open Source (beta2)

*** Three Star - Open Data

1.make your stuff available on the Web (whatever format) under an open license.

2.make it available as structured data (e.g., Excel instead of image scan of a table)

3.use non-proprietary formats (e.g., CSV instead of Excel)

4.use URIs to denote things, so that people can point at your stuff.

5.link your data to other data to provide context.

5stardata.info by Tim Berners-Lee, the inventor of the Web.

Page 9: Creating Open Data with Open Source (beta2)

**** Four Star - Open Data

1.make your stuff available on the Web (whatever format) under an open license.

2.make it available as structured data (e.g., Excel instead of image scan of a table)

3.use non-proprietary formats (e.g., CSV instead of Excel)

4.use URIs to denote things, so that people can point at your stuff.

5.link your data to other data to provide context.

5stardata.info by Tim Berners-Lee, the inventor of the Web.

Page 10: Creating Open Data with Open Source (beta2)

***** Five Star - Open Data

1.make your stuff available on the Web (whatever format) under an open license.

2.make it available as structured data (e.g., Excel instead of image scan of a table)

3.use non-proprietary formats (e.g., CSV instead of Excel)

4.use URIs to denote things, so that people can point at your stuff.

5.link your data to other data to provide context.

5stardata.info by Tim Berners-Lee, the inventor of the Web.

Page 11: Creating Open Data with Open Source (beta2)

Open Data from HK Government ?

● 2 Use Cases of Data:– Legco Meeting Minutes and Voting Results.– Weather at Data.One.

Page 12: Creating Open Data with Open Source (beta2)

Legco Meeting Minutes and Voting Results

Page 13: Creating Open Data with Open Source (beta2)

Legco Meeting Minutes and Voting Results

Page 14: Creating Open Data with Open Source (beta2)

Legco Meeting Minutes and Voting Results

● All legco voting results are scanned and released in PDF, it is only possible to retrieve voting results manually.

● In recent years, it seems scanned minutes from sheets scanned are replaced by minutes converted from original computer document files.

Page 15: Creating Open Data with Open Source (beta2)

Improving Legco Vote Result Data ?

● Legcovotes.net is created by Hong Kong netitizens(?).

● Only 20 famous vote results are included.● It is possible to let public to input other vote

results by hand, and submissions should be verified by legcovotes.net authoritative.

● Including other data, eg. Minutes in plain text or paragraphs related to a counciler.

Page 16: Creating Open Data with Open Source (beta2)

Weather at Data.One

● My Chinese Blog Post 「香港政府機構開放資料 Open Data 情況」 on 2013/1/17.

● Data.One released on 2011/3/31.● Weather at Data.One provides 7 dataset URLs,

returns RSS (XML) format (Eng/TChi/SChi)– One word: Useless.– Data.One dataset (RSS) is completely different

with HKO own paid service (XML).

Page 17: Creating Open Data with Open Source (beta2)

Weather at Data.One

● Example - Current local weather report: ● Plain text report in RSS.● Difference to quote report content:

– Website: a pair of HTML tags, eg. <PRE>....</PRE>.– Data.One: a pair of RSS description tags,

<description>....</description>.

● Other weather data is missing, eg. Regional temperture updates per each 12 mins.

Page 18: Creating Open Data with Open Source (beta2)

Weather at Data.One

● Weather at Data.One is 'report' but not 'data'.● Weather RSS is already released by HKO

before launch of Data.One.● Technically, json/xml format is better

readable by computer programs.

Page 19: Creating Open Data with Open Source (beta2)

Oversea Open Data Project Examples

● Toronto: – City Data: http://map.toronto.ca/wellbeing/ – Transportation: http://www.rocketradar.net/ – Pollution: http://www.emitter.ca/

● US & Canada:– https://www.crimereports.com/

Page 20: Creating Open Data with Open Source (beta2)

Use of Open Source Software in Web Crawling

● Use Open Source Tools to collect useful and meaningful machine-readable data.

● Doesn't need to wait provider to release data in machine-readable format.

Page 21: Creating Open Data with Open Source (beta2)

Open Source Tools

● Python programming lanugage● with Regular Expression library● Scrapy web crawling framework

Page 22: Creating Open Data with Open Source (beta2)

Why python + scrapy ?

● python: my current favourite programming language for few years.

● scrapy: web crawling framework written in Python.

Page 23: Creating Open Data with Open Source (beta2)

Scrapy

● scrapy: web crawling framework written in Python.

● HtmlXPathSelector● Output: built-in JSON, CSV, XML.● Python: import re

Page 24: Creating Open Data with Open Source (beta2)

My Products

● WeatherHK ← ← ← ● TCTrack

Page 25: Creating Open Data with Open Source (beta2)

WeatherHK● http://twitter.com/weatherhk● hourly current weather report● weather forecast report● tropical signal warning

Page 26: Creating Open Data with Open Source (beta2)

WeatherHK

● Backend: Python + Scrapy + Database + Twitter + NNTP......

● Frontend: Twitter + Newsgroup

Page 27: Creating Open Data with Open Source (beta2)

WeatherHK

● http://twitter.com/weatherhk● Interview by MetroPop in 2009.

Page 28: Creating Open Data with Open Source (beta2)

My Products

● WeatherHK● TCTrack ← ← ←

Page 29: Creating Open Data with Open Source (beta2)

TCTrack

● http://sammy.hk/projects/tctrack/tctrack.php● Plot TC current and forecast tracks over

Google Map.● Source:

– JTWC– HKO

Page 30: Creating Open Data with Open Source (beta2)

TCTrack

● http://sammy.hk/projects/tctrack/tctrack.php● Probably first tctrack map in HK using

GoogleMap● Use of GMap: TCTrack -> Weather

Underground Hong Kong -> HKO

Page 31: Creating Open Data with Open Source (beta2)

TCTrack

● http://twitter.com/tctrack● Tweet JTWC updates for Northwest Pacific.

Page 32: Creating Open Data with Open Source (beta2)

Starting new Open Source projects to create Open Data

● Develop a open source project.● Release data in standard machine-readable

data format.

Page 33: Creating Open Data with Open Source (beta2)

Open Source Project Examples

● Hk0weather● My weather related open source project.

Page 34: Creating Open Data with Open Source (beta2)

hk0weather

● https://github.com/sammyfung/hk0weather● Open Source Hong Kong Weather Project.● convert to JSON data from HKO webpages.● python + scrapy● 1st version: from current weather report,

extracting temperture and humidity from 20+ weather stations, export in json format.

Page 35: Creating Open Data with Open Source (beta2)

hk0weather

● https://github.com/sammyfung/hk0weather● $ virtualenv hk0weatherenv● $ source hk0weatherenv/bin/activate● $ pip install scrapy● $ git clone

https://github.com/sammyfung/hk0weather.git● $ cd hk0weather● $ scrapy crawl currwx -t json -o testresult

Page 36: Creating Open Data with Open Source (beta2)

hk0weather[{"humidity": 80, "station": "hko", "temperture": 17, "time": 1360785720},{"station": "kingspark", "temperture": 16, "time": 1360785720},{"station": "wongchukhang", "temperture": 17, "time": 1360785720},{"station": "takwuling", "temperture": 16, "time": 1360785720},{"station": "laufaushan", "temperture": 15, "time": 1360785720},{"station": "taipo", "temperture": 16, "time": 1360785720},{"station": "shatin", "temperture": 17, "time": 1360785720},{"station": "tuenmun", "temperture": 17, "time": 1360785720},{"station": "tseungkwano", "temperture": 16, "time": 1360785720},{"station": "saikung", "temperture": 16, "time": 1360785720},{"station": "cheungchau", "temperture": 17, "time": 1360785720},{"station": "cheungchau", "temperture": 17, "time": 1360785720},

{"station": "tsingyi", "temperture": 17, "time": 1360785720},

{"station": "shekkong", "temperture": 15, "time": 1360785720},

{"station": "tsuenwanhokoon", "temperture": 15, "time": 1360785720},

{"station": "tsuenwanshingmunvalley", "temperture": 17, "time": 1360785720},

{"station": "hongkongpark", "temperture": 17, "time": 1360785720},

{"station": "shaukeiwan", "temperture": 16, "time": 1360785720},

{"station": "kowlooncity", "temperture": 16, "time": 1360785720},

{"station": "happyvalley", "temperture": 18, "time": 1360785720},

{"station": "wongtaisin", "temperture": 17, "time": 1360785720},

{"station": "stanley", "temperture": 16, "time": 1360785720},

{"station": "kwuntong", "temperture": 15, "time": 1360785720},

{"station": "shamshuipo", "temperture": 17, "time": 1360785720}]

Page 37: Creating Open Data with Open Source (beta2)

Items.py

class Hk0WeatherItem(Item):

time = Field()

station = Field()

temperture = Field()

humidity = Field()

Page 38: Creating Open Data with Open Source (beta2)

Currwx.py

start_urls = (

'http://www.weather.gov.hk/wxinfo/currwx/currentc.htm',

)

Page 39: Creating Open Data with Open Source (beta2)

Currwx.py

def parse(self, response):

laststation = ''

temperture = int()

stations = []

hxs = HtmlXPathSelector(response)

report = hxs.select('//div[@id="ming"]')

Page 40: Creating Open Data with Open Source (beta2)

libhk0

class hk0:

stations = [

(u' 天 文 台 ', 'hko'),

(u' 京 士 柏 ', 'kingspark'),

(u' 黃 竹 坑 ', 'wongchukhang'),

(u' 打 鼓 嶺 ', 'takwuling'),

(u' 流 浮 山 ', 'laufaushan'),

Page 41: Creating Open Data with Open Source (beta2)

libhk0

class hk0:

def gettime(self, report):

def hk0current(self, report):

Page 42: Creating Open Data with Open Source (beta2)

hk0weather

● Future Planning:● Add more weather reports.● Getting ideas and/or cooperate with 'pro'

Weather hobbists.● Remarks:● Development of hk0weather is started from

ZERO, its code is different than my twitter @weatherhk.

Page 43: Creating Open Data with Open Source (beta2)

Challenge

● Challenge on first day of hk0weather release.● Director of a mobile app developer company

told me by leaving a Facebook comment.– HKO provides data in pretty XML format with their 

annual service plan for commerical companies.– He think that ***MAYBE*** HKO would provide XML 

to you ***without*** any charges if I asked.● Remark: This is an assumption only, not listed on HKO 

website.

Page 44: Creating Open Data with Open Source (beta2)

Challenge

● I replied the following to him after googling for HKO XML schema.– HKO didn't mention 'free of charge service' of XML data feed on

website.– I registered and got authorization from HKO to re-distribute their

weather information for non-profit making. And I received some emails from HKO for any updates of website and HTML structure, but never mention about XML data feed service.

– Weather data available on HKO XML data feed is still fewer than its HTML website.

● So, this challenge is FAIL! XD

Page 45: Creating Open Data with Open Source (beta2)

Open Data Project Examples

● Open Government initiative from HKU JMSC.● http://opengov.jmsc.hku.hk/ ● https://github.com/jmschku

Page 46: Creating Open Data with Open Source (beta2)

Agenda

● What is Open Data ?● Use of Open Source Software in web crawling.● Starting new Open Source projects to create

Open Data.

Page 47: Creating Open Data with Open Source (beta2)

Thank You!sammy.hk