Pakistan Census Data – Case Study

Posted on 26-Jun-2015

Category:

Data & Analytics

DESCRIPTION

A case study in collecting Pakistan census data so that it can be distributed robustly and made more widely available. Using this particular case, the deck discusses the problems faced when accessing public data in general.

TRANSCRIPT

Pakistan Census Data

Collection for a Robust Distribution

Case Study

Population

187,418,849 est.

3rd August 2014

Source: census.gov.pk

Objectives

● Data availability

● Open data

● Transparency

● Robust access

● Widely accessible formats

Sources

● Population Census Organization (census.gov.pk)

● World Bank (data.worldbank.org)

● ReliefWeb (reliefweb.int)

● USAID (usaid.gov)

Best Source

● Population Census Organization (census.gov.pk)

Detailed data exists, but not in reusable or widely accessible formats. In fact, the website itself is unavailable most of the time.

● World Bank (data.worldbank.org)

● ReliefWeb (reliefweb.int)

● USAID (usaid.gov)

Data is available in several accessible formats, but it is brief, limited and directed.

Problems

● No downloadable data format available

● Website inaccessible for most of any 24-hour period

● No semantic management for available data

● No easy way to access the data programmatically

Collection Methodology

● Start with the 1998 census data¹.

● Data is available for each district.

● Each district's data is accessible² as an HTML page.

● Patience!

1. Who am I kidding?! That is the only census data available.

2. Only when the website is available and accessible.

First idea; Last idea

● Scrape the website

● Scrape the files on the website

Scrape, Convert, Save. Easy peasy!

PHP Library – Simple HTML DOM

Project website: http://sourceforge.net/projects/simplehtmldom/

Easier said than done!

1. Server non-responsive to script calls.

2. Server unavailable after script comes across an error.

3. Ridiculous latency. (Patience methodology applies here.)

4. Non-semantic data, e.g. some districts have extra information columns, which caused an error and led back to #2.

5. HTML files were literally saved from Microsoft Office!!

Yes, Save as… (Web page)
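Point 4 above (districts with extra information columns) is the kind of irregularity a scraper can absorb instead of failing on. The following is only a minimal sketch, written in Python rather than the PHP used in the deck. The function name `rows_to_records`, the `extra_N` key scheme and the sample values are all hypothetical and not taken from the original project.

```python
from typing import Dict, List


def rows_to_records(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Map each row onto the header, tolerating rows that are longer or shorter."""
    records = []
    for row in rows:
        record: Dict[str, str] = {}
        for index, cell in enumerate(row):
            # Cells beyond the known header get a placeholder key (assumed naming)
            # instead of raising an error.
            if index < len(header):
                key = header[index]
            else:
                key = f"extra_{index - len(header) + 1}"
            record[key] = cell.strip()
        # Columns missing from a short row are filled with an empty string.
        for name in header[len(row):]:
            record[name] = ""
        records.append(record)
    return records


if __name__ == "__main__":
    # Made-up placeholder rows, not real census figures.
    sample = [["District A", "100"], ["District B", "200", "some note"]]
    for rec in rows_to_records(["district", "population"], sample):
        print(rec)
```

Keeping surplus cells under placeholder keys means nothing is silently dropped, so the irregular districts can be reviewed and restructured later.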

Problems:

● No looping through the data files (server timeout).

● Even with a delay, any error resulted in a long server timeout.

Solution:

● Run the script manually for each file, one by one. The process goes as follows:

Scraping, finally

PHP → file_get_contents → Simple HTML DOM → JSON → Save file
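The same fetch, parse, convert and save flow can be sketched as follows. This is not the deck's PHP script: it is an illustrative Python equivalent using only the standard library, with `urllib.request` standing in for `file_get_contents` and `html.parser` standing in for Simple HTML DOM. The names `TableParser`, `fetch_html`, `html_to_rows` and `save_json` are hypothetical, and the UTF-8 decoding is an assumption (Office-exported pages may use another encoding).

```python
import json
import urllib.request
from html.parser import HTMLParser
from typing import List


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: List[str] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <p>, <span>) still belongs to the cell.
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            # Collapse runs of whitespace left behind by the exported markup.
            self._row.append(" ".join("".join(self._cell).split()))
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download one page; the timeout keeps a stalled server from hanging the run."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")  # encoding assumed


def html_to_rows(html: str) -> List[List[str]]:
    """Parse an HTML string and return its table rows as lists of cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def save_json(rows: List[List[str]], path: str) -> None:
    """Write the extracted rows to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle, ensure_ascii=False, indent=2)
```

Processing one file per run, as the deck describes, would then amount to calling `save_json(html_to_rows(fetch_html(url)), path)` once per district page, with the URL and output path supplied by hand.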

Received Data - Final Version

Restructured Data - Final Version

Further Steps; Making Data Useful

1. Go for original objectives of this whole process.

2. Restructure all data into a standard format.

3. Acquire missing data.

4. Make it all available for public use.
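As one possible reading of step 2, the scraped records could be flattened into CSV, a format most tools can open. The deck does not say which standard format was chosen, so CSV here is purely an assumption, as are the function name `records_to_csv` and the sample field names. The sketch is in Python, not the PHP used for the scraping.

```python
import csv
import io
from typing import Dict, List


def records_to_csv(records: List[Dict[str, str]]) -> str:
    """Serialize records to CSV text, using the union of all keys as columns."""
    # Build the column list in first-seen order so irregular records still fit.
    fieldnames: List[str] = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    # Keys missing from a record are written as empty cells (restval).
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder records, not real census figures.
    print(records_to_csv([
        {"district": "District A", "population": "100"},
        {"district": "District B", "area": "50"},
    ]))
```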

Get it, share it or contribute to it at git.io/pk-census

Thank you!

@jabranr | hello@jabran.me

http://git.io/pk-census
