Transcript
Page 1: Pakistan Census Data – Case Study

Pakistan Census Data

Collection for a Robust Distribution

Case Study

Page 2: Pakistan Census Data – Case Study

Population

187,418,849 est.

3rd August 2014

Source: census.gov.pk

Page 3: Pakistan Census Data – Case Study

Objectives

● Data availability

● Open data

● Transparency

● Robust access

● Widely accessible formats

Page 4: Pakistan Census Data – Case Study

Sources

● Population Census Organization (census.gov.pk)

● World Bank (data.worldbank.org)

● ReliefWeb (reliefweb.int)

● USAID (usaid.gov)

Page 5: Pakistan Census Data – Case Study

Best Source

● Population Census Organization (census.gov.pk)

Detailed data exists, but not in reusable or widely accessible formats. In fact, the website itself is unavailable most of the time.

● World Bank (data.worldbank.org)

● ReliefWeb (reliefweb.int)

● USAID (usaid.gov)

Data is available in several accessible formats, but it is brief, limited and narrowly focused.

Page 6: Pakistan Census Data – Case Study

Problems

● No downloadable data format available

● Website inaccessible for most of the day

● No semantic management for available data

● No easy way to access the data programmatically

Page 7: Pakistan Census Data – Case Study

Collection Methodology

● Start with the 1998 census data [1].

● Data is available for each district.

● Each district's data is accessible [2] as an HTML page.

● Patience!

[1] Who am I kidding?! That is the only census data available.

[2] Only when the website is available and accessible.

Page 8: Pakistan Census Data – Case Study

First idea; Last idea

● Scrape the website

● Scrape the files on the website

Page 9: Pakistan Census Data – Case Study

Scrape, Convert, Save. Easy Peasy!

PHP Library – Simple HTML DOM

Project website: http://sourceforge.net/projects/simplehtmldom/

Page 10: Pakistan Census Data – Case Study

Easier said than done!

1. Server non-responsive to script calls.

2. Server unavailable after script comes across an error.

3. Ridiculous latency. (Patience methodology applies here.)

4. Non-semantic data, e.g. some districts have extra information columns, which made the script return an error and led straight back to #2.

5. HTML files were literally saved from Microsoft Office!!
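Item 4 is the kind of failure a tolerant row mapper can absorb. The following is a minimal sketch in Python, not the PHP script the slides refer to. It assumes each scraped table arrives as a list of rows whose first row is the header; the function name `rows_to_records` and the `"extra"` key are hypothetical, chosen only for illustration.

```python
def rows_to_records(rows):
    """Turn [header, row, row, ...] into dicts keyed by column name.

    Assumption: the first row holds the column names.
    Rows shorter than the header get None for the missing cells.
    Cells beyond the header are kept under an "extra" key instead of
    raising an error, so one odd district page does not stop the run.
    """
    if not rows:
        return []
    header, *body = rows
    records = []
    for row in body:
        record = {
            name: (row[i] if i < len(row) else None)
            for i, name in enumerate(header)
        }
        extra = row[len(header):]
        if extra:
            record["extra"] = extra
        records.append(record)
    return records
```

Keeping the surplus cells rather than discarding them means the odd districts can still be inspected and restructured later by hand.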

Page 11: Pakistan Census Data – Case Study

Yes, Save as… (Web page)

Page 12: Pakistan Census Data – Case Study

Problems:

● No looping through data files (Server timeout).

● Even with a delay between requests, any error triggers a long server timeout.

Solution:

● Manually run the script for each file, one by one. The process goes as follows:

Scraping, finally

PHP → file_get_contents → Simple HTML DOM → JSON → Save file
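The flow above (fetch one page, parse its table, encode as JSON, save) can be sketched as follows. This is a Python stand-in using only the standard library, not the author's PHP and Simple HTML DOM script. The names `TableRows`, `html_to_rows` and `scrape_one` are hypothetical, and the sketch assumes the data sits in ordinary `<tr>`/`<td>` table markup and that pages decode as UTF-8.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse the stray whitespace typical of exported HTML.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row, self._cell = None, None


def html_to_rows(html):
    """Parse an HTML string and return its table rows as lists of text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def scrape_one(url, out_path, timeout=60):
    """Fetch one page, extract its rows, and save them as a JSON file."""
    with urlopen(url, timeout=timeout) as resp:
        # Assumption: UTF-8; undecodable bytes are replaced, not fatal.
        html = resp.read().decode("utf-8", errors="replace")
    rows = html_to_rows(html)
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh, ensure_ascii=False, indent=2)
    return rows
```

Calling `scrape_one` once per district page, by hand, mirrors the one-file-at-a-time approach described above: a failure affects only the file in progress, and files already saved need not be fetched again.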

Page 13: Pakistan Census Data – Case Study

Received Data - Final Version

Page 14: Pakistan Census Data – Case Study

Restructured Data - Final Version

Page 15: Pakistan Census Data – Case Study

Further Steps; Making Data Useful

1. Pursue the original objectives of this whole process.

2. Restructure all data into a standard format.

3. Acquire missing data.

4. Make it all available for public use.
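As one possible reading of step 2, the per-district records could be flattened into a single CSV table. The sketch below is in Python and is only an illustration: the input shape (a dict mapping each district name to a list of record dicts) and the function name `districts_to_csv` are assumptions, not the format actually used in the project.

```python
import csv
import io


def districts_to_csv(districts):
    """Flatten {district: [record, ...]} into one CSV string.

    The header is "district" followed by every column name seen in any
    record, in first-seen order. Missing values are written as empty cells.
    """
    columns = []
    for records in districts.values():
        for record in records:
            for key in record:
                if key != "district" and key not in columns:
                    columns.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["district", *columns],
        restval="",
        lineterminator="\n",
    )
    writer.writeheader()
    for name, records in districts.items():
        for record in records:
            writer.writerow({**record, "district": name})
    return out.getvalue()
```

A single flat table is only one candidate for a "standard format", but it opens in any spreadsheet and is easy to load programmatically, which fits the stated objective of widely accessible formats.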

Get it, share it or contribute to it at git.io/pk-census

