Pakistan Census Data
Collection for a Robust Distribution
Case Study
Population
187,418,849 est.
3rd August 2014
Source: census.gov.pk
Objectives
● Data availability
● Open data
● Transparency
● Robust access
● Widely accessible formats
Sources
● Population Census Organization (census.gov.pk)
● World Bank (data.worldbank.org)
● ReliefWeb (reliefweb.int)
● USAID (usaid.gov)
Best Source
● Population Census Organization (census.gov.pk)
Detailed data exists but is not available in reusable, widely accessible formats. In fact, the website itself is unavailable most of the time.
● World Bank (data.worldbank.org)
● ReliefWeb (reliefweb.int)
● USAID (usaid.gov)
Data is available in several accessible formats, but it is brief, limited and directed.
Problems
● No downloadable data format available
● Website inaccessible for most of the day
● No semantic management for available data
● No easy way to access the data programmatically
Collection Methodology
● Start with 1998 census data [1].
● Data is available for each district.
● Each district's data is accessible [2] as an HTML page.
● Patience!
[1] Who am I kidding?! That is the only census data available.
[2] Only when the website is available and accessible.
First idea; Last idea
● Scrape the website
● Scrape the files on the website
Scrape, Convert, Save. Easy Peasy!
PHP Library – Simple HTML DOM
Project website: http://sourceforge.net/projects/simplehtmldom/
Easier said than done!
1. Server non-responsive to script calls.
2. Server unavailable after the script comes across an error.
3. Ridiculous latency. (Patience methodology applies here.)
4. Non-semantic data, e.g. some districts have extra information columns; as a result, the script returns an error and we are back at #2.
5. HTML files were literally saved from Microsoft Office!! Yes, Save as… (Web page)
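Problem 4 above (some districts carrying extra columns) is the kind of thing a tolerant row mapper can absorb instead of crashing on. The following is a minimal sketch in Python rather than the PHP used for the project. The function name `normalize_row` and the `_extra` key are hypothetical, chosen only for illustration.

```python
def normalize_row(header, row):
    """Map one table row onto the expected header without failing on odd widths.

    `header` is the list of expected column names; `row` is the list of cell
    strings scraped from one <tr>. Both names are illustrative assumptions.
    """
    # Start with every expected column present, so short rows do not raise.
    record = {name: None for name in header}

    # Fill in the cells that line up with a known column.
    for name, value in zip(header, row):
        record[name] = value.strip() or None

    # Keep surplus cells instead of erroring out; they can be reviewed later.
    extras = [v.strip() for v in row[len(header):] if v.strip()]
    if extras:
        record["_extra"] = extras
    return record
```

A row that is too short simply leaves the missing columns as `None`, and a row that is too long keeps its surplus cells under `_extra`, so one odd district page need not stop a run.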
Problems:
● No looping through data files (Server timeout).
● Even with a delay, if an error occurs, it triggers a long server timeout.
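The looping problem can be made concrete with a small sketch. It is written in Python, not the project's PHP, and it is not the script that was actually used. It assumes a list of page URLs, a per-request timeout, and a growing pause after each failure. The names `fetch_page` and `fetch_all`, the default values, and the UTF-8 decoding are all assumptions.

```python
import time
import urllib.request


def fetch_page(url, timeout=30):
    """Download one page as text. The encoding is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_page, delay=5.0, retries=2, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    Returns (pages, failed): the pages that were downloaded, keyed by URL,
    and the URLs that still failed after every retry.
    """
    pages, failed = {}, []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                pages[url] = fetch(url)
                break
            except OSError:
                # URLError and socket timeouts are OSError subclasses.
                # Wait longer after each failed attempt before retrying.
                sleep(delay * (attempt + 1))
        else:
            # Every attempt failed; remember the URL and move on.
            failed.append(url)
        sleep(delay)  # be gentle with a fragile server
    return pages, failed
```

Whatever ends up in `failed` is exactly the set of files that would then have to be rerun by hand, one at a time.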
Solution:
● Manually run the script for each file, one by one. The process goes as follows:
Scraping, finally
PHP → file_get_contents → Simple HTML DOM → JSON → Save file
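The same fetch, parse, JSON, save chain can be sketched for a single page. The project itself used PHP with Simple HTML DOM. The version below swaps in Python's standard `html.parser` purely for illustration. The class `TableRows`, the functions `html_to_json` and `save_json`, and the `SAMPLE` table are made up, and the sketch assumes the first table row holds the column names.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse the stray whitespace that Office-style markup leaves.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def html_to_json(html):
    """Turn the table rows in `html` into a JSON array of objects."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return "[]"
    header, *body = parser.rows  # assumption: first row is the header
    records = [dict(zip(header, row)) for row in body]
    return json.dumps(records, indent=2, ensure_ascii=False)


def save_json(html, path):
    """Write the converted page to `path` as UTF-8 JSON."""
    Path(path).write_text(html_to_json(html), encoding="utf-8")


# Made-up sample, not real census data.
SAMPLE = """
<table>
  <tr><th>Area</th><th>Population</th></tr>
  <tr><td><p>Example&nbsp;District</p></td><td>1,234</td></tr>
</table>
"""
```

Calling `save_json(SAMPLE, "district.json")` would write one JSON file for one page, which mirrors the one-file-per-run approach described above.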
Received Data - Final Version
Restructured Data - Final Version
Further Steps; Making Data Useful
1. Return to the original objectives of this whole process.
2. Restructure all data into a standard format.
3. Acquire missing data.
4. Make it all available for public use.
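As one possible take on steps 2 and 4, the saved JSON records could be flattened into CSV, a format most tools open directly. This Python sketch is an assumption about how that might look, not part of the project. The function names `parse_count` and `records_to_csv` and the field names in the usage note are hypothetical.

```python
import csv
import io


def parse_count(text):
    """Convert a scraped figure such as "1,234" into an integer."""
    return int(text.replace(",", "").strip())


def records_to_csv(records):
    """Serialize a list of dicts to CSV text with one shared set of columns."""
    # Use the union of all keys, in first-seen order, so no column is dropped
    # when some districts carry fields that others lack.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)

    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing fields are written as empty cells
    return buf.getvalue()
```

For example, `records_to_csv([{"district": "A", "population": parse_count("1,234")}])` yields a two-line CSV with a header row and one data row. Districts that lack a field simply get an empty cell in that column.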
Get it, share it or contribute to it at git.io/pk-census