Upload: swee-meng-ng

Post on 16-Apr-2017

Data Prospecting

a.k.a. adventures in the first year of Sinar Project

Who am I?

Software developer at OnApp by day

Code monkey at Sinar Project by night

Who is Sinar Project?

We are a group of concerned citizens who decided to use technology to make government processes more transparent. We are also interested in open data and understand what open data can do.

What we are trying to do

We try to use technology to create a transparent government and, with the help of the civil society groups we collaborate with, to get citizens involved in the process.

We are also interested in bringing the data to software developers, to be used to build apps or anything related.

Why?

Because:
- It helps reduce corruption.
- We can use government data to do many things, such as providing data for apps.

A good goal, except:
- The government doesn't have an open data policy.
- Nor do we have a freedom of information act.
- Some data is well hidden (not many people know about it).
- Some is incomplete.
- Much doesn't exist.

We go to the field

API

Scraping

Crowdsourcing

To jump-start this, we began with the following approaches.

API

- Let's start with APIs. APIs should be familiar to most.
- There are not many APIs usable for Sinar's purposes; some are noisy, others are in free text (hard to parse).
- Maps are somewhat of an exception, but we still lack a lot of the information needed for many projects, such as boundaries.
- Business information on maps exists, but licensing is a concern, for example when reusing Foursquare data or reusing the Google geocoding API outside of a map. There are clauses against this.
- The World Bank is the true exception: an excellent data source with a permissive licence.

If there is no API...

- Since the government doesn't have APIs, we are going to scrape. You will be surprised what kind of information is available on websites.
- Of the data that exists, much is incomplete, but some can be used as a seed for a bigger project.

For example:
* Parliament
  MPs: http://www.parlimen.gov.my/index.php?modload=ahlidewan&uweb=dr
  Bills: http://www.parlimen.gov.my/index.php?modload=document&uweb=dr&doc=bills
* AG Chambers
  Some court cases: http://www.agc.gov.my/index.php?option=com_content&view=article&id=175&Itemid=63&lang=en
  Gazette: http://www.federalgazette.agc.gov.my/
* Ministry of Health
  Medical devices: http://www.mdb.gov.my/mdb/index.php?option=com_content&task=view&id=20&Itemid=65
  Medicine prices: http://www.pharmacy.gov.my/index.cfm?&menuid=154&parentid=163&lang=EN

So a scraper we make

- A scraper is a script that extracts data from a web page and converts it into a structured format.
- It can be written in practically any programming language, and the output stored as a file or in a database.
- Most of our scrapers use Python, simply because it is a language we are comfortable with.

Shown on the slide is our MP scraper, one of our early ones.
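To make the idea of "extract data from a web page and convert it into a structured format" concrete, here is a minimal sketch in Python. It is not the actual Sinar MP scraper: the sample HTML, the column names, and the TableScraper and scrape_table names are all made up for illustration, and a real scraper would download the page first instead of using an inline string.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would fetch the HTML from a website.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Constituency</th></tr>
  <tr><td>Example MP One</td><td>Example Seat A</td></tr>
  <tr><td>Example MP Two</td><td>Example Seat B</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
print(records)
```

Each record is a plain dictionary, so it can be saved to a file or inserted into a database without further conversion.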

Scraperwiki

- Open data is one of our goals, so the data needs to be shared outside the group.
- ScraperWiki is one solution.
- It is free, provides storage, hosts the scraper, and schedules jobs to run it.
- Many open data projects use it.

Output

- Scraper output can be in JSON or CSV.
- The first MP and CIDB datasets are in CSV form.
- We also use a database; Billwatcher is an example. Billwatcher also uses Elasticsearch for search.
- One of our earlier scrapers is at https://scraperwiki.com/scrapers/malaysian_mp_profile/
- The data can be downloaded from that link.
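As a rough illustration of the two file formats mentioned, the sketch below serializes the same records to JSON and to CSV using only the Python standard library. The records and field names are invented for the example and do not come from the real MP or CIDB datasets.

```python
import csv
import io
import json

# Hypothetical scraped records; field names are assumptions for illustration.
records = [
    {"name": "Example MP One", "constituency": "Example Seat A"},
    {"name": "Example MP Two", "constituency": "Example Seat B"},
]


def to_json(rows):
    """Serialize the records as a JSON array."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_csv(rows):
    """Serialize the records as CSV, using the first record's keys as header."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(records)
csv_text = to_csv(records)
print(json_text)
print(csv_text)
```

JSON keeps nested structure if the records have any, while CSV is easier to open in a spreadsheet; which one fits depends on who will consume the data.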

There is this little problem...

Scraping can only get us so far.
- The data can be incomplete.
- Most of the time, though, the data is simply not available; crime data is one example.
- Sometimes the data exists but is in a hard-to-process format: PDF, Excel, video.
- Some data is scattered around; the MyMP data is like this.

Not very easy to write a scraper for this.

That's where the cavalry comes in

That is when we ask for help.
- People can do this much better than a computer.
- The bonus of asking for help is that we get people with real experience working on a problem, especially when we approach civil society groups working on an issue.

Our first experiment in asking for help was MyMP.

MyMP

- MyMP is a project in collaboration with Undimsia.
- It can be found at http://reps.sinarproject.org/
- We are collecting MP information for voter education.
- A big part of the information comes from interviews and internet searches. The site is powered by Plone, a CMS.

Crowd Computing!

We managed to get quite a lot of MP information out, so technology is not the issue.

This little problem

Lack of information

Fatigue

- Lack of information, however, is a big issue. In this case, MPs are not approachable, there is no information online, and so on.
- When it gets too hard, volunteers tend to leave.
- We realized that this is a serious research task, the kind that people pay researchers to do.
- This is still going on, though a bit slower.

Outside of crowdsourcing

Buy

Ask

Other methods to get data:
- Some information can be bought; SSM is a good example. It is not scalable if paid out of our own pockets.
- We can try to ask; we know some initiatives have been successful in asking. But we are a very small group.
- NGOs might have data somewhere, though, which is why we are trying to collaborate with more groups on this.

Worst Case

It just means the data ends up in a black hole, or simply doesn't exist.

After data is gathered

After data gathering is completed:
- We need to process the data a bit.
- For example, the CIDB dataset on the screen is a list of documents, which is harder to process than, say, flat JSON or CSV.
- In fact, we are putting it into Google Fusion Tables, so it is nicer to flatten it.
- This can be done in a few ways; we have a script for it.

Result of Processed Data

This is our CIDB data in Google Fusion Tables, showing the CSV content generated by processing the JSON shown previously. The script that generates the CSV is at https://github.com/Sinar/cidb_json2db/blob/master/json2db.php

Written in PHP, it converts the field names, takes the JSON, and splits it into separate CSV files.
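The flattening step can be sketched roughly as follows. This is a Python illustration of the general idea (splitting nested documents into separate flat tables linked by an ID), not a port of json2db.php. The document layout, the field names, and the flatten and rows_to_csv helpers are assumptions made up for the example.

```python
import csv
import io

# Hypothetical nested documents, loosely modelled on "a list of documents".
documents = [
    {
        "id": "C001",
        "name": "Example Builder Sdn Bhd",
        "directors": [{"name": "Director A"}, {"name": "Director B"}],
        "projects": [{"title": "Example Road Works", "value": 100000}],
    },
    {
        "id": "C002",
        "name": "Example Contractor Enterprise",
        "directors": [{"name": "Director C"}],
        "projects": [],
    },
]


def flatten(docs):
    """Split nested documents into three flat tables keyed by company_id."""
    tables = {"company": [], "director": [], "project": []}
    for doc in docs:
        tables["company"].append({"company_id": doc["id"], "name": doc["name"]})
        for director in doc.get("directors", []):
            tables["director"].append({"company_id": doc["id"], **director})
        for project in doc.get("projects", []):
            tables["project"].append({"company_id": doc["id"], **project})
    return tables


def rows_to_csv(rows):
    """Render one flat table as CSV text; an empty table gives an empty string."""
    if not rows:
        return ""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tables = flatten(documents)
for table_name, rows in tables.items():
    print(table_name)
    print(rows_to_csv(rows))
```

Each resulting table has one row per item and repeats the company ID, which is the shape a spreadsheet-like tool expects.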

How we use it

In the end, we can feed this into an application. For example, this is our CIDB data in Fusion Tables, with Fusion Tables doing its magic.
- Project Dataset: https://www.google.com/fusiontables/DataSource?docid=1nTiuWSBXqvqphUj9l5axW496WJiFa51Uhw18T7g
- Director Dataset: https://www.google.com/fusiontables/DataSource?docid=10WxkMewqZS7i67Qg-Hyknwx2_UdTKjnVqU9sgzA
- Company Dataset: https://www.google.com/fusiontables/data?docid=1D4uCH96DRabvOIkUTaAEVxNKvpoIcbQCFkf4OaQ

Even direct to app

Or we can build a new application from the data. Billwatcher is built on the bill dataset we scraped.

http://billwatcher.sinarproject.org/
https://github.com/sinar/Malaysian-Bill-Watcher

Then we can use with

We encourage people to use the tools of their choice to make use of the data.

What next?

Maintain the existing projects

Add more datasets

Engage with other civil society groups

Engage volunteers, though we can be selective about whom

Find funding (we are working on it!)

Want to help?

Before this

Get to know the groups involved.

Join meetups

Understand the issues at hand

It helps a lot.

Groups like Undimsia have been working on these issues for some time: Undimsia is involved in voter education, Transparency International in corruption, and so on.

Join the meetups, get involved, and understand how they work. What we learned is that tech is not everything, but tech can help them a lot. First understand these groups, though; don't just push tech because it is cool. Their events can be fun.

http://www.undimsia.com/
http://www.loyarburok.com/

Now you can start

- We need Malaysian contributions to OpenSpending, a project that keeps track of government projects.
- It is pretty easy but tedious: you need to read the budget and enter it into a Google spreadsheet or produce a CSV.
- openspending.org has a guide at http://openspending.org/help/index.html

Fix My Street + Crime Dataset

- We need a FixMyStreet-style project to look at issues on the street.
- It is easy to start now: use Crowdmap, a hosted Ushahidi instance, which is well known in the open data community.
- The same project can be used to track crime.
- The image is from a Crowdmap project.
- It is recommended because it has a proper API, which allows reuse.

Crowdmap is at https://crowdmap.com/
The example project is at https://klatm.crowdmap.com/

Contribute a scraper

Write a scraper and get the data released.

Fork our code and add features

- Fork our code and add features.
- All our projects are open source, and we try to be clear about licensing.
- We do tend to be biased toward Python, Rails, and Plone.
- Our focus is maintenance now, so we are reluctant to add new apps.
- But if you are willing to maintain one, join us!

In fact, Billwatcher has a few enhancements that came from volunteers; for example, the model code was fixed by a volunteer.

Thanks for listening

Find us [email protected]

That's all from me. Q&A is at the end of the webcamp; find us at sinarproject.org or [email protected]
