
Open Data Cook Book Recipes (Draft 0.1): Scraper Wiki

Recipes: Scraper Wiki

ScraperWiki is a service that helps you to gather data from websites that do not provide it as raw data. ScraperWiki provides a programming environment where you can write and share a scraper from your browser. ScraperWiki will run your scraper for you once a day, and will make the results available to download and, through Application Programming Interfaces (APIs), for other web programs to use as well.

You will need:
• An account at www.scraperwiki.com (free)
• Some programming experience
• A website with structured information on it that you want to scrape

1) Explore the structure of the website you are planning to scrape

In this example I'm looking at the location of Garages to Rent in Oxford City. First, viewing the page, I check that the elements I want to scrape are presented fairly uniformly (e.g. there is always the same title for the same thing), as lots of variation in the way similar things are presented makes for difficult scraping.

Secondly, I take a look at the source code of the web page to explore whether each 'field' I want to scrape (e.g. Postcode, Picture, etc.) is contained neatly in its own HTML element. In this case, whilst each listing is in a <div> HTML element, much of the rest of the text is only separated by line breaks.

I've used the Firebug plugin for the Firefox web browser to look at the structure of the page, as it allows me to explore in more detail than the standard 'View Source' feature of most browsers.

2) Create a new scraper on Scraper Wiki

I'm going to create a PHP scraper, as this is the programming language I'm most comfortable with, but you can also create scrapers in Python and Ruby.

The PHP startup scraper loads with some basic code already in place for fetching a web page and starting to parse it. It makes use of the simple_html_dom library, which allows you to access elements of web pages using simple selectors.
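The generated starter code looks roughly like the sketch below. This assumes ScraperWiki's classic PHP environment, where the scraperwiki helper class is available and simple_html_dom is bundled; the exact boilerplate may differ slightly, and the URL is just a placeholder to replace.

```php
<?php
// simple_html_dom is bundled with the ScraperWiki environment
require 'scraperwiki/simple_html_dom.php';

// Fetch the raw HTML of the page you want to scrape
$html = scraperwiki::scrape("http://example.com/garages-to-rent");

// Parse it so we can query elements with simple CSS-style selectors
$dom = new simple_html_dom();
$dom->load($html);

// The default template loops over every table cell and prints its text
foreach ($dom->find('td') as $data) {
    print $data->plaintext . "\n";
}
?>
```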

Change the default URL so that ScraperWiki fetches the page you are interested in. Then change the line foreach($dom->find('td') as $data), using a selector identified in your earlier exploration, to see if you can pick out the elements you want to scrape.

For example, each of the listings of Garages to Rent in Oxford is contained within a div with the class 'pagewidget', so I can use the selector $dom->find('div.pagewidget') to locate them. (This sort of selector will be familiar to anyone used to working with CSS, Cascading Style Sheets.)
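In the starter sketch above, that means changing just one line:

```php
// Each garage listing sits inside <div class="pagewidget">
foreach ($dom->find('div.pagewidget') as $data) {
    print $data->plaintext . "\n";
}
```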

3) Check what Scraper Wiki returns and start refining your scraper

If you click 'Run' below your scraper, you should now see a range of elements returned in the console. The default PHP template loops through all the elements that match the selector we just set and prints them out to the console.

My scraper returns quite a few elements I don't want (there must be more than just the garage listings picked out by the div.pagewidget selector), so I look for something uniform about the elements I do want. In this case, they all start with 'Site Location' (or at least the plaintext versions of them, as returned by $data->plaintext, do).

I can now add some conditional code to my scraper to only carry on processing those elements that contain 'Site Location'. I've chosen to use PHP's stristr function, which simply checks whether one string is contained in another and is case insensitive, rather than checking the exact position of the phrase, so the scraper is tolerant of any variation in the way the data is presented that I've not spotted.
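Inside the loop, that check is a single guard clause:

```php
foreach ($dom->find('div.pagewidget') as $data) {
    // Only process elements that look like garage listings; stristr()
    // is case-insensitive, so 'Site Location' and 'SITE LOCATION' both match
    if (!stristr($data->plaintext, 'Site Location')) {
        continue;
    }
    // ... carry on processing this listing ...
}
```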

4) Loop, slice and dice

The next steps will depend on how your data is formatted. You may have lots more nested selectors to work through to pick out the elements you want. You can use $data just like the $dom object earlier. So, for example, we can use $data->find("img", 0)->src; to return the 'src' attribute of the first (0) image element (img) we find in each garage listing.

Sometimes you get down to text which isn't nicely formatted in HTML, and then you will need to use different string processing to pull apart the bits you want. For example, in the garage listings we can separate each line of plain text by splitting the text at <br> elements, and then splitting each line at the colon ':' used to separate titles and values.

A check of the raw source shows the Oxford Garages page uses both <BR> and <br /> as elements, so we can use a replace function to standardise these (or we could use regular expressions for the splitting).
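A sketch of that slicing and dicing, continuing inside the loop from step 3; the exact field names will depend on the page, and $img_src is just an illustrative variable name:

```php
// The 'src' attribute of the first image in the listing (see step 4)
$img_src = $data->find("img", 0)->src;

// Standardise both <br /> and <BR> to <br> (str_ireplace searches
// case-insensitively, so '<BR>' is caught by the '<br>' search)
$text = str_ireplace(array('<br />', '<br>'), '<br>', $data->innertext);

// Split into lines at <br>, then split each line into title and value at ':'
$values = array();
foreach (explode('<br>', $text) as $line) {
    $parts = explode(':', strip_tags($line), 2);
    if (count($parts) == 2) {
        $values[trim($parts[0])] = trim($parts[1]);
    }
}
```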

In the Oxford Garages case, our data is also split across multiple pages, so once we have the scraper for a single page working right, we can nest it inside a scraper that grabs the list of pages and loops through those too. Scraper Wiki also includes useful helper code for working with forms, for sites where you have to submit searches or make selections in forms to view any data.
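One way to structure that outer page loop; the index URL and the a.listing-link selector here are hypothetical stand-ins for whatever the real index page uses, and scrape_garage_page() is an assumed wrapper around the single-page logic from steps 2 to 4:

```php
// Hypothetical index page and link selector -- substitute the real ones
$index = new simple_html_dom();
$index->load(scraperwiki::scrape("http://example.com/garages/index"));

foreach ($index->find('a.listing-link') as $link) {
    // Run the single-page scraper against each listing page in turn
    scrape_garage_page($link->href);
}
```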

5) Save each section of scraped data for use later

Towards the end of each loop through the elements you are scraping (each row in your final dataset) you will need to call the scraperwiki::save() function. This takes four parameters:

First, an array indicating the name of the unique key in your data that should be used to work out whether a record is new or an update to an existing record.

Second, an array of data values to save.

Third, the date of the record (for indexing). Leave as null to just use the date the scraper was run.

Fourth, an array of latitude and longitude if you have geocoded your data.
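Putting the four parameters together (using the 'Site location' field as the unique key, as in this recipe), a call looks like this:

```php
// unique key, row values, record date (null = date of the run), lat/lng
scraperwiki::save(array('Site location'), $values, null, null);
```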

Run your scraper and check the 'data' tab to see what is being saved.

6) (Optional) Sprinkle in some geocoding as required

If you have a UK postcode in your data, you can use the scraperwiki::gb_postcode_to_latlng() function to turn it into a latitude and longitude, and then save them into your generated dataset.

For example, we can use $lat_lng = scraperwiki::gb_postcode_to_latlng($values['Postcode']); and then, when we save our data, we add the $lat_lng values to the end of the save function:


scraperwiki::save(array('Site location'), $values, null, $lat_lng);

7) Run your scraper and explore the results

You can now run your scraper. You will be able to access the results as a CSV file, through the Scraper Wiki API, or by loading them into a Google Spreadsheet.

You can also create 'Views' onto your data, using pre-prepared templates to create maps and other useful visualisations, direct from within ScraperWiki.

ScraperWiki will run your scraper every 24 hours, meaning that, as long as it keeps working, you can rely on it as an up-to-date data source.

Below is the map I produced, showing Garages to Rent around Oxford, with the number of garages, photos, and links off to the pages with details about them.

One of the best things about Scraper Wiki overall, though, is that it is wiki-like. You can take a look at my Oxford Garages code at http://scraperwiki.com/scrapers/oxford-garages-to-rent/ and you can edit and improve it (and there are lots of potential improvements to be made).

You can also suggest scrapers you would like other people to create, or respond to requests for scrapers from others.

Drafts for the Open Data Cook Book - see http://www.opendatacookbook.net