London XQuery Meetup: Querying the World (Web Scraping)

XQuery: Querying the World (formerly known as Web Scraping), by Dennis Knochenwefel

Upload: dennis-knochenwefel

Post on 25-May-2015

DESCRIPTION

Presentation held at the London XQuery Meetup in September 2011. It shows how web scraping has naturally evolved towards XQuery and discusses the obstacles encountered when scraping websites. A live example demonstrates how XQuery solves these problems.

TRANSCRIPT

Page 1: London XQuery Meetup: Querying the World (Web Scraping)

XQuery: Querying the World (formerly known as Web Scraping)

Dennis Knochenwefel

Page 2: London XQuery Meetup: Querying the World (Web Scraping)

Evolution

Web Scraping

Page 3: London XQuery Meetup: Querying the World (Web Scraping)

source: http://www.bradino.com/php/screen-scraping/

PHP (2007)

// Fetch the roster page and flatten whitespace so string offsets are predictable.
$url = "http://www.nfl.com/teams/sandiegochargers/roster?team=SD";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

// Cut out the roster table by searching for its opening and closing tags.
$start = strpos($content,'<table cellpadding="2" class="standard_table"');
$end = strpos($content,'</table>',$start) + 8;
$table = substr($content,$start,$end-$start);

// Split the table into rows, skip header rows, and read the first three cells.
preg_match_all("|<tr(.*)</tr>|U",$table,$rows);
foreach ($rows[0] as $row){
    if ((strpos($row,'<th')===false)){
        preg_match_all("|<td(.*)</td>|U",$row,$cells);
        $number = strip_tags($cells[0][0]);
        $name = strip_tags($cells[0][1]);
        $position = strip_tags($cells[0][2]);
        echo "{$position} - {$name} - Number {$number} <br>\n";
    }
}

Page 4: London XQuery Meetup: Querying the World (Web Scraping)

source: http://stackoverflow.com/questions/6283361/unable-to-get-table-data-from-a-html-page

PHP (June 2011)

$url="http://www.rtu.ac.in/results/reformat.php";
$post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";

$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST,1);
curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);
curl_close($ch);

$totalPath="html/body/table[4]/tbody/tr[3]/td[4]";
$page=new DOMDocument();
$xpath=new DOMXPath($page);
$page->loadHTML($content);
$page->saveHTML();  // this shows the page contents
$total=$xpath->query($totalPath);
echo $total->length;    //shows 0
echo $total->item(0)->nodeValue;   //shows nothing
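The slide does not say why the query comes back empty. One common cause with positional paths like this one (an assumption here, not something the deck states) is that the path was copied from a browser inspector: browsers add a `<tbody>` element to their DOM, while the markup a script downloads may not contain one. The sketch below reproduces that effect in Python with only the standard library. The page content and the shortened path are made up for illustration and are not the real result page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the result page (not the real markup).
PAGE = """<html><body>
<table>
  <tr><th>Subject</th><th>Marks</th></tr>
  <tr><td>Maths</td><td>71</td></tr>
  <tr><td>Physics</td><td>64</td></tr>
</table>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Path as a browser inspector would show it: it assumes a <tbody>
# that the downloaded source does not contain.
with_tbody = root.findall("body/table/tbody/tr[3]/td[2]")

# The same path without the browser-inserted <tbody>.
without_tbody = root.findall("body/table/tr[3]/td[2]")

print(len(with_tbody))                    # 0
print([td.text for td in without_tbody])  # ['64']
```

Long positional paths break as soon as the page layout shifts, so selecting by an attribute or by nearby text is usually more robust when the page offers one.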

Page 5: London XQuery Meetup: Querying the World (Web Scraping)

XQuery

Page 6: London XQuery Meetup: Querying the World (Web Scraping)

Real World Example

Page 7: London XQuery Meetup: Querying the World (Web Scraping)

awesome site

awesome data

no API

Page 8: London XQuery Meetup: Querying the World (Web Scraping)

Deal with sessions

Page 9: London XQuery Meetup: Querying the World (Web Scraping)

Need to emulate setting options
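Both obstacles come down to doing what a browser does: keeping cookies between requests, and submitting the same fields an HTML form would send. A minimal sketch in Python with only the standard library follows. The helper names `make_session` and `form_request`, the URL, and the field names are all hypothetical and are not taken from the talk's example site. Nothing is sent over the network here.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def make_session():
    """Return an opener that stores cookies and sends them back, plus its jar."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


def form_request(url, fields):
    """Build the POST request a browser would send when the form is submitted."""
    body = urlencode(fields).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers)


opener, jar = make_session()

# Hypothetical option form: the field names are placeholders.
req = form_request("http://example.com/options", {"country": "BY", "year": "2011"})

print(req.get_method())  # POST
print(req.data)          # b'country=BY&year=2011'

# opener.open(req) would submit the form; any session cookie set by the
# response lands in `jar` and is sent again on later opener.open() calls.
```

Reusing one opener for every request is what makes the session persist: the cookie jar plays the role of the browser's cookie store.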

Page 10: London XQuery Meetup: Querying the World (Web Scraping)

Different Notions

Publisher <=> Consumer

Page 11: London XQuery Meetup: Querying the World (Web Scraping)

Website App

CSV! HTML! XLS! Zip!

JSON? XML?

Page 12: London XQuery Meetup: Querying the World (Web Scraping)

Website App

CSV! HTML! XLS! Zip!

JSON? XML?

Session!

Stateless

REST

API ?

Page 13: London XQuery Meetup: Querying the World (Web Scraping)

Website App

CSV! HTML! XLS! Zip!

JSON? XML?

Session!

Stateless

REST

API ?

Customize with URL Params

HTML Forms

Page 14: London XQuery Meetup: Querying the World (Web Scraping)

Website App

CSV! HTML! XLS! Zip!

JSON? XML?

Session!

Stateless

REST

API ?

Customize with URL Params

HTML Forms

Page 15: London XQuery Meetup: Querying the World (Web Scraping)

Website App

CSV! HTML! XLS! Zip!

Session!

HTML Forms

HTML!

Session!

HTML Forms

XQuery!

Page 16: London XQuery Meetup: Querying the World (Web Scraping)

Summary

Page 17: London XQuery Meetup: Querying the World (Web Scraping)

XQuery Web Data Processing

A browser can do it?

XQuery can do it!

Session handling!

Forms!

Page 18: London XQuery Meetup: Querying the World (Web Scraping)

Result: http://www.unemployment.by/country