Growing web spiders - VilniusPHP

GROWING WEB SPIDERS Juozas Kaziukėnas // juokaz.com // @juokaz

Upload: juozas-kaziukenas

Post on 14-May-2015

TRANSCRIPT

Page 1: Growing web spiders - VilniusPHP

GROWING WEB SPIDERS

Juozas Kaziukėnas // juokaz.com // @juokaz

Page 2: Growing web spiders - VilniusPHP

300’000’000 products / 24 hours = 12’500’000 products per hour

/ 3600 seconds = 3’472 products per second

/ 3000 nodes = 1.16 products per second per node (about 0.86 sec. per product)

24’000 cores on Amazon = $300/h
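
The capacity arithmetic on this slide can be re-checked in a few lines. This is only a sketch: the four inputs are the slide's numbers, every other value is derived from them here and is not a figure from the talk.

```php
<?php
// Inputs copied from the slide.
$productsPerDay   = 300000000;
$nodes            = 3000;
$cores            = 24000;
$fleetCostPerHour = 300; // USD

// Derived values (plain arithmetic, not figures from the talk).
$perHour           = $productsPerDay / 24;    // 12'500'000 products per hour
$perSecond         = $perHour / 3600;         // about 3'472 products per second
$perNodePerSecond  = $perSecond / $nodes;     // about 1.16 products per second per node
$secondsPerProduct = 1 / $perNodePerSecond;   // about 0.86 seconds per product per node
$coresPerNode      = $cores / $nodes;         // 8 cores per node
$costPerDay        = $fleetCostPerHour * 24;  // 7'200 USD per day

printf("%d products/s in total\n", $perSecond);
printf("%.2f products/s per node, %.2f s per product\n", $perNodePerSecond, $secondsPerProduct);
printf("%d cores per node, USD %d per day\n", $coresPerNode, $costPerDay);
```

Each node therefore has a little under one second of wall-clock time per product, which is why the rest of the deck focuses on not wasting time waiting for I/O.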

Page 3: Growing web spiders - VilniusPHP

Juozas Kaziukėnas, Lithuanian

You can call me Joe

More info http://juokaz.com

Page 4: Growing web spiders - VilniusPHP

WHY CRAWL?

Page 5: Growing web spiders - VilniusPHP

WE NEED DATA

1. Get data
2. ???
3. Profit

Page 6: Growing web spiders - VilniusPHP

IF PEOPLE ARE SCRAPING YOUR SITE, YOU HAVE DATA PEOPLE WANT. CONSIDER MAKING AN API.

Russell Ahlstrom

Page 7: Growing web spiders - VilniusPHP

DATA SCIENCE

Page 8: Growing web spiders - VilniusPHP

1. FIGURE OUT WHAT TO REQUEST
2. MAKE A REQUEST
3. PARSE THE RESPONSE
4. STORE RESULTS
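
The four steps can be sketched as a single loop. This is a minimal illustration, not the speaker's code: `crawl()`, `$fetch` and `$parse` are hypothetical names, and the callables are injected so the loop can be tried without touching the network.

```php
<?php
/**
 * Sketch of the four-step loop. $fetch and $parse are hypothetical callables:
 *   $fetch(string $url): ?string              returns the body, or null on failure
 *   $parse(string $url, string $body): array  returns ['data' => mixed, 'links' => string[]]
 */
function crawl(array $seedUrls, callable $fetch, callable $parse, int $maxPages = 100): array
{
    $pending = array_values(array_unique($seedUrls));
    $seen    = array_fill_keys($pending, true);
    $results = [];

    for ($requests = 0; $pending !== [] && $requests < $maxPages; $requests++) {
        $url  = array_shift($pending);       // 1. figure out what to request
        $body = $fetch($url);                // 2. make a request
        if ($body === null) {
            continue;                        // failed request: skipped (no retry in this sketch)
        }
        $parsed = $parse($url, $body);       // 3. parse the response
        $results[$url] = $parsed['data'];    // 4. store results (here: in memory)

        foreach ($parsed['links'] as $link) {
            if (!isset($seen[$link])) {      // never queue the same URL twice
                $seen[$link] = true;
                $pending[] = $link;
            }
        }
    }

    return $results;
}
```

In a real spider each of the four steps would likely be its own worker, but the order of operations stays the same.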

Page 9: Growing web spiders - VilniusPHP

WHAT TO EXTRACT

Page 10: Growing web spiders - VilniusPHP

AS LITTLE AS POSSIBLE

Page 11: Growing web spiders - VilniusPHP

MAKE A REQUEST

Page 12: Growing web spiders - VilniusPHP

file_get_contents($url);

Page 13: Growing web spiders - VilniusPHP

HANDLING HTTP ERRORS
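
One way to handle HTTP errors is to map every finished request to a small set of outcomes before deciding what to do next. The sketch below assumes the status code comes from `curl_getinfo($handle, CURLINFO_HTTP_CODE)` and the transport error from `curl_errno($handle)`; the function name and the category names are invented for illustration.

```php
<?php
/**
 * Decide what to do with a finished request.
 * The returned category names are made up for this sketch.
 */
function classifyResponse(int $httpCode, int $curlErrno = 0): string
{
    if ($curlErrno !== 0 || $httpCode === 0) {
        return 'retry';      // DNS failure, timeout, reset: no HTTP response at all
    }
    if ($httpCode >= 200 && $httpCode < 300) {
        return 'ok';
    }
    if ($httpCode >= 300 && $httpCode < 400) {
        return 'redirect';   // only seen when CURLOPT_FOLLOWLOCATION is off
    }
    if ($httpCode === 429 || $httpCode >= 500) {
        return 'retry';      // rate limited or server-side trouble: try again later
    }
    if ($httpCode === 403) {
        return 'blocked';    // likely a ban: slow down or switch proxy
    }
    return 'fail';           // anything else: retrying the same request will not help
}
```

Keeping this decision in one place makes it easy to count each outcome later and to tune which codes are worth a retry.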

Page 14: Growing web spiders - VilniusPHP

OPTIMIZE HTTP REQUESTS

Page 15: Growing web spiders - VilniusPHP

function get($url) {
  // Create a handle.
  $handle = curl_init($url);

  // Set options...

  // Do the request.
  $ret = curl_exec($handle);

  // Do stuff with the results...

  // Destroy the handle.
  curl_close($handle);
}

Page 16: Growing web spiders - VilniusPHP

function get($url) {
  // Create a handle.
  $handle = curl_init($url);

  // Set options...

  // Do the request.
  $ret = curlExecWithMulti($handle);

  // Do stuff with the results...

  // Destroy the handle.
  curl_close($handle);
}

Page 17: Growing web spiders - VilniusPHP

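function curlExecWithMulti($handle) {
  // In real life this is a class variable.
  // Reusing one multi handle lets cURL keep connections alive between calls.
  static $multi = NULL;

  // Create a multi if necessary.
  if (empty($multi)) {
    $multi = curl_multi_init();
  }

  // Add the handle to be processed.
  curl_multi_add_handle($multi, $handle);

  // Do all the processing.
  $active = NULL;
  do {
    $ret = curl_multi_exec($multi, $active);
  } while ($ret == CURLM_CALL_MULTI_PERFORM);

  while ($active && $ret == CURLM_OK) {
    // curl_multi_select() can return -1 when there is nothing to wait on;
    // sleep briefly instead of spinning, then keep driving the transfer.
    if (curl_multi_select($multi) == -1) {
      usleep(100);
    }
    do {
      // Update $ret (not a separate variable) so the outer loop sees errors.
      $ret = curl_multi_exec($multi, $active);
    } while ($ret == CURLM_CALL_MULTI_PERFORM);
  }

  // Remove the handle from the multi processor.
  curl_multi_remove_handle($multi, $handle);

  // With CURLOPT_RETURNTRANSFER set, the body is available
  // via curl_multi_getcontent($handle).
  return TRUE;
}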

Page 18: Growing web spiders - VilniusPHP

QUEUES FOR EVERYTHING
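
A rough sketch of what "queues for everything" can look like: every stage reads jobs from one queue and writes its results to the next. The in-memory `SplQueue` objects stand in for a real message broker, and `runStage()` plus the three stage callbacks are hypothetical names used only for this illustration.

```php
<?php
/**
 * Drain $in, run $work on every job, and push non-null results to $out.
 * Returns the number of jobs processed.
 */
function runStage(SplQueue $in, ?SplQueue $out, callable $work): int
{
    $processed = 0;
    while (!$in->isEmpty()) {
        $result = $work($in->dequeue());
        if ($out !== null && $result !== null) {
            $out->enqueue($result);
        }
        $processed++;
    }
    return $processed;
}

// One queue between every pair of stages (assumption: in-memory stand-ins).
$fetchQueue = new SplQueue();
$parseQueue = new SplQueue();
$storeQueue = new SplQueue();
$stored     = [];

$fetchQueue->enqueue('http://example.test/p/1');
$fetchQueue->enqueue('http://example.test/p/2');

runStage($fetchQueue, $parseQueue, function (string $url): array {
    return ['url' => $url, 'body' => '<h1>Product</h1>'];  // stand-in for an HTTP response
});
runStage($parseQueue, $storeQueue, function (array $page): array {
    return ['url' => $page['url'], 'title' => strip_tags($page['body'])];
});
runStage($storeQueue, null, function (array $record) use (&$stored) {
    $stored[] = $record;                                   // stand-in for a database write
    return null;
});
```

Because the stages only share queues, each one can be scaled, restarted, or retried on its own.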

Page 19: Growing web spiders - VilniusPHP

ASYNCHRONOUS PROCESSING

Page 20: Growing web spiders - VilniusPHP

DO NOT BLOCK FOR I/O

Page 21: Growing web spiders - VilniusPHP

RETRIES
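
A common way to implement retries is exponential backoff: wait a little after the first failure and twice as long after each further one. The slide does not say which strategy was used, so treat this as one possible sketch; `withRetries()` and its parameters are hypothetical.

```php
<?php
/**
 * Call $operation until it succeeds or $maxAttempts is reached.
 * $operation receives the attempt number and signals failure by throwing
 * a RuntimeException. $sleep is injectable so the delays can be observed.
 */
function withRetries(callable $operation, int $maxAttempts = 3, int $baseDelayMs = 200, ?callable $sleep = null)
{
    $maxAttempts = max(1, $maxAttempts);
    $sleep = $sleep ?? function (int $ms): void {
        usleep($ms * 1000);
    };

    $lastError = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $operation($attempt);
        } catch (RuntimeException $e) {
            $lastError = $e;
            if ($attempt < $maxAttempts) {
                // Delay doubles each time: base, 2 * base, 4 * base, ...
                $sleep($baseDelayMs * (2 ** ($attempt - 1)));
            }
        }
    }

    // Every attempt failed: surface the last error to the caller.
    throw $lastError;
}
```

With queues in place, the same idea can also be expressed by putting a failed job back on the queue with a delay instead of sleeping inside the worker.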

Page 22: Growing web spiders - VilniusPHP

REGULAR EXPRESSIONS

Page 23: Growing web spiders - VilniusPHP

REGULAR EXPRESSIONS NOT

Page 24: Growing web spiders - VilniusPHP

XPATH
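
A short sketch of XPath extraction with PHP's bundled DOM extension. The markup, the class names `title` and `price`, and the function name `extractProduct()` are assumptions made for this example, not something shown in the talk.

```php
<?php
/**
 * Pull two fields out of a product page with XPath.
 * Missing nodes come back as empty strings rather than errors.
 */
function extractProduct(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    return [
        // string(...) returns the text of the first matching node, or '' if none matches.
        'title' => trim($xpath->evaluate('string(//h1[@class="title"])')),
        'price' => trim($xpath->evaluate('string(//span[@class="price"])')),
    ];
}

$html = '<html><body><h1 class="title"> Red shoes </h1>'
      . '<p>Only <span class="price">19.99</span></p></body></html>';
$product = extractProduct($html);
```

Unlike a regular expression, the query keeps working when attributes are reordered or whitespace changes, as long as the element structure stays the same.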

Page 25: Growing web spiders - VilniusPHP

PHANTOM.JS/SELENIUM

Page 26: Growing web spiders - VilniusPHP

WHAT HAPPENS WHEN THE PAGE CHANGES
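
One hedged answer to this question: validate every parsed record, and raise an alarm when too many records start failing at once. The function names, the required fields, and the 20% threshold below are all assumptions for illustration.

```php
<?php
/**
 * Return the names of required fields that are absent or blank in $record.
 */
function missingFields(array $record, array $requiredFields): array
{
    $missing = [];
    foreach ($requiredFields as $field) {
        if (!isset($record[$field]) || trim((string) $record[$field]) === '') {
            $missing[] = $field;
        }
    }
    return $missing;
}

/**
 * A single bad page is noise; a high share of bad pages suggests the
 * site layout changed and the selectors need updating.
 */
function layoutProbablyChanged(int $failedPages, int $totalPages, float $threshold = 0.2): bool
{
    return $totalPages > 0 && ($failedPages / $totalPages) > $threshold;
}
```

Records with missing fields are better kept out of storage and counted, so that a silent layout change shows up in the metrics instead of in the data.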

Page 27: Growing web spiders - VilniusPHP

ACTING LIKE A HUMAN

Page 28: Growing web spiders - VilniusPHP

HTTP HEADERS

Page 29: Growing web spiders - VilniusPHP

$header = array();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // Browsers keep this blank.

curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);

Page 30: Growing web spiders - VilniusPHP

COOKIES AND SESSIONS

curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieJar);
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieJar);

Page 31: Growing web spiders - VilniusPHP

AVOIDING GETTING BLOCKED

Page 32: Growing web spiders - VilniusPHP

DO NOT DDOS
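
A minimal sketch of per-host throttling, assuming a fixed minimum delay between two requests to the same host. The class name `HostThrottle` and the delay values are hypothetical; the current time is passed in so the behaviour can be checked without actually waiting.

```php
<?php
final class HostThrottle
{
    private float $minDelay;
    private array $lastSlot = [];   // host => timestamp of the latest scheduled request

    public function __construct(float $minDelaySeconds)
    {
        $this->minDelay = $minDelaySeconds;
    }

    /**
     * Reserve the next request slot for the URL's host and return how many
     * seconds the caller should wait before sending the request.
     */
    public function reserve(string $url, float $now): float
    {
        $host = parse_url($url, PHP_URL_HOST) ?: $url;

        $earliest = isset($this->lastSlot[$host])
            ? $this->lastSlot[$host] + $this->minDelay
            : $now;

        $scheduled = max($now, $earliest);
        $this->lastSlot[$host] = $scheduled;

        return $scheduled - $now;
    }
}
```

A worker would call something like `$wait = $throttle->reserve($url, microtime(true));` followed by `usleep((int) round($wait * 1000000));` before making the request. Requests to different hosts do not delay each other.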

Page 33: Growing web spiders - VilniusPHP

PROXY NETWORK

HAProxy
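
The slide names HAProxy, which would balance requests across proxies outside of PHP. As a client-side alternative, shown here only to illustrate the idea of rotation, a crawler could pick the proxy itself. `ProxyPool` and the proxy addresses are hypothetical.

```php
<?php
final class ProxyPool
{
    private array $proxies;
    private array $failures = [];   // proxy => number of reported failures
    private int $cursor = 0;
    private int $maxFailures;

    public function __construct(array $proxies, int $maxFailures = 3)
    {
        $this->proxies = array_values($proxies);
        $this->maxFailures = $maxFailures;
    }

    /** Next proxy in round-robin order, skipping ones that failed too often. */
    public function next(): ?string
    {
        $count = count($this->proxies);
        for ($i = 0; $i < $count; $i++) {
            $proxy = $this->proxies[$this->cursor % $count];
            $this->cursor++;
            if (($this->failures[$proxy] ?? 0) < $this->maxFailures) {
                return $proxy;
            }
        }
        return null;   // every proxy is considered dead
    }

    public function reportFailure(string $proxy): void
    {
        $this->failures[$proxy] = ($this->failures[$proxy] ?? 0) + 1;
    }
}
```

The chosen address would then be passed to cURL with `curl_setopt($curl, CURLOPT_PROXY, $proxy);`.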

Page 34: Growing web spiders - VilniusPHP

ACT LIKE A HUMAN BROWSING THE PAGE

curl_setopt($curl, CURLOPT_AUTOREFERER, true);

Page 35: Growing web spiders - VilniusPHP

ROBOTS.TXT
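
A deliberately simplified robots.txt check, to show the idea. It assumes exact (case-insensitive) user-agent names, plain path prefixes, and "longest matching rule wins"; it does not handle wildcards, `$` anchors, or `Crawl-delay`. Both function names are hypothetical.

```php
<?php
/** Parse robots.txt into: lowercase agent name => list of [type, pathPrefix]. */
function parseRobotsTxt(string $robotsTxt): array
{
    $groups = [];
    $currentAgents = [];
    $lastWasAgent = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));   // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if (!$lastWasAgent) {
                $currentAgents = [];                        // a new group starts here
            }
            $agent = strtolower($value);
            $currentAgents[] = $agent;
            $groups[$agent] = $groups[$agent] ?? [];
            $lastWasAgent = true;
        } elseif ($field === 'allow' || $field === 'disallow') {
            foreach ($currentAgents as $agent) {
                if ($value !== '') {                        // empty Disallow means "no restriction"
                    $groups[$agent][] = [$field, $value];
                }
            }
            $lastWasAgent = false;
        }
    }
    return $groups;
}

/** Longest matching prefix decides; Allow wins a tie; no match means allowed. */
function isAllowed(array $groups, string $userAgent, string $path): bool
{
    $rules = $groups[strtolower($userAgent)] ?? $groups['*'] ?? [];
    $bestLength = -1;
    $allowed = true;

    foreach ($rules as [$type, $prefix]) {
        $length = strlen($prefix);
        if (strncmp($path, $prefix, $length) !== 0) {
            continue;
        }
        if ($length > $bestLength || ($length === $bestLength && $type === 'allow')) {
            $bestLength = $length;
            $allowed = ($type === 'allow');
        }
    }
    return $allowed;
}
```

For production use, a maintained parser that follows the full Robots Exclusion Protocol is the safer choice.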

Page 36: Growing web spiders - VilniusPHP

LEGAL ISSUES

Page 37: Growing web spiders - VilniusPHP

YOU ARE GOING TO GET SUED

Page 38: Growing web spiders - VilniusPHP

MEASURE EVERYTHING

Page 39: Growing web spiders - VilniusPHP

1. Response time
2. Response size
3. HTTP error type
4. Retries count
5. Failing proxy IP
6. Failing parsing
7. etc.
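
The measurements listed above can be sketched with a tiny in-memory collector. In practice these numbers would probably go to a metrics service; the class `CrawlMetrics` and the metric names used here are assumptions for illustration.

```php
<?php
final class CrawlMetrics
{
    private array $counters = [];   // name => integer count
    private array $samples = [];    // name => list of observed values

    /** Count an event, e.g. an HTTP error type, a retry, a failing proxy. */
    public function increment(string $name, int $by = 1): void
    {
        $this->counters[$name] = ($this->counters[$name] ?? 0) + $by;
    }

    /** Record a measured value, e.g. response time or response size. */
    public function observe(string $name, float $value): void
    {
        $this->samples[$name][] = $value;
    }

    public function count(string $name): int
    {
        return $this->counters[$name] ?? 0;
    }

    public function average(string $name): ?float
    {
        $values = $this->samples[$name] ?? [];
        return $values === [] ? null : array_sum($values) / count($values);
    }
}

$metrics = new CrawlMetrics();
$metrics->observe('response_time', 0.2);
$metrics->observe('response_time', 0.4);
$metrics->observe('response_size', 2048);
$metrics->increment('http_error.503');
$metrics->increment('retries', 2);
$metrics->increment('proxy_failure.10.0.0.2');
$metrics->increment('parse_failure');
```

Naming counters by error type or proxy address makes it straightforward to see which proxy or which parser is responsible when failure rates climb.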

Page 40: Growing web spiders - VilniusPHP

OPTIMIZE AND REPEAT

Page 41: Growing web spiders - VilniusPHP

WEB CRAWLING FOR FUN AND PROFIT

Page 42: Growing web spiders - VilniusPHP

THANKS!

Juozas Kaziukėnas

@juokaz