Growing web spiders - VilniusPHP
![Page 1: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/1.jpg)
GROWING WEB SPIDERS
Juozas Kaziukėnas // juokaz.com // @juokaz
![Page 2: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/2.jpg)
300’000’000 products
/ 24 hours = 12’500’000 products per hour
/ 3600 seconds = 3’472 products per second
/ 3000 nodes = 1.16 products per second per node (about 0.86 sec. per product)
24’000 cores on Amazon = $300/h
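The division chain above can be checked with a few lines of PHP. This is only a sketch of the arithmetic; the variable names are illustrative and the input figures are the ones on the slide.

```php
<?php
// Back-of-the-envelope throughput, using the figures from the slide.
$productsPerDay  = 300000000;
$productsPerHour = $productsPerDay / 24;      // 12,500,000
$productsPerSec  = $productsPerHour / 3600;   // about 3,472
$nodes           = 3000;
$perNodePerSec   = $productsPerSec / $nodes;  // products each node must handle per second
$secPerProduct   = 1 / $perNodePerSec;        // time budget per product on one node

printf("%.0f products/s overall\n", $productsPerSec);
printf("%.2f products/s per node, %.2f s per product\n", $perNodePerSec, $secPerProduct);
```

Each node therefore has well under a second to fetch, parse, and store one product.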
![Page 3: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/3.jpg)
Juozas Kaziukėnas, Lithuanian
You can call me Joe
More info http://juokaz.com
![Page 4: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/4.jpg)
WHY CRAWL?
![Page 5: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/5.jpg)
WE NEED DATA
1. Get data
2. ???
3. Profit
![Page 6: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/6.jpg)
IF PEOPLE ARE SCRAPING YOUR SITE, YOU HAVE DATA PEOPLE WANT. CONSIDER MAKING AN API
Russell Ahlstrom
![Page 7: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/7.jpg)
DATA SCIENCE
![Page 8: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/8.jpg)
1. FIGURE OUT WHAT TO REQUEST
2. MAKE A REQUEST
3. PARSE THE RESPONSE
4. STORE RESULTS
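The four steps can be sketched as one loop around a queue. This is a minimal illustration, not the speaker's implementation: `crawl()`, `$fetch`, and `$parse` are hypothetical names, and the fetcher is injected so the example runs without network access (in practice it would wrap cURL).

```php
<?php
// Hypothetical crawl loop covering the four steps.
// $fetch: fn(string $url): ?string   returns the body, or null on failure
// $parse: fn(string $body): array    returns [record, list of further URLs]
function crawl(array $seedUrls, callable $fetch, callable $parse): array
{
    $queue   = new SplQueue();
    $seen    = [];
    $results = [];
    foreach ($seedUrls as $url) {
        $queue->enqueue($url);
    }

    while (!$queue->isEmpty()) {
        $url = $queue->dequeue();              // 1. figure out what to request
        if (isset($seen[$url])) {
            continue;                          // never request the same URL twice
        }
        $seen[$url] = true;

        $body = $fetch($url);                  // 2. make a request
        if ($body === null) {
            continue;
        }

        [$record, $links] = $parse($body);     // 3. parse the response
        $results[$url] = $record;              // 4. store results
        foreach ($links as $link) {
            $queue->enqueue($link);
        }
    }
    return $results;
}

// Usage with canned pages standing in for real HTTP responses.
$pages = [
    'http://example.test/a' => '{"title":"A","links":["http://example.test/b"]}',
    'http://example.test/b' => '{"title":"B","links":["http://example.test/a","http://example.test/missing"]}',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? null;
};
$parse = function (string $body): array {
    $data = json_decode($body, true);
    return [$data['title'], $data['links']];
};
$results = crawl(['http://example.test/a'], $fetch, $parse);
```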
![Page 9: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/9.jpg)
WHAT TO EXTRACT
![Page 10: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/10.jpg)
AS LITTLE AS POSSIBLE
![Page 11: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/11.jpg)
MAKE A REQUEST
![Page 12: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/12.jpg)
file_get_contents($url);
![Page 13: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/13.jpg)
HANDLING HTTP ERRORS
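One way to handle HTTP errors is to map each status code to a decision before doing anything else with the response. The sketch below is an assumption about a reasonable policy, not something stated on the slide; `classifyStatus()` is a hypothetical helper. With cURL the status is available from `curl_getinfo($handle, CURLINFO_HTTP_CODE)`, which reports 0 when no response was received.

```php
<?php
// Hypothetical policy: decide what to do with a response based on its status.
function classifyStatus(int $status): string
{
    if ($status >= 200 && $status < 300) {
        return 'ok';
    }
    // 0 means cURL got no HTTP response at all (timeout, DNS failure, reset).
    if ($status === 0 || $status === 429 || $status >= 500) {
        return 'retry';
    }
    if ($status === 404 || $status === 410) {
        return 'gone';   // the page no longer exists; do not request it again
    }
    // Everything else, including unfollowed redirects and other 4xx codes.
    return 'fail';
}
```

For example, `classifyStatus(503)` yields `'retry'` while `classifyStatus(403)` yields `'fail'`.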
![Page 14: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/14.jpg)
OPTIMIZE HTTP REQUESTS
![Page 15: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/15.jpg)
function get($url) {
    // Create a handle.
    $handle = curl_init($url);

    // Set options...

    // Do the request.
    $ret = curl_exec($handle);

    // Do stuff with the results...

    // Destroy the handle.
    curl_close($handle);
}
![Page 16: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/16.jpg)
function get($url) {
    // Create a handle.
    $handle = curl_init($url);

    // Set options...

    // Do the request.
    $ret = curlExecWithMulti($handle);

    // Do stuff with the results...

    // Destroy the handle.
    curl_close($handle);
}
![Page 17: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/17.jpg)
function curlExecWithMulti($handle) {
    // In real life this is a class variable.
    static $multi = NULL;

    // Create a multi if necessary.
    if (empty($multi)) {
        $multi = curl_multi_init();
    }

    // Add the handle to be processed.
    curl_multi_add_handle($multi, $handle);

    // Do all the processing.
    $active = NULL;
    do {
        $ret = curl_multi_exec($multi, $active);
    } while ($ret == CURLM_CALL_MULTI_PERFORM);

    while ($active && $ret == CURLM_OK) {
        if (curl_multi_select($multi) != -1) {
            do {
                // Reuse $ret so the outer loop stops on an error.
                $ret = curl_multi_exec($multi, $active);
            } while ($ret == CURLM_CALL_MULTI_PERFORM);
        }
    }

    // Remove the handle from the multi processor.
    curl_multi_remove_handle($multi, $handle);

    return TRUE;
}
![Page 18: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/18.jpg)
QUEUES FOR EVERYTHING
![Page 19: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/19.jpg)
ASYNCHRONOUS PROCESSING
![Page 20: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/20.jpg)
DO NOT BLOCK FOR I/O
![Page 21: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/21.jpg)
RETRIES
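A common way to retry is with exponential backoff, so a struggling server gets progressively more breathing room. The slide does not say which strategy was used; the sketch below assumes backoff, and `backoffDelay()` and `withRetries()` are hypothetical helpers. The sleep function is injectable so the logic can be exercised without actually waiting.

```php
<?php
// Delay before the next attempt: base, 2x base, 4x base, ... capped at $cap seconds.
function backoffDelay(int $attempt, float $base = 0.5, float $cap = 30.0): float
{
    return min($cap, $base * (2 ** ($attempt - 1)));
}

// Runs $operation until it succeeds or $maxAttempts is reached.
// Only RuntimeException is treated as retryable in this sketch.
function withRetries(callable $operation, int $maxAttempts = 4, ?callable $sleep = null)
{
    $sleep = $sleep ?? function (float $seconds): void {
        usleep((int) ($seconds * 1000000));
    };

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation($attempt);
        } catch (RuntimeException $e) {
            if ($attempt >= $maxAttempts) {
                throw $e;                      // out of attempts: give up
            }
            $sleep(backoffDelay($attempt));    // wait, then try again
        }
    }
}
```

A caller would wrap the fetch, for example `withRetries(function () use ($url) { return fetchOrThrow($url); })`, where `fetchOrThrow` stands for whatever function throws on a retryable failure.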
![Page 22: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/22.jpg)
REGULAR EXPRESSIONS
![Page 23: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/23.jpg)
REGULAR EXPRESSIONS NOT
![Page 24: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/24.jpg)
XPATH
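In PHP, XPath queries are available through the bundled DOM extension. The sketch below runs against an inline HTML snippet; the markup and class names are made up for illustration.

```php
<?php
// Made-up product page used only for this example.
$html = '<html><body><div class="product">'
      . '<h1>Blue Widget</h1><span class="price">9.99</span>'
      . '</div></body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// string(...) makes evaluate() return the text of the first matching node.
$name  = $xpath->evaluate('string(//div[@class="product"]/h1)');
$price = (float) $xpath->evaluate('string(//span[@class="price"])');
```

Unlike a regular expression, the query keeps working if attributes are reordered or whitespace changes, although it still breaks when the element structure itself changes.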
![Page 25: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/25.jpg)
PHANTOM.JS/SELENIUM
![Page 26: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/26.jpg)
WHAT HAPPENS WHEN THE PAGE CHANGES
![Page 27: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/27.jpg)
ACTING LIKE A HUMAN
![Page 28: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/28.jpg)
HTTP HEADERS
![Page 29: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/29.jpg)
$header = array();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // Browsers keep this blank.
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
![Page 30: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/30.jpg)
COOKIES AND SESSIONS
curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieJar);
curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieJar);
![Page 31: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/31.jpg)
AVOIDING GETTING BLOCKED
![Page 32: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/32.jpg)
DO NOT DDOS
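A simple way to avoid hammering a site is to enforce a minimum interval between requests to the same host. The class below is a sketch under that assumption; `HostThrottle` and its interval are hypothetical, and the current time is passed in so the behaviour is easy to verify.

```php
<?php
// Hypothetical per-host throttle: at most one request per $minInterval seconds per host.
final class HostThrottle
{
    private $minInterval;
    private $nextAllowed = [];   // host => earliest time the next request may start

    public function __construct(float $minIntervalSeconds)
    {
        $this->minInterval = $minIntervalSeconds;
    }

    // Returns how many seconds to wait before requesting $url, and reserves that slot.
    public function reserve(string $url, float $now): float
    {
        $host      = parse_url($url, PHP_URL_HOST) ?: $url;
        $allowedAt = $this->nextAllowed[$host] ?? $now;
        $startAt   = max($now, $allowedAt);
        $this->nextAllowed[$host] = $startAt + $this->minInterval;
        return $startAt - $now;
    }
}

// Usage: two requests to the same host half a second apart, with a 2 second interval.
$throttle = new HostThrottle(2.0);
$first  = $throttle->reserve('http://a.example/1', 100.0);  // no wait
$second = $throttle->reserve('http://a.example/2', 100.5);  // must wait for the slot
$other  = $throttle->reserve('http://b.example/', 100.5);   // different host, no wait
```

In a real crawler `$now` would come from `microtime(true)` and the returned delay would be used to reschedule the job rather than to sleep in place.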
![Page 33: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/33.jpg)
PROXY NETWORK
HAProxy
![Page 34: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/34.jpg)
ACT LIKE A HUMAN BROWSING THE PAGE
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
![Page 35: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/35.jpg)
ROBOTS.TXT
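Checking robots.txt before requesting a path can be sketched in a few lines. This is a deliberately simplified reading of the format: it only collects `Disallow` prefixes for one user agent and ignores `Allow`, wildcards, and `Crawl-delay`. Both function names are hypothetical.

```php
<?php
// Collects the Disallow prefixes that apply to $agent (simplified parsing).
function disallowedPaths(string $robotsTxt, string $agent = '*'): array
{
    $rules        = [];
    $applies      = false;
    $lastWasAgent = false;

    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));   // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            $matches = (strcasecmp($value, $agent) === 0);
            // Consecutive User-agent lines share one group of rules.
            $applies = $lastWasAgent ? ($applies || $matches) : $matches;
            $lastWasAgent = true;
            continue;
        }
        $lastWasAgent = false;

        if ($field === 'disallow' && $applies && $value !== '') {
            $rules[] = $value;
        }
    }
    return $rules;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
function isAllowed(string $path, array $disallowed): bool
{
    foreach ($disallowed as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

$robots = "User-agent: *\nDisallow: /admin/\nDisallow: /cart # private\n\n"
        . "User-agent: badbot\nDisallow: /\n";
$rules = disallowedPaths($robots);
```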
![Page 36: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/36.jpg)
LEGAL ISSUES
![Page 37: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/37.jpg)
YOU ARE GOING TO GET SUED
![Page 38: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/38.jpg)
MEASURE EVERYTHING
![Page 39: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/39.jpg)
1. Response time
2. Response size
3. HTTP error type
4. Retries count
5. Failing proxy IP
6. Failing parsing
7. etc.
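The measurements listed above boil down to counters and timings. The class below is an in-memory sketch with hypothetical names; a production crawler would more likely ship these values to a metrics service than keep them in an array.

```php
<?php
// Hypothetical in-memory metrics store: named counters and timing samples.
final class CrawlMetrics
{
    private $counters = [];
    private $timings  = [];

    public function increment(string $name, int $by = 1): void
    {
        $this->counters[$name] = ($this->counters[$name] ?? 0) + $by;
    }

    public function timing(string $name, float $seconds): void
    {
        $this->timings[$name][] = $seconds;
    }

    public function count(string $name): int
    {
        return $this->counters[$name] ?? 0;
    }

    // Mean of the recorded samples, or null when nothing was recorded.
    public function average(string $name): ?float
    {
        $samples = $this->timings[$name] ?? [];
        return $samples ? array_sum($samples) / count($samples) : null;
    }
}

// Usage: record two response times, one HTTP error, and two retries.
$metrics = new CrawlMetrics();
$metrics->timing('response_time', 0.2);
$metrics->timing('response_time', 0.4);
$metrics->increment('http_error.503');
$metrics->increment('retries', 2);
```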
![Page 40: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/40.jpg)
OPTIMIZE AND REPEAT
![Page 41: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/41.jpg)
WEB CRAWLING FOR FUN AND PROFIT
![Page 42: Growing web spiders - VilniusPHP](https://reader033.vdocuments.mx/reader033/viewer/2022052823/5553b535b4c905d9448b4ce4/html5/thumbnails/42.jpg)
THANKS!
Juozas Kaziukėnas
@juokaz