scraping with geb

June 1st - 3rd 2016 Copenhagen, Denmark SCRAPING WITH GEB

Upload: gr8conf

Post on 16-Jan-2017

SERGIO DEL AMO @SDELAMO

IOS APP APP STORE

GR8CONF AGENDA GOOGLE PLAY

GROOVYCALAMARI.COM

A “weekly” curated email newsletter full of interesting, relevant links about the Groovy Ecosystem

GEB

HTTP://GEBISH.ORG

http://www.webbotsspidersscreenscrapers.com

WHAT CAN YOU DO?

▸ PRICE-MONITORING WEBBOTS

▸ IMAGE-CAPTURING WEBBOTS

▸ LINK VERIFICATION WEBBOTS

▸ WEBBOTS THAT SEND EMAIL

▸ WEBBOTS THAT CONVERT A WEBSITE INTO AN API

▸ SNIPERS

EXAMPLE 1 CREW SCRAPING

HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF

GR8CONF API

▸ Only for agenda, no sponsors or crew

java -jar gebwebbot_gr8conf-all.jar crew|sponsors destinationFolder outputFilename sqlite|plist|csv phantomjs.binary.path

GRADLE SHADOW

HTTPS://GITHUB.COM/JOHNRENGELMAN/SHADOW

DESIRED OUTPUT

MODEL

TEST YOUR FETCHER

SPOCK - TEST YOUR FETCHER

Define the interesting parts of your pages in a concise, maintainable and extensible manner.

GEB PAGES

DIV.CREW

Modules are reusable definitions of content that can be used across multiple pages.

GEB MODULES

OUTPUT

EXAMPLE 2 SPONSORS SCRAPING

HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF

GR8CONF API

▸ Only for agenda, no sponsors or crew

DIV.SPONSORS

H4

DIV.SPONSOR

GEB EXAMPLE GRADLE

HTTPS://GITHUB.COM/GEB/GEB-EXAMPLE-GRADLE

The following commands will launch the tests with the individual browsers:
./gradlew chromeTest
./gradlew firefoxTest
./gradlew phantomJsTest

To run with all, you can run:
./gradlew test

MARCIN ERDMANN

GEB.CONFIG
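A Geb configuration script for a multi-browser setup like the one above might look roughly like the sketch below. It assumes Geb's usual GebConfig.groovy entries (waiting, environments, driver, baseUrl) and Selenium's Chrome and Firefox drivers on the classpath. The timeout value and the base URL are illustrative assumptions, not settings taken from the talk.

```groovy
// GebConfig.groovy (sketch; all values are illustrative assumptions)
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.firefox.FirefoxDriver

waiting {
    timeout = 5 // default number of seconds to wait for content
}

environments {
    // selected with -Dgeb.env=chrome
    chrome {
        driver = { new ChromeDriver() }
    }
    // selected with -Dgeb.env=firefox
    firefox {
        driver = { new FirefoxDriver() }
    }
}

baseUrl = "http://gebish.org"
```

A PhantomJS environment would presumably follow the same shape, with its own driver closure.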

EXAMPLE 3 PAGINATION

HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS

DYNAMIC URL

http://www.meetup.com/es-ES/madrid-gug/members/49149882/

BASE URL: http://www.meetup.com

MEETUP GROUP SLUG: madrid-gug

MEMBER ID: 28938802
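The way the three parts combine into a member page URL can be sketched in plain Groovy. The helper name memberUrl is hypothetical, and the /members/<id>/ path shape is an assumption read off the example URL on the slide, with the locale segment left out.

```groovy
// Hypothetical helper: builds a member page URL from its three parts.
String memberUrl(String baseUrl, String groupSlug, long memberId) {
    // Strip a trailing slash so the parts join with exactly one '/'
    String base = baseUrl.endsWith('/') ? baseUrl[0..-2] : baseUrl
    "${base}/${groupSlug}/members/${memberId}/"
}

def url = memberUrl('http://www.meetup.com', 'madrid-gug', 28938802L)
println url
```

In a Geb page the same idea would normally live in the page's URL definition, so that each member ID yields a different page to visit.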

PAGINATION

.PAGINATION.NAV-NEXT

PAGINATION MODULE

HARVEST AND VISIT

HARVEST LINKS
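The harvest-then-visit idea can be sketched without a browser: first follow the "next" link from page to page and collect every link of interest, then visit the collected links in a second pass. The in-memory site map, the URLs and the harvestLinks helper are all hypothetical stand-ins for what a Geb pagination module would read from the real pages.

```groovy
// In-memory stand-in for a paginated site (hypothetical data):
// each page lists some links and, optionally, the URL of the next page.
def site = [
    '/members?page=1': [links: ['/members/1', '/members/2'], next: '/members?page=2'],
    '/members?page=2': [links: ['/members/3', '/members/4'], next: '/members?page=3'],
    '/members?page=3': [links: ['/members/5'], next: null]
]

// Harvest: follow "next" until there is none, collecting links on the way.
List<String> harvestLinks(Map site, String startUrl) {
    def harvested = []
    def seen = [] as Set      // guards against pagination loops
    String current = startUrl
    while (current && seen.add(current)) {
        def page = site[current]
        if (page == null) break
        harvested.addAll(page.links)
        current = page.next
    }
    harvested
}

def links = harvestLinks(site, '/members?page=1')

// Visit: process each harvested link only after harvesting has finished.
def visited = links.collect { link -> "visited ${link}".toString() }
println visited
```

Separating the two passes keeps the pagination state simple, and the harvested list is what can later be divided between several webbots.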

SPLIT LOAD BETWEEN WEBBOTS

HTTPS://HTTPSTATUSDOGS.COM

[Figure: the IDs 1 to 50 arranged in ten blocks of five consecutive IDs (1-5, 6-10, ..., 46-50)]

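def ids = 1..50
def webbotIndex = 3 // zero-based position of this webbot
def webbotsInParallel = 6
int total = ids.size()
// Round up so the IDs split into at most webbotsInParallel sublists
int sublistsSize = (total + webbotsInParallel - 1).intdiv(webbotsInParallel)
def s = ids.collate(sublistsSize)[webbotIndex] // this webbot's share of the IDs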

USER AGENT SPOOFING

HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT
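At the HTTP level, user agent spoofing means sending a User-Agent request header that differs from the client's default. The sketch below shows this with a plain java.net connection rather than Geb or WebDriver, so it illustrates the idea and is not the code from the repository above. The user-agent string and the example.com URL are assumptions chosen for illustration.

```groovy
// Hypothetical user-agent string; any browser-like value could be used.
String userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36'

// openConnection() only prepares the request; nothing is sent yet.
URLConnection connection = URI.create('http://example.com/').toURL().openConnection()
connection.setRequestProperty('User-Agent', userAgent)

println connection.getRequestProperty('User-Agent')
```

With a browser driven through WebDriver the header is set through the driver's own configuration instead, which depends on the driver in use.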

EXAMPLE 4 HIDDEN CONTENT AND ON MOUSE OVER EVENTS

FAILS: HIDDEN CONTENT

CALL A JS METHOD

MOVE TO ELEMENT

INCLUDE LIBRARY

UI INTERACTION

KEYBOARD

SLIDERS

STEALTH MEANS SIMULATING HUMAN PATTERNS

▸ BE KIND TO YOUR RESOURCES

▸ RUN YOUR WEBBOTS DURING BUSY HOURS

▸ DON’T RUN YOUR WEBBOTS AT THE SAME TIME EACH DAY

▸ DON’T RUN YOUR WEBBOT ON HOLIDAYS AND WEEKENDS

▸ USE RANDOM, INTRA-FETCH DELAYS
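Random intra-fetch delays can be sketched as a small helper that picks a different pause before every request. The helper name randomDelayMillis, the 2 to 8 second range and the URL list are assumptions for illustration. The sketch only computes the pauses; a real webbot would sleep for each one before fetching.

```groovy
// Returns a random pause between minMillis and maxMillis, inclusive.
long randomDelayMillis(long minMillis, long maxMillis, Random random = new Random()) {
    assert minMillis >= 0 && maxMillis >= minMillis
    minMillis + (long) (random.nextDouble() * (maxMillis - minMillis + 1))
}

// Hypothetical list of pages to fetch
def urls = ['/page/1', '/page/2', '/page/3']
def delays = []
urls.each { url ->
    long delay = randomDelayMillis(2000, 8000)
    delays << delay
    // A real webbot would call sleep(delay) here before fetching url
}
println delays
```

The same helper could also shift the daily start time by a random offset, so that runs do not begin at the same time each day.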

SIMULATE HUMAN CLICK RHYTHM

?