Web Crawl with Elixir

Post on 05-Apr-2017

Who am I?

• Jechol Lee

• Software engineer at Skelterlabs

• Loves Elixir, Elm, Ruby

• We are HIRING!

First try

[Diagram: Site A → HTTP → Crawler (JSON / HTML) → Products → SaveQueue → Database (SQL, ES)]

[Diagram: Task (Page) → Task (Item), Task (Item); labels: product, req]

[Diagram: Sup → Sup(site A), Sup(site B); Sup(simple_one_for_one); Task.Sup, Task.Sup; labels: progress, start_link]

Crawler speed > Site rate limit

Crawler → HTTP (10/s) → Website

BLOCKED

DB speed < Crawler speed

Crawler (200/s) → SaveQueue → Database (100/s)

Memory exhaustion

SaveQueue: Out of Memory
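The rates on this slide make the growth easy to estimate. A minimal sketch, assuming constant rates and an unbounded queue (the module and function names are hypothetical, not from the talk):

```elixir
defmodule BacklogSketch do
  # Number of items waiting in the queue after `seconds`, when items arrive
  # at `in_rate` per second and are saved at `out_rate` per second.
  # Assumes constant rates and an unbounded queue.
  def backlog(in_rate, out_rate, seconds) do
    max(in_rate - out_rate, 0) * seconds
  end
end

# Crawler at 200/s, database at 100/s, after one minute:
IO.inspect(BacklogSketch.backlog(200, 100, 60))
# prints 6000
```

The backlog grows by 100 items every second and never shrinks, so memory use is bounded only by how long the crawl runs.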

Solution

OOM Problem : Producer-driven

Crawler (200/s) → SaveQueue → Database (100/s)

GenStage : Demand-driven

Elixir GenStage (2016 / 7)

Crawler ← DEMAND 100 ← SaveQueue

Crawler → 100 → SaveQueue → Database (100/s)
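The demand-driven idea can be sketched with plain processes. This is not the GenStage API; it only mimics the "ask for N, receive at most N" exchange that GenStage formalizes. The module name, message shapes, and the 250-item workload are assumptions for illustration.

```elixir
defmodule DemandSketch do
  # Producer: holds pending items and emits only as many as were asked for.
  def producer(items) do
    receive do
      {:demand, consumer, n} ->
        {batch, rest} = Enum.split(items, n)
        send(consumer, {:events, batch})
        producer(rest)
    end
  end

  # Consumer: asks for at most `max_demand` items, "saves" them, then asks
  # again. Returns everything it saved once the producer has nothing left.
  def consume(producer, max_demand, saved \\ []) do
    send(producer, {:demand, self(), max_demand})

    receive do
      {:events, []} -> Enum.reverse(saved)
      {:events, batch} -> consume(producer, max_demand, Enum.reverse(batch, saved))
    end
  end
end

producer = spawn(fn -> DemandSketch.producer(Enum.to_list(1..250)) end)
saved = DemandSketch.consume(producer, 100)
IO.inspect(length(saved))
# prints 250
```

Because the consumer never has more than one batch of 100 in flight, its memory use stays bounded no matter how fast the producer could run.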

[Chart: Memory usage, Producer-driven vs Demand-driven]

Site rate limit : TokenBucket

Crawler → (burst) → TokenBucket → HTTP (60/min) → Website
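A token bucket can be sketched as a small pure data structure. The 60/min refill rate comes from the slide; the burst capacity of 5, the module name, and the explicit `now_ms` clock argument are assumptions made so the example is deterministic.

```elixir
defmodule TokenBucketSketch do
  # capacity: largest burst allowed; rate: tokens refilled per millisecond.
  defstruct [:capacity, :rate, :tokens, :updated_at]

  def new(capacity, per_minute, now_ms) do
    %__MODULE__{
      capacity: capacity,
      rate: per_minute / 60_000,
      tokens: capacity * 1.0,
      updated_at: now_ms
    }
  end

  # Returns {:ok, bucket} if a token was taken, or {:wait, ms, bucket}
  # with the time until the next token becomes available.
  def take(%__MODULE__{} = bucket, now_ms) do
    elapsed = now_ms - bucket.updated_at
    tokens = min(bucket.capacity * 1.0, bucket.tokens + elapsed * bucket.rate)
    bucket = %{bucket | tokens: tokens, updated_at: now_ms}

    if tokens >= 1.0 do
      {:ok, %{bucket | tokens: tokens - 1.0}}
    else
      {:wait, ceil((1.0 - tokens) / bucket.rate), bucket}
    end
  end
end

# A burst of 6 requests at time 0 against a bucket holding 5 tokens.
bucket = TokenBucketSketch.new(5, 60, 0)

{results, bucket} =
  Enum.map_reduce(1..6, bucket, fn _i, b ->
    case TokenBucketSketch.take(b, 0) do
      {:ok, b2} -> {:sent, b2}
      {:wait, _ms, b2} -> {:held, b2}
    end
  end)

IO.inspect(results)
# prints [:sent, :sent, :sent, :sent, :sent, :held]
```

In a crawler this state would typically live inside a single process, so that every request to one site passes through the same bucket.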

Network Requests Overflow

[Animation: the pending request queue fills with a mix of page requests (Page 1 to Page 4) and item requests (Page 1 item, Page 2 item, Page 3 item)]

Can't depend on random processing order.

Priority Queue

[Diagram: pending requests: Page 1 item (×3), Page 2 item (×3), Page 3]
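One way to stop depending on arrival order is to give every request an explicit priority. A minimal sketch using Erlang's `:gb_sets`, assuming that item requests should be served before further page requests (the slides do not state the exact rule); the module name and URLs are hypothetical.

```elixir
defmodule PriorityQueueSketch do
  # Lower number is served first.
  # Assumption: item requests outrank page requests.
  @priorities %{item: 0, page: 1}

  # The queue is {ordered set, next sequence number}; the sequence number
  # keeps requests of equal priority in arrival order.
  def new, do: {:gb_sets.new(), 0}

  def push({set, seq}, {kind, _url} = request) do
    key = {Map.fetch!(@priorities, kind), seq, request}
    {:gb_sets.add(key, set), seq + 1}
  end

  def pop({set, seq}) do
    if :gb_sets.is_empty(set) do
      :empty
    else
      {{_priority, _seq, request}, rest} = :gb_sets.take_smallest(set)
      {request, {rest, seq}}
    end
  end

  # Pops everything, returning requests in the order they would be served.
  def drain(queue, acc \\ []) do
    case pop(queue) do
      :empty -> Enum.reverse(acc)
      {request, rest} -> drain(rest, [request | acc])
    end
  end
end

queue =
  [page: "/list?page=2", item: "/item/1", page: "/list?page=3", item: "/item/2"]
  |> Enum.reduce(PriorityQueueSketch.new(), &PriorityQueueSketch.push(&2, &1))

served = PriorityQueueSketch.drain(queue)
IO.inspect(served)
# prints [item: "/item/1", item: "/item/2", page: "/list?page=2", page: "/list?page=3"]
```

Serving items first means each fetched page is fully processed before the next page adds more work to the queue.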

Revised Architecture

Existing Architecture

[Diagram: HTMLCrawler (C21), JSONCrawler (Lego) → SaveQueue → SQL, ES]

Demand-driven

[Diagram: SaveQueue sends DEMAND to HTMLCrawler (C21) and JSONCrawler (Lego); each crawler sends PRODUCT back; SaveQueue → SQL, ES]

Rate limit by TokenBucket

[Diagram: HTMLCrawler (C21), JSONCrawler (Lego) → WebProxy (PriorityQueue + TokenBucket) → HTTP → twotap.com, c21stores.com; SPAWN → HTML Parser (C21), HTML Parser (C21)]

Error monitoring

[Diagram: ErrorMonitor MONITORs HTMLCrawler (C21); {:DOWN, :page_not_found} → ErrorMonitor → sentry.io]
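The monitoring pattern on this slide can be sketched with `Process.monitor/1`. The slide abbreviates the message; the full form delivered by the runtime is `{:DOWN, ref, :process, pid, reason}`. Reporting to sentry.io is replaced here by a callback function, and the module name and the fake crawler process are assumptions for illustration.

```elixir
defmodule ErrorMonitorSketch do
  use GenServer

  # `report_fun` stands in for whatever forwards errors to an external service.
  def start_link(report_fun), do: GenServer.start_link(__MODULE__, report_fun)

  def watch(monitor, pid), do: GenServer.call(monitor, {:watch, pid})

  @impl true
  def init(report_fun), do: {:ok, report_fun}

  @impl true
  def handle_call({:watch, pid}, _from, report_fun) do
    Process.monitor(pid)
    {:reply, :ok, report_fun}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, pid, reason}, report_fun) do
    # Normal exits are not errors; everything else is reported.
    if reason != :normal, do: report_fun.(pid, reason)
    {:noreply, report_fun}
  end
end

test_pid = self()

{:ok, monitor} =
  ErrorMonitorSketch.start_link(fn pid, reason ->
    send(test_pid, {:reported, pid, reason})
  end)

# A fake crawler that exits with :page_not_found when told to fetch.
crawler =
  spawn(fn ->
    receive do
      :fetch -> exit(:page_not_found)
    end
  end)

:ok = ErrorMonitorSketch.watch(monitor, crawler)
send(crawler, :fetch)

reported =
  receive do
    {:reported, ^crawler, reason} -> reason
  after
    1_000 -> :timeout
  end

IO.inspect(reported)
# prints :page_not_found
```

A monitor, unlike a link, is one-directional: the crawler's crash is observed without taking the monitoring process down with it.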

Fault-tolerance by Supervision Tree

[Diagram: Supervisor → Supervisor (C21) → TaskSupervisor, ErrorMonitor, WebProxy (PriorityQueue + TokenBucket), HTML Parser (C21), HTML Parser (C21), SaveQueue, HTMLCrawler (C21), JSONCrawler (Lego)]
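A nested supervision tree of this shape can be sketched with the standard `Supervisor` module. The workers below are placeholder Agents, and the child ids, the `:one_for_one` strategy, and the exact placement of each child are assumptions; the slides only show the component names.

```elixir
defmodule SiteTreeSketch do
  # Hypothetical stand-in worker: an Agent holding an empty list.
  defp worker(id), do: Supervisor.child_spec({Agent, fn -> [] end}, id: id)

  # Child spec for one per-site supervisor and its children.
  def site_supervisor(site) do
    children = [
      {Task.Supervisor, []},
      worker(:error_monitor),
      worker(:web_proxy),
      worker(:save_queue),
      worker(:html_crawler)
    ]

    %{
      id: site,
      start: {Supervisor, :start_link, [children, [strategy: :one_for_one]]},
      type: :supervisor
    }
  end

  # Top-level supervisor with one child supervisor per site.
  def start_link(sites) do
    Supervisor.start_link(Enum.map(sites, &site_supervisor/1), strategy: :one_for_one)
  end
end

{:ok, root} = SiteTreeSketch.start_link([:c21, :lego])

{:c21, c21_sup, :supervisor, _} =
  List.keyfind(Supervisor.which_children(root), :c21, 0)

{:save_queue, old_pid, :worker, _} =
  List.keyfind(Supervisor.which_children(c21_sup), :save_queue, 0)

# Kill one worker and give its supervisor a moment to restart it.
Process.exit(old_pid, :kill)
Process.sleep(50)

{:save_queue, new_pid, :worker, _} =
  List.keyfind(Supervisor.which_children(c21_sup), :save_queue, 0)

IO.inspect(new_pid != old_pid)
```

With `:one_for_one`, only the crashed child is restarted; its siblings and the other site's subtree keep running.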

Tree for Multiple Crawlers

[Diagram: Supervisor → JSONCrawler (GNC), Supervisor (JCPenney), Supervisor (C21) → TaskSupervisor, ErrorMonitor, WebProxy (PriorityQueue + TokenBucket), HTML Parser (C21), HTML Parser (C21), SaveQueue, HTMLCrawler (C21), JSONCrawler (Lego)]

Final

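[Diagram: the full system, combining the previous slides.
Supervision: Supervisor → JSONCrawler (GNC), Supervisor (JCPenney), Supervisor (C21) → TaskSupervisor, ErrorMonitor.
Error monitoring: ErrorMonitor MONITORs the crawlers; {:DOWN, :page_not_found} → sentry.io.
Requests: HTMLCrawler (C21), JSONCrawler (Lego) → WebProxy (PriorityQueue + TokenBucket) → HTTP → twotap.com, c21stores.com; SPAWN → HTML Parser (C21), HTML Parser (C21).
Saving: SaveQueue sends DEMAND, crawlers send PRODUCT; SaveQueue → SQL, ES.]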

Building Blocks

GenStage

Task

Task.Supervisor

GenServer

Supervisor

Agent

GenServer vs Task

• Tasks don't provide services. → No handle_call, etc.

• Just run a function and exit.
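The contrast can be sketched side by side: a GenServer stays alive and answers calls, while a Task runs one function and exits. The `CounterServer` module is a hypothetical example, not part of the crawler.

```elixir
defmodule CounterServer do
  use GenServer

  @impl true
  def init(start), do: {:ok, start}

  # A service: keeps state and answers requests for as long as it runs.
  @impl true
  def handle_call(:next, _from, n), do: {:reply, n, n + 1}
end

{:ok, server} = GenServer.start_link(CounterServer, 0)
first = GenServer.call(server, :next)
second = GenServer.call(server, :next)

# A task: runs one function, hands back the result, and exits.
task = Task.async(fn -> 21 * 2 end)
result = Task.await(task)

# Wait for the task process to be gone before checking.
ref = Process.monitor(task.pid)

task_exited =
  receive do
    {:DOWN, ^ref, :process, _pid, _reason} -> true
  after
    1_000 -> false
  end

IO.inspect({first, second, result, task_exited, Process.alive?(server)})
# prints {0, 1, 42, true, true}
```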

Task.Supervisor.async

• Exits are not trapped.

• The caller process dies together with the task.

• The task is not restarted.

Task.async vs Task.Supervisor.async

Only the latter builds a supervision relationship, so the task is visible in observer.
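The difference shows up when listing a task supervisor's children. A minimal sketch, in which both tasks wait for a `:go` message so they are still alive when inspected (the `gate` function is a hypothetical helper):

```elixir
{:ok, sup} = Task.Supervisor.start_link()

# Both tasks block until they receive :go, so they stay alive for inspection.
gate = fn ->
  receive do
    :go -> :done
  end
end

plain = Task.async(gate)
supervised = Task.Supervisor.async(sup, gate)

# Only the supervised task is a child of the Task.Supervisor.
children = Task.Supervisor.children(sup)
IO.inspect({supervised.pid in children, plain.pid in children})
# prints {true, false}

send(plain.pid, :go)
send(supervised.pid, :go)
results = [Task.await(plain), Task.await(supervised)]
IO.inspect(results)
# prints [:done, :done]
```

Both tasks are linked to the caller and awaited the same way; the supervised one additionally appears under its supervisor in the process tree.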

End