«Scrapy internals» Alexander Sibiryakov, Scrapinghub

Post on 22-Jan-2018

Scrapy internals

Alexander Sibiryakov, 16-17 July 2017, PyConRU 2017

sibiryakov@scrapinghub.com

Talk scope

• Design of a complex asynchronous application,

• Flow-control issues,

• Open source life.

Scrapy: web scraping

• Extraction of structured data,

• Selecting and extracting data from HTML/XML (CSS, XPath, regexps) → Parsel

• Interactive shell,

• Feed exports in JSON, CSV, XML and storing in FTP, S3, local fs,

• Robust encoding support and auto-detection.

Main features

• Extensible: spiders, signals, middlewares, extensions, and pipelines,

• Cookies,

• Robots.txt,

• Form submission,

• Telnet console,

• Graceful shutdown by signal.

Scrapy architecture

Twisted

• Event-driven network programming framework

• Event loop and Deferreds ("promises")

• Protocols and transports:

  • TCP, UDP, SSL, UNIX sockets

  • HTTP, DNS, SMTP/IMAP, IRC

• Cross-platform

Glyph Lefkowitz, creator of Twisted

self._nameResolver = _SimpleResolverComplexifier(resolver)

– Twisted source code

Twisted event loop

https://stackoverflow.com/questions/35111265/how-does-pythons-twisted-reactor-work

https://www.cs.cmu.edu/~adamchik/15-121/lectures/Binary%20Heaps/heaps.html

events: [e1: Event, e2: Event, … eN]

Event: func, args, desired_time

min: O(1)
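
The event list on this slide can be sketched with Python's standard heapq module. This is only an illustration of the idea (pending calls kept in a binary heap ordered by desired_time), not Twisted's actual reactor code; the names TimedCallQueue, call_later, next_time and run_due are made up for the example, and time is passed in explicitly so the behaviour is deterministic.

```python
import heapq
import itertools


class TimedCallQueue:
    """Toy queue of delayed calls ordered by desired_time (hypothetical, not Twisted's API)."""

    def __init__(self):
        self._heap = []                # entries: (desired_time, seq, func, args)
        self._seq = itertools.count()  # tie-breaker so functions are never compared

    def call_later(self, now, delay, func, *args):
        # Inserting into the heap costs O(log n).
        heapq.heappush(self._heap, (now + delay, next(self._seq), func, args))

    def next_time(self):
        # Peeking at the earliest event is O(1): it always sits at index 0.
        return self._heap[0][0] if self._heap else None

    def run_due(self, now):
        # Pop and run every event whose desired_time has passed.
        while self._heap and self._heap[0][0] <= now:
            _, _, func, args = heapq.heappop(self._heap)
            func(*args)


calls = []
queue = TimedCallQueue()
queue.call_later(0.0, 5.0, calls.append, "late")
queue.call_later(0.0, 1.0, calls.append, "early")
queue.run_due(now=2.0)  # only the event scheduled for t=1.0 is due
```

A real event loop would use next_time() to decide how long it may block waiting for network events before the next timed call has to run.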

x86 time sources

• Real Time Clock - absolute time, 1 sec. precision,

• 8254 chip previously,

• HPET (High Precision Event Timer), at least 10 MHz

  • single counter for periodic mode,

  • many for one-shot mode,

  • compares actual timer value and target

• RDTSC/RDTSCP - CPU clock cycles

• Proprietary timers
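
From Python, these hardware sources are visible only through the clocks the operating system exposes. A small sketch with the standard time module shows how to inspect the clock an event loop would typically use for scheduling delays; which hardware timer backs each clock depends on the OS and is not something this snippet can tell.

```python
import time

# The monotonic clock never goes backwards, so it is safe for measuring delays.
info = time.get_clock_info("monotonic")
print(info.implementation, info.resolution, info.adjustable)

# The wall clock ("time") may be changed by NTP or by the administrator.
wall = time.get_clock_info("time")
print(wall.implementation, wall.adjustable)

start = time.monotonic()
elapsed = time.monotonic() - start  # never negative for a monotonic clock
```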

Twisted.Deferred

• callback

• errback

• addCallback, addErrback

• cancel

• addTimeout

• pause/unpause
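
How the callback and errback chains interact can be shown with a toy stand-in. The MiniDeferred class below is an assumption made for illustration: it borrows the method names from the slide, but it is not Twisted's implementation. A real Deferred wraps errors in Failure objects, does not return the final result from callback(), and supports cancel, addTimeout and pause/unpause, none of which are modelled here.

```python
class MiniDeferred:
    """Toy stand-in for a Deferred: an ordered chain of (callback, errback) links."""

    def __init__(self):
        self._chain = []

    def addCallback(self, func):
        self._chain.append((func, None))
        return self

    def addErrback(self, func):
        self._chain.append((None, func))
        return self

    def callback(self, result):
        return self._run(result)

    def errback(self, error):
        return self._run(error)

    def _run(self, current):
        for on_success, on_error in self._chain:
            handler = on_error if isinstance(current, Exception) else on_success
            if handler is None:
                continue  # this link does not handle the current state
            try:
                current = handler(current)
            except Exception as exc:  # a raising handler switches to the error path
                current = exc
        return current


def parse(body):
    return int(body)


def recover(error):
    return -1  # an errback returning a plain value switches back to the callback path


def build():
    return MiniDeferred().addCallback(parse).addErrback(recover).addCallback(lambda n: n * 2)


ok = build().callback("21")        # parse -> 21, errback skipped, doubled
failed = build().callback("oops")  # parse raises, recover -> -1, doubled
```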

Internal components intercommunication

Web agent pipeline

DownloaderSlots:

PROBLEMS

Throttling between internal components

• Downloader,

• Scraper

• Item pipelines (cleansing, validating, dups, storing, ...)

• Feed exports (serialization + disk/network IO)

• ?

Flow control: memory

• Unlimited downloading -> unlimited growth of items from cascading feed pages.

• Maintain a limit on the amount of memory used by Responses in the queue (~5 MB)
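
A memory limit of this kind can be sketched as a simple byte counter. The class and method names below (ResponseBudget, add, done, needs_backout) are hypothetical and chosen for the example; only the roughly 5 MB default comes from the slide.

```python
class ResponseBudget:
    """Tracks bytes held by queued responses (hypothetical helper, not Scrapy's class)."""

    def __init__(self, max_active_size=5_000_000):  # ~5 MB, as on the slide
        self.max_active_size = max_active_size
        self.active_size = 0

    def add(self, body: bytes):
        # A response entered the queue.
        self.active_size += len(body)

    def done(self, body: bytes):
        # A response was fully processed and can be released.
        self.active_size -= len(body)

    def needs_backout(self) -> bool:
        # True means: stop feeding the downloader until processing catches up.
        return self.active_size > self.max_active_size


budget = ResponseBudget(max_active_size=10)  # tiny limit to keep the demo readable
page = b"x" * 8
budget.add(page)
before = budget.needs_backout()  # 8 bytes queued, under the limit
budget.add(page)
during = budget.needs_backout()  # 16 bytes queued, over the limit
budget.done(page)
after = budget.needs_backout()   # back to 8 bytes
```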

Flow control: CPU

Spending more time on callbacks (CPU) than on IO

> reactor.callLater(0.1, d.errback, _failure)

An artificial delay of 100 ms

Summarizing

• concurrent items limits,

• memory consumption limits,

• scheduling of new calls with delays.

If a limit is reached -> don't pick up new requests from the scheduler
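
The rule on this slide (stop pulling from the scheduler once a limit is hit) can be sketched as a small pump function. Everything here is an assumption for illustration: pump, in_flight and max_concurrent are invented names, and the scheduler is modelled as a plain deque rather than Scrapy's scheduler.

```python
from collections import deque


def pump(scheduler, in_flight, max_concurrent):
    """Move requests from the scheduler to in_flight until the limit is reached."""
    started = []
    while scheduler and len(in_flight) < max_concurrent:
        request = scheduler.popleft()
        in_flight.add(request)
        started.append(request)
    return started


scheduler = deque(["a", "b", "c", "d"])
in_flight = set()
first = pump(scheduler, in_flight, max_concurrent=2)    # starts "a" and "b"
blocked = pump(scheduler, in_flight, max_concurrent=2)  # limit reached, nothing starts
in_flight.discard("a")                                  # one download finishes
second = pump(scheduler, in_flight, max_concurrent=2)   # a slot is free, starts "c"
```

Requests that are not picked up simply stay in the scheduler, so the backlog grows there instead of in memory-hungry downstream components.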

It just stopped…

• Why?

• Some Deferred was lost?

• Where?

• How to debug?

No silver bullet.

> self.heartbeat = task.LoopingCall(nextcall.schedule)

+ extensive logging

Design your async application well

Iterations

State diagrams

Questions

Alexander Sibiryakov, Scrapinghub Ltd.,

sibiryakov@scrapinghub.com
