How to Scrape Any Website's Content Using Scrapy


DESCRIPTION

A tutorial on how to scrape (crawl) a website's content using the Scrapy Python framework.

TRANSCRIPT

Page 1: How to Scrape Any Website's Content Using Scrapy

HOW TO SCRAPE ANY WEBSITE

FOR FUN ;)

by Anton Rifco ([email protected])

Some pictures were taken from the Internet. This article carries no copyright; use it for your own purposes.

July 2013

Page 2: How to Scrape Any Website's Content Using Scrapy

Let’s scrape this :)

Page 3: How to Scrape Any Website's Content Using Scrapy

Web scraping: the process of automatically collecting (stealing?) information from the Internet.

Page 4: How to Scrape Any Website's Content Using Scrapy

THE TOOLS

You need these tools to steal (oops!) that data:

Python (2.6 or 2.7) with some packages*
The Scrapy** framework
Google Chrome with an XPath*** plugin
A computer, of course
And a functional brain

*) http://doc.scrapy.org/en/latest/intro/install.html#requirements
**) refer to http://scrapy.org/ (these slides won't cover the installation of those things)
***) I use the "XPath helper" plugin

Page 5: How to Scrape Any Website's Content Using Scrapy

S C R A P Y
Not Crappy

Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archival.

Page 6: How to Scrape Any Website's Content Using Scrapy

S C R A P Y
Not Crappy

Scrapy works by creating logical spiders that crawl any website you like. You define the logic of each spider in Python.

To extract data, Scrapy uses a mechanism based on XPath expressions, called XPath selectors.
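
To give a feel for the shape of a spider before the real one later in this deck, here is a minimal sketch against the 2013-era Scrapy API used in these slides (the spider name, URL, and XPath are made up for illustration):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MinimalSpider(BaseSpider):
    name = "minimal"                     ## run with: scrapy crawl minimal
    start_urls = ["http://example.com"]  ## hypothetical starting page

    def parse(self, response):
        ## wrap the downloaded page in an XPath selector
        hxs = HtmlXPathSelector(response)
        ## extract the text of every <h1> on the page
        for title in hxs.select("//h1/text()").extract():
            print title.strip()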

Page 7: How to Scrape Any Website's Content Using Scrapy

X P A T H

XPath is the W3C standard for navigating through XML documents (and, by extension, HTML).

Here, XML documents are treated as trees of nodes. The topmost element of the tree is called the root element.

For more, refer to: http://www.w3schools.com/xpath/

Page 8: How to Scrape Any Website's Content Using Scrapy

X P A T H

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>

Examples of nodes in the XML document above:

<bookstore> (root element node)

<author>J K. Rowling</author> (element node)

lang="en" (attribute node)

For more, refer to: http://www.w3schools.com/xpath/

Page 9: How to Scrape Any Website's Content Using Scrapy

X P A T H

Selecting Nodes

XPath uses path expressions to select nodes in an XML document. The node is selected by following a path or steps.

For more, refer to: http://www.w3schools.com/xpath/

Expression   Result
nodename     Selects all nodes with the name "nodename"
/            Does selection from the root
//           Does selection from the current node, matching nodes no matter where they are
.            Selects the current node
..           Selects the parent node
@attr        Selects attributes of nodes
text()       Selects the text value of the chosen node
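
To try these expressions, you can evaluate them against the bookstore document from the previous page. A quick sandbox, using Python's lxml library (not part of this deck's toolchain, just a convenient way to experiment):

from lxml import etree

doc = etree.fromstring(
    "<bookstore><book><title lang='en'>Harry Potter</title>"
    "<author>J K. Rowling</author><year>2005</year>"
    "<price>29.99</price></book></bookstore>")

print doc.xpath("/bookstore/book/title/text()")[0]  ## Harry Potter
print doc.xpath("//title/@lang")[0]                 ## en
print doc.xpath("//book/price/text()")[0]           ## 29.99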

Page 10: How to Scrape Any Website's Content Using Scrapy

X P A T H

Predicate Expressions

Predicates are used to find a specific node, or a node that contains a specific value. Predicates are always embedded in square brackets.

Expression                     Result
/bookstore/book[1]             Selects the first book element that is the child of the bookstore element
/bookstore/book[last()]        Selects the last book element that is the child of the bookstore element
/bookstore/book[last()-1]      Selects the last but one book element that is the child of the bookstore element
/bookstore/book[position()<3]  Selects the first two book elements that are children of the bookstore element
//title[@lang]                 Selects all the title elements that have an attribute named lang
//title[@lang='eng']           Selects all the title elements that have an attribute named lang with a value of 'eng'
/bookstore/book[price>35.00]   Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00
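
These predicates are just as easy to test with lxml. A sketch with a made-up three-book store, so the predicates have something to filter (titles, languages, and prices are invented for illustration):

from lxml import etree

doc = etree.fromstring(
    "<bookstore>"
    "<book><title lang='eng'>Book A</title><price>30.00</price></book>"
    "<book><title lang='eng'>Book B</title><price>40.00</price></book>"
    "<book><title lang='de'>Book C</title><price>50.00</price></book>"
    "</bookstore>")

print ", ".join(doc.xpath("/bookstore/book[1]/title/text()"))             ## Book A
print ", ".join(doc.xpath("/bookstore/book[last()]/title/text()"))        ## Book C
print ", ".join(doc.xpath("/bookstore/book[position()<3]/title/text()"))  ## Book A, Book B
print ", ".join(doc.xpath("//title[@lang='eng']/text()"))                 ## Book A, Book B
print ", ".join(doc.xpath("/bookstore/book[price>35.00]/title/text()"))   ## Book B, Book C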

Page 11: How to Scrape Any Website's Content Using Scrapy

X P A T H H E L P E R

By using XPath Helper, you can easily get the XPath expression of a given node in an HTML document. It is enabled by pressing <Ctrl>+<Shift>+X in Chrome.

Page 12: How to Scrape Any Website's Content Using Scrapy

R E A L A C T I O N

Create the Scrapy "comesg" project:

> scrapy startproject comesg

Then, it will create the following project directory structure:

comesg/              /* This is the project root */
    scrapy.cfg       /* Project config file */
    comesg/
        __init__.py
        items.py     /* Definition of Items to scrape */
        pipelines.py /* Pipeline config, for advanced use */
        settings.py  /* Advanced settings file */
        spiders/     /* Directory to put spider files */
            __init__.py
            ...

Page 13: How to Scrape Any Website's Content Using Scrapy

R E A L A C T I O N

Define the Information Items that we want to scrape.

Clicking any of the places will open its details.

So, of all that data, we want to collect: the name of the place, photo, description, address (if any), contact number (if any), opening hours (if any), website (if any), and video (if any).

Page 14: How to Scrape Any Website's Content Using Scrapy

I t e m s D e f i n i t i o n

In items.py, write the following:

from scrapy.item import Item, Field

class ComesgItem(Item):
    # define the fields for your item here, like:
    name = Field()
    photo = Field()
    desc = Field()
    address = Field()
    contact = Field()
    hours = Field()
    website = Field()
    video = Field()

Page 15: How to Scrape Any Website's Content Using Scrapy

R E A L A C T I O N

Basically, here is our strategy:

1. Implement a first spider that gets the URLs of the listed items
2. Crawl those URLs one by one
3. Implement a second spider that fetches all the required data

Page 16: How to Scrape Any Website's Content Using Scrapy

F i r s t s p i d e r

## imports needed by this spider (added here; the original slide omits them),
## using the 2013-era Scrapy module paths
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log

class AttractionSpider(CrawlSpider):
    name = "get-attraction"
    allowed_domains = ["comesingapore.com"]  ## will never go outside the playground
    start_urls = [  ## starting URL
        "http://comesingapore.com/travel-guide/category/285/attractions"
    ]
    rules = ()

    def __init__(self, name=None, **kwargs):
        super(AttractionSpider, self).__init__(name, **kwargs)
        self.items_buffer = {}
        self.base_url = "http://comesingapore.com"
        from scrapy.conf import settings
        settings.overrides['DOWNLOAD_TIMEOUT'] = 360  ## prevent too-early timeouts

    def parse(self, response):
        print "Start scraping Attractions...."
        try:
            hxs = HtmlXPathSelector(response)
            ## XPath expression to get the URLs of the item detail pages
            links = hxs.select("//*[@id='content']//a[@style='color:black']/@href")

            if not links:
                log.msg("No data to scrape")
                return

            for link in links:
                v_url = ''.join(link.extract())

                if not v_url:
                    continue
                else:
                    ## valid URL: crawl it; the real work is handled
                    ## by the second spider (next page)
                    _url = self.base_url + v_url
                    yield Request(url=_url, callback=self.parse_details)
        except Exception as e:
            log.msg("Parsing failed for URL {%s}" % format(response.request.url))

Page 17: How to Scrape Any Website's Content Using Scrapy

S e c o n d s p i d e r

## method of AttractionSpider, continuing the class from the previous page
def parse_details(self, response):
    print "Start scraping Detailed Info...."
    try:
        hxs = HtmlXPathSelector(response)
        l_venue = ComesgItem()

        ## the venue name sits in a slightly different <div> on some pages,
        ## so fall back to a second XPath if the first one finds nothing
        v_name = hxs.select("/html/body/div[@id='wrapper']/div[@id='page']/div[@id='page-bgtop']/div[@id='page-bgbtm']/div[@id='content']/div[3]/h1/text()").extract()
        if not v_name:
            v_name = hxs.select("/html/body/div[@id='wrapper']/div[@id='page']/div[@id='page-bgtop']/div[@id='page-bgbtm']/div[@id='content']/div[2]/h1/text()").extract()

        l_venue["name"] = v_name[0].strip()

        ## locate the details block, skipping filler <div>s that some pages insert
        base = hxs.select("//*[@id='content']/div[7]")
        if base.extract()[0].strip() == "<div style=\"clear:both\"></div>":
            base = hxs.select("//*[@id='content']/div[8]")
        elif base.extract()[0].strip() == "<div style=\"padding-top:10px;margin-top:10px;border-top:1px dotted #DDD;\">\n You must be logged in to add a tip\n </div>":
            base = hxs.select("//*[@id='content']/div[6]")

        ## the <b> labels and their text values arrive as parallel lists
        x_datas = base.select("div[1]/b").extract()
        v_datas = base.select("div[1]/text()").extract()
        i_d = 0
        if x_datas:
            for x_data in x_datas:
                print "data is: " + x_data.strip()
                if x_data.strip() == "<b>Address:</b>":
                    l_venue["address"] = v_datas[i_d].strip()
                if x_data.strip() == "<b>Contact:</b>":
                    l_venue["contact"] = v_datas[i_d].strip()
                if x_data.strip() == "<b>Operating Hours:</b>":
                    l_venue["hours"] = v_datas[i_d].strip()
                if x_data.strip() == "<b>Website:</b>":
                    l_venue["website"] = (base.select("div[1]/a/@href").extract())[0].strip()
                i_d += 1

        v_photo = base.select("img/@src").extract()
        if v_photo:
            l_venue["photo"] = v_photo[0].strip()

        v_desc = base.select("div[3]/text()").extract()
        if v_desc:
            desc = ""
            for dsc in v_desc:
                desc += dsc
            l_venue["desc"] = desc.strip()

        ## the slide cuts off here; presumably the item is yielded and
        ## failures are logged, as in the first spider
        yield l_venue
    except Exception as e:
        log.msg("Parsing failed for URL {%s}" % format(response.request.url))

Page 18: How to Scrape Any Website's Content Using Scrapy

R E A L A C T I O N

Run the project:

> scrapy crawl get-attraction -t csv -o attr.csv

In the end, it produces the file attr.csv with the scraped data, like the following:

> head -3 attr.csv
website,name,photo,hours,contact,video,address,desc

http://www.tigerlive.com.sg,TigerLIVE,http://tn.comesingapore.com/img/others/240x240/f/6/0000246.jpg,Daily from 11am to 8pm (Last admission at 6.30pm).,(+65) 6270 7676,,"St. James Power Station, 3 Sentosa Gateway, Singapore 098544",

http://www.zoo.com.sg,Singapore Zoo,http://tn.comesingapore.com/img/others/240x240/6/2/0000098.jpg,Daily from 8.30am - 6pm (Last ticket sale at 5.30pm),(+65) 6269 3411,http://www.youtube.com/embed/p4jgx4yNY9I,"80 Mandai Lake Road, Singapore 729826","See exotic and endangered animals up close in their natural habitats in the . Voted the best attraction in Singapore on Trip Advisor, and considered one of the best zoos in the world, this attraction is a must see, housing over 2500 mammals, birds and reptiles.
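
If you want to post-process the scraped rows, the CSV loads straight into Python. A small sketch (the attr.csv filename is the one produced by the run above):

import csv

with open("attr.csv") as f:
    for row in csv.DictReader(f):
        print row["name"], "->", row["website"]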

Page 19: How to Scrape Any Website's Content Using Scrapy

Get the complete Project code @ https://github.com/antonrifco/comesg

Page 20: How to Scrape Any Website's Content Using Scrapy

THANK YOU!

- Anton -
