GDG İstanbul February Event - Presentation


DESCRIPTION

The presentation I gave on "Web Crawling" and "Web Scraping" as part of the GDG İstanbul February event (23.02.2013).

TRANSCRIPT

Page 1: GDG İstanbul February Event - Presentation

Web Crawling & Web Scraping

cuneytykaya

cuneyt.yesilkaya

Page 2: GDG İstanbul February Event - Presentation

Cüneyt Yeşilkaya

(Timeline graphic: 2007 … 2010, 2012)

Page 3: GDG İstanbul February Event - Presentation

Agenda

● Web Crawling
● Web Scraping
● Web Crawling Tools
● Demo (Crawler4j & Jsoup)
● Crawling - Where to Use

Page 4: GDG İstanbul February Event - Presentation

Web Crawling

Browsing the World Wide Web in a methodical, automated manner or in an orderly fashion.
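To make that definition concrete (my addition, not from the slides), a crawler is essentially a queue of URLs plus a fetch-and-extract-links loop. The sketch below uses Jsoup, which appears later in the deck; the seed URL, page limit, and delay are assumed values.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class TinyCrawler {
    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();   // URLs waiting to be visited
        Set<String> visited = new HashSet<>();         // URLs already fetched
        frontier.add("http://www.gdgistanbul.com");    // seed URL (assumed)

        while (!frontier.isEmpty() && visited.size() < 20) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;           // skip URLs we have already seen
            Document doc = Jsoup.connect(url).get();   // fetch and parse the page
            for (Element link : doc.select("a[href]")) {
                String next = link.absUrl("href");
                if (next.startsWith("http://www.gdgistanbul.com")) {
                    frontier.add(next);                // stay on the same site
                }
            }
            Thread.sleep(250);                         // be polite between requests
        }
    }
}

Crawler4j, shown later in the deck, packages this same loop with multi-threading, politeness delays, and robots.txt handling.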

Page 5: GDG İstanbul February Event - Presentation

Web Scraping

Computer software technique of extracting information from websites.

Page 6: GDG İstanbul February Event - Presentation

Web Crawling Tools

Page 7: GDG İstanbul February Event - Presentation

Selecting a Crawler?

● Multi-Threaded Structure
● Max Pages to Fetch
● Max Page Size
● Max Depth to Crawl
● Redundant Link Control
● Politeness Time
● Resumable
● Well-Documented
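As a rough sketch (my addition, not on the slide), most of these criteria map directly onto settings of crawler4j's CrawlConfig; the class name and values below are placeholders.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class CriteriaConfig {                  // illustrative holder class, not from the slides
    static CrawlConfig build() {
        CrawlConfig config = new CrawlConfig();
        config.setMaxPagesToFetch(100);        // max pages to fetch
        config.setMaxDownloadSize(1048576);    // max page size, in bytes
        config.setMaxDepthOfCrawling(3);       // max depth to crawl
        config.setPolitenessDelay(250);        // politeness time between requests (ms)
        config.setResumableCrawling(true);     // resume an interrupted crawl
        // the number of threads is chosen when the controller starts:
        // controller.start(myCrawler.class, numberOfCrawlers);
        return config;
    }
}

Redundant-link control (visiting each URL only once) is, as far as I know, handled internally by crawler4j's frontier rather than through CrawlConfig.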

Page 8: GDG İstanbul February Event - Presentation

Crawler4j

Yasser Ganjisaffar

Microsoft Bing & Microsoft Live Search

Page 9: GDG İstanbul February Event - Presentation

Demo - Crawler4j (1/3)

myCrawler.java & myController.java

Page 10: GDG İstanbul February Event - Presentation

Demo - Crawler4j (2/3)

myCrawler.java

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class myCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        // only follow links that stay on the gdgistanbul.com site
        return url.getURL().startsWith("http://www.gdgistanbul.com");
    }

    @Override
    public void visit(Page page) {
        // called for every fetched page; its URL is available here
        String url = page.getWebURL().getURL();
    }
}

Page 11: GDG İstanbul February Event - Presentation

Demo - Crawler4j (3/3)

myController.java

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class myController {
    public static void main(String[] args) throws Exception {
        int numberOfCrawlers = 4;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // required by crawler4j
        config.setPolitenessDelay(250);
        config.setMaxPagesToFetch(100);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.gdgistanbul.com");
        controller.start(myCrawler.class, numberOfCrawlers);
    }
}

Page 12: GDG İstanbul February Event - Presentation

Demo - Jsoup (1/2)

Jsoup: a nice way to do HTML parsing in Java

● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text

Page 13: GDG İstanbul February Event - Presentation

Demo - Jsoup (2/2)

// Independent snippets; each assumes org.jsoup.Jsoup, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element and org.jsoup.select.Elements are imported.

// fetch a page and pick elements with a CSS selector
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

// parse HTML held in a string
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

// DOM traversal: find the links inside a known element
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

// CSS selectors: all anchors with an href, all elements with a src
Elements links = doc.select("a[href]");
Elements media = doc.select("[src]");
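The third bullet from the previous slide, manipulating elements, attributes, and text, is not shown in the demo; below is a minimal sketch of that side of the Jsoup API (my addition; the class name and input HTML are made up).

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupManipulation {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div id=content><a href=/about>About</a></div>");

        Element link = doc.select("#content a").first();
        link.attr("href", "http://www.gdgistanbul.com"); // rewrite an attribute
        link.text("GDG Istanbul");                       // replace the link text
        link.addClass("external");                       // add a CSS class

        System.out.println(doc.body().html());           // print the modified HTML
    }
}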

Page 14: GDG İstanbul February Event - Presentation

Where to Use

● Search Engines (GoogleBot)
● Aggregators
  ○ Data aggregator
  ○ News aggregator
  ○ Review aggregator
  ○ Search aggregator
  ○ Social network aggregation
  ○ Video aggregator
● Kaarun Product Collector

Page 15: GDG İstanbul February Event - Presentation

www.kaarun.com

Page 16: GDG İstanbul February Event - Presentation

All Friends

Page 17: GDG İstanbul February Event - Presentation

Products for each Facebook Like

Page 18: GDG İstanbul February Event - Presentation

cyesilkaya.wordpress.com & @cuneytykaya & tr.linkedin/cuneyt.yesilkaya

Thank you...