Call For Tender Discovery

Zhen Zheng

Upload: hector-chase

Post on 16-Jan-2016


Page 1: Call For Tender Discovery Zhen Zheng. IR on the Web Crawlers parallel crawler intelligent crawler Domain Specific Web Searching (CFT.) Development tools

Call For Tender Discovery

Zhen Zheng

Page 2

• IR on the Web
• Crawlers
– parallel crawler
– intelligent crawler
• Domain Specific Web Searching (CFT.)
• Development tools
• References

Page 3

IR on the Web

CFT. = Tenders search on the web = IR on the Web

[Diagram: Web pages, Web Crawlers, Document Processor, Indexed files, Query, Query Processor, Search & match, Page ranking, Responses, Browse]

Page 4

parallel crawler

• Partitioning the Web
partition function, crawling modes

• Evaluation Metrics
overlap, coverage, communication overhead
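A minimal sketch of the two ideas on this slide, in Python. It assumes a site-hash partition function (all URLs on one host go to the same crawler process) and uses definitions that are common in the parallel-crawler literature: overlap = (N - I) / I and coverage = I / U, where N is the total number of downloads, I the number of unique pages downloaded, and U the number of pages the crawl is expected to reach. The slide gives no formulas, so these definitions and the names `assign_crawler`, `overlap` and `coverage` are assumptions.

```python
import hashlib
from urllib.parse import urlparse


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Site-hash partition: every URL on the same host maps to one crawler."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers


def overlap(downloads: list[str]) -> float:
    """(N - I) / I: duplicate downloads per unique page (0.0 = no duplicates)."""
    unique = len(set(downloads))
    return (len(downloads) - unique) / unique


def coverage(downloads: list[str], reachable: set[str]) -> float:
    """I / U: fraction of the expected pages that were actually downloaded."""
    return len(set(downloads) & reachable) / len(reachable)
```

URLs passed to `assign_crawler` need a scheme such as `http://`, otherwise `urlparse` leaves the host empty. Communication overhead is not modeled here; it would be counted as the number of URLs exchanged between crawler processes.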

Page 5

intelligent crawler [2]

• Best-first crawling
• Focused crawling is best
i.e. only tender-related sites/pages
• Based on linkage locality
• Based on sibling locality
• Based on URL tokens
e.g. www.city.kingston.on.ca/cityhall/tenders/index.asp
www.orangeville.org/tenders.php
www.tenderscancada.com
contactscanada.gc.ca/en/tender-e.htm
etc.
• Based on HTML tags, like title, meta
• Based on web page content
• Based on page score ….
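One way to combine best-first crawling with the URL-token signal is a priority queue ordered by a URL score. The sketch below is an illustration only: the token weights, `url_score` and the `Frontier` class are hypothetical, and a real crawler would mix in the other signals listed on the slide (locality, tags, content, page score).

```python
import heapq
import re

# Hypothetical token weights; the slide only says URL tokens are a signal.
TOKEN_WEIGHTS = {"tender": 1.0, "tenders": 1.0, "bid": 0.5, "bids": 0.5}


def url_score(url: str) -> float:
    """Sum the weights of the known tokens found in the URL."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return sum(TOKEN_WEIGHTS.get(tok, 0.0) for tok in tokens)


class Frontier:
    """Best-first frontier: pop() returns the highest-scoring queued URL."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._seen: set[str] = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url: str) -> None:
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so the score is stored negated.
        heapq.heappush(self._heap, (-url_score(url), self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

Because the URL is split into whole tokens, a host name that embeds the keyword inside a longer word scores zero; substring matching would be needed to catch those.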

Page 6

Domain Specific Web Searching [1]

Features:
• Easy to apply to heuristic search
• Use meta search engine
• Implement on the fly
• Collaborative, parallel search
• Apply to intelligent agent technologies etc.
• Reduce the storage of downloaded web pages
• Objective: find every possible result (completeness)

Page 7

Domain Specific Web Searching (CFT.)

For tenders:
• Use a meta search engine at first, e.g. Google API, Yahoo, MSN etc.
• Geographic category, such as “** city tenders”
• Use “Search” when crawling
• Refine keywords, e.g. tender, tenders, city tenders etc.
• Auto-fill form
• Authority page
• Hub page
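The geographic and keyword-refinement bullets amount to generating a family of queries to send to the meta search engine. A small Python sketch, assuming the place names come from some external list (the function name `build_queries` and the sample inputs are illustrative, not from the slides):

```python
from itertools import product


def build_queries(places: list[str], keywords: list[str]) -> list[str]:
    """Combine each place with each keyword, e.g. 'Kingston city tenders'."""
    return [f"{place} {keyword}" for place, keyword in product(places, keywords)]


queries = build_queries(
    ["Kingston", "Orangeville"],
    ["tender", "tenders", "city tenders"],
)
```

Each generated query would then be submitted to the search engines, and the returned URLs become seeds for the crawler.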

Page 8

Prototype:

[Flowchart: Query / Seed URL, Start Crawling, Search Form? (yes / no), URLs, Weight & Order, Focus Crawling]

Page 9

• Why fill search form first?
– Fast
– Direct

Most general web sites provide a search function, such as:

yahoo, google, msn, altavista …

www.stjohns.ca; www.cityofkingston.ca; http://www.city.whitehorse.yk.ca …
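Before a form can be filled, the crawler has to notice that a page offers one. The sketch below uses Python's standard `html.parser` to list forms that contain a named text field; treating any such form as a candidate search form is an assumption made here, and the names `SearchFormFinder` and `find_search_forms` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class SearchFormFinder(HTMLParser):
    """Records the action and text-field names of forms with a text input."""

    def __init__(self) -> None:
        super().__init__()
        self.forms: list[dict] = []
        self._current: Optional[dict] = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = {"action": a.get("action") or "", "fields": []}
        elif tag == "input" and self._current is not None:
            # An <input> without a type attribute defaults to a text field.
            kind = (a.get("type") or "text").lower()
            if kind in ("text", "search") and a.get("name"):
                self._current["fields"].append(a["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            if self._current["fields"]:
                self.forms.append(self._current)
            self._current = None


def find_search_forms(html: str) -> list[dict]:
    finder = SearchFormFinder()
    finder.feed(html)
    finder.close()
    return finder.forms
```

A crawler could submit a keyword such as "tenders" through the first field of each form found, and fall back to ordinary link crawling when the list is empty.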

Page 13

• Why focused crawling?
– Accurate
– Complete

Shrink query response ---> filter ---> accurate

Enlarge query terms ---> query modification ---> complete

classification [5] [6]

hyperlink analysis [7]

For example: www.merx.ca

Page 17

• Discuss

– Data page

– Hub page

Data page: term frequency, compound term frequency.

such as: tender, tender / closing date …
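The data-page signal can be sketched as a relative frequency for single terms and for compound terms (phrases). This Python example is one possible reading of the slide; the tokenization rule and the function name `term_frequencies` are assumptions.

```python
import re


def term_frequencies(text: str, terms: list[str]) -> dict[str, float]:
    """Occurrences of each term (word or phrase) divided by total word count."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = len(words)
    freqs: dict[str, float] = {}
    for term in terms:
        parts = term.lower().split()
        n = len(parts)
        # Slide a window of n words over the text and count exact matches.
        hits = sum(1 for i in range(total - n + 1) if words[i:i + n] == parts)
        freqs[term] = hits / total if total else 0.0
    return freqs
```

A page whose frequencies for terms like "tender" and "closing date" exceed some chosen threshold would be treated as a data page; the threshold itself is not given in the slides.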

Hub page (assumed heuristics):

1) outbound links > 10

2) links / string tokens > 5%

3) language anchor text, such as English, Français
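The three hub-page heuristics can be checked with the standard `html.parser` module. The slide does not say how the rules combine, so this sketch assumes a page is a hub when rules 1 and 2 both hold, or when rule 3 holds (a language splash page). It also counts every `<a href>` as an outbound link. All names are hypothetical.

```python
import re
from html.parser import HTMLParser

LANGUAGE_ANCHORS = {"english", "français", "francais"}


class _LinkCounter(HTMLParser):
    """Counts links and collects page text and anchor text."""

    def __init__(self) -> None:
        super().__init__()
        self.links = 0
        self.anchor_texts: list[str] = []
        self.text_parts: list[str] = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.links += 1
            self._in_anchor = True
            self.anchor_texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._in_anchor:
            self.anchor_texts[-1] += data


def hub_signals(html: str) -> dict[str, bool]:
    parser = _LinkCounter()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"\w+", " ".join(parser.text_parts))
    ratio = parser.links / len(tokens) if tokens else 0.0
    return {
        "many_links": parser.links > 10,   # rule 1
        "link_dense": ratio > 0.05,        # rule 2
        "language_anchor": any(            # rule 3
            t.strip().lower() in LANGUAGE_ANCHORS for t in parser.anchor_texts
        ),
    }


def is_hub(html: str) -> bool:
    # Assumed combination: (rule 1 and rule 2) or rule 3.
    s = hub_signals(html)
    return (s["many_links"] and s["link_dense"]) or s["language_anchor"]
```

Script and style contents are counted as text tokens by this simple parser, so a production version would want to skip them.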

• Snapshot next page

Page 18

• Snapshot

Page 21

Development tools

• HTMLParser (v 1.4)
A super-fast, real-time parser for real-world HTML, notable for its simple design, speed, and ability to handle streaming real-world HTML. Classes include Parser, Scanner, FormTag, InputTag etc. Open source.

• Google API
Create a Google API account and get your Google Web APIs license key. Limits: 1000 queries per day, top 10 results per query. It does not provide the ‘link:’ function.

• HttpUnit
The center of HttpUnit is the WebConversation class, which takes the place of a browser talking to a single site. Classes include WebConversation, WebResponse, WebLink, WebTable, WebForm etc. Open source.

Page 22

Development tools

• Lucene
Jakarta Lucene is a high-performance, full-featured, open source text search engine API written in Java. Classes include Document, Field, IndexWriter, file-based or RAM-based Directory, Analyzers, Query, QueryParser, IndexSearcher, Hits etc.

• WebSPHINX
It is a Java class library and interactive development environment for web crawlers. It consists of the Crawler Workbench and the WebSPHINX class library. It is intended more for personal use, to crawl perhaps only hundreds of web pages; beyond that it can use up memory quickly, because it retains all the pages and links that it has crawled until you clear the crawler.

Page 23

Development tools

• Swish-e
Swish-e is the Simple Web Indexing System for Humans - Enhanced.

It is an efficient, fast search engine written in C, with Perl scripts and bindings.

• JavaCC-htmlparser (Quiotix Corporation)
This is a JavaCC grammar for parsing HTML documents. The parser transforms an input stream into a parse tree; the elements of the parse tree are defined in HtmlDocument. You can then traverse the tree using the Visitor pattern, with classes like HtmlVisitor, HtmlDumper, HtmlCollector, HtmlScrubber.

Page 24

References

[1] “A Method for Indexing Web Pages Using Web Bots” by Boleslaw K. Szymanski and Ming-Shu Chung, January 2002. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, N.Y. 12180-3590, USA.

[2] “Intelligent Crawling on the World Wide Web with Arbitrary Predicates” by Charu C. Aggarwal, Fatima Al-Garawi and Philip S. Yu. IBM T.J. Watson Research Center, Yorktown Heights, NY 10598. WWW10, May 1-5, 2001, Hong Kong. ACM 1-58113-348-0/01/0005.

[3] “Extracting Logical Schema from the Web” by Vincenza Carchiolo et al. University of Catania. Applied Intelligence 18, 341-355, 2003.

[4] “Web Search – Your Way” by Eric J. Glover, Steve Lawrence et al. NEC Research Institute, 4 Independence Way, Princeton, NJ 08540.

Page 25

[5] “Improving Category Specific Web Search by Learning Query Modifications” by Eric J. Glover, Gary W. Flake et al. NEC Research Institute; EECS Department, University of Michigan.

[6] “Web Search Using Automatic Classification” by Chandra Chekuri, Prabhakar Raghavan et al. Computer Science Department, Stanford University; IBM Almaden Research Center.

[7] “Enhanced hypertext categorization using hyperlinks” by Soumen Chakrabarti, Byron Dom, Piotr Indyk. IBM Almaden Research Center; Computer Science Department, Stanford University.

Page 26

-- Linkage locality
-- Sibling locality

In practice, suppose we have a page xyz.html. We can use Google to find its parent pages with link:xyz.html, and then find xyz’s sibling pages by following the links within those parent pages.

[Diagram: linkage and sibling locality; node labels: topic X, X, Y, topic Z, Z ?]
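The sibling lookup described on this slide can be sketched over a link graph that is already in memory. In practice the parent pages would come from a backlink query, as the slide suggests; here the graph is a plain dictionary, and the function name `sibling_pages` and the sample file names other than xyz.html are hypothetical.

```python
def sibling_pages(page: str, out_links: dict[str, set[str]]) -> set[str]:
    """Pages linked from any parent of `page`, excluding `page` itself."""
    # A parent is any page whose out-links include the target page.
    parents = {src for src, targets in out_links.items() if page in targets}
    siblings: set[str] = set()
    for parent in parents:
        siblings |= out_links[parent]
    siblings.discard(page)
    return siblings


graph = {
    "hub.html": {"xyz.html", "abc.html", "def.html"},
    "list.html": {"xyz.html", "jkl.html"},
    "other.html": {"ghi.html"},
}
siblings = sibling_pages("xyz.html", graph)
```

Under sibling locality, if xyz.html is a tender page, the pages in `siblings` are more likely than average to be tender pages too, so a focused crawler would raise their priority.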

Page 33

Thanks!