Web-Scale Crawling with Apache Nutch
TRANSCRIPT
DigitalPebble Ltd
Based in Bristol (UK), specialised in Text Engineering:
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Data Mining
Strong focus on Open Source & the Apache ecosystem: User | Contributor | Committer
– Nutch, SOLR, Lucene
– Tika
– GATE, UIMA
– Mahout
– Behemoth
Outline
– Overview: Features, Data Structures, Use cases
– What's new: Nutch 1.3, Nutch 2.0, GORA
– Conclusion
Nutch?
“Distributed framework for large scale web crawling”
– but does not have to be large scale at all
– or even on the web (file protocol)
Based on Apache Hadoop
Indexing and Search
Open Source – Apache 2.0 License
Short history
2002/2003: started by Doug Cutting & Mike Cafarella
2004: sub-project of Lucene @Apache
2005: MapReduce implementation in Nutch
– 2006: Hadoop sub-project of Lucene @Apache
2006/7: Parser and MimeType in Tika
– 2008: Tika sub-project of Lucene @Apache
May 2010: TLP project at Apache
June 2011 (?): Nutch 1.3
Q4 2011 (?): Nutch 2.0
In a Nutch Shell (1.3)
Step by step:
1) Inject → populates the CrawlDB from the seed list
2) Generate → selects URLs to fetch into a segment
3) Fetch → fetches URLs from the segment
4) Parse → parses content (text + metadata)
5) UpdateDB → updates the CrawlDB (new URLs, new status, ...)
6) InvertLinks → builds the web graph
7) SOLRIndex → sends documents to SOLR
8) SOLRDedup → removes duplicate documents based on their signature
Repeat steps 2 to 8, or use the all-in-one 'nutch crawl' command.
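The SOLRDedup step (8) keys on a content signature. A minimal sketch of that idea in plain Java: documents whose parsed text yields the same digest collapse to a single entry. Class and method names here are illustrative, not Nutch's actual SolrDeleteDuplicates job.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// Sketch of signature-based deduplication: pages sharing a signature
// are collapsed, keeping the first URL seen. Illustrative only.
public class SignatureDedup {

    // MD5 over the parsed text, hex-encoded; Nutch's default signature
    // is similar in spirit (a digest of the content).
    public static String signature(String parsedText) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(parsedText.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // Keep the first URL for each signature, drop the rest.
    public static List<String> dedup(Map<String, String> urlToText) throws Exception {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : urlToText.entrySet()) {
            if (seen.add(signature(e.getValue()))) kept.add(e.getKey());
        }
        return kept;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("http://a.example/1", "same text");
        docs.put("http://a.example/2", "same text"); // duplicate content
        docs.put("http://b.example/", "other text");
        System.out.println(dedup(docs)); // two URLs survive
    }
}
```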
Frontier expansion
Manual “discovery”
– adding new URLs by hand, “seeding”
Automatic discovery of new resources (frontier expansion)
– not all outlinks are equally useful → control
– requires content parsing and link extraction
[Diagram: frontier expanding outwards from the seed over iterations i = 1, 2, 3. Slide courtesy of A. Bialecki]
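The expansion loop can be sketched in plain Java: each round, the current frontier's outlinks are extracted, the yet-unseen ones become the next frontier. The in-memory map below stands in for real fetching and parsing; all names are illustrative, not Nutch classes.

```java
import java.util.*;

// Sketch of frontier expansion: breadth-first discovery of new URLs,
// round by round (i = 1, 2, 3 ... as on the slide's diagram).
public class FrontierExpansion {

    public static Set<String> crawl(Map<String, List<String>> web,
                                    List<String> seeds, int rounds) {
        Set<String> seen = new HashSet<>(seeds);
        List<String> frontier = new ArrayList<>(seeds);
        for (int i = 0; i < rounds; i++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                for (String out : web.getOrDefault(url, List.of())) {
                    // a URLFilter stage would go here: not all outlinks
                    // are equally useful, so control what gets added
                    if (seen.add(out)) next.add(out);
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "seed", List.of("a", "b"),
            "a", List.of("c"),
            "c", List.of("d"));
        // two rounds reach a, b, c but not d
        System.out.println(crawl(web, List.of("seed"), 2).size()); // 4 URLs
    }
}
```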
An extensible framework
Endpoints:
– Protocol
– Parser
– HtmlParseFilter
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
Plugins:
– activated with the parameter 'plugin.includes'
– implement one or more endpoints
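To make the endpoint idea concrete, here is a sketch of a URLFilter-style plugin. Nutch's real interface lives in org.apache.nutch.net.URLFilter (return the URL to keep it, null to reject); the interface is re-declared below so the example is self-contained, and the filter logic is an illustrative example, not a shipped plugin.

```java
// Sketch of the URLFilter endpoint: a plugin implements the interface
// and is activated via 'plugin.includes'. Re-declared here to be
// self-contained; a real plugin would implement Nutch's own interface.
public class FilterExample {

    public interface URLFilter {            // mirrors the Nutch endpoint
        String filter(String urlString);    // null => reject the URL
    }

    // Example filter: keep only http/https links, rejecting e.g.
    // mailto: or javascript: outlinks found during parsing.
    public static class HttpOnlyFilter implements URLFilter {
        public String filter(String urlString) {
            return urlString.startsWith("http://") || urlString.startsWith("https://")
                    ? urlString : null;
        }
    }

    public static void main(String[] args) {
        URLFilter f = new HttpOnlyFilter();
        System.out.println(f.filter("http://nutch.apache.org/")); // kept
        System.out.println(f.filter("mailto:[email protected]"));  // null
    }
}
```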
Features
Fetcher:
– multi-threaded
– follows robots.txt
– groups URLs per hostname / domain / IP
– limits the number of URLs per round of fetching
– default values are polite but can be made more aggressive
Crawl strategy:
– breadth-first by default, but can be depth-first
– configurable via custom scoring plugins
Scoring:
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
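The intuition behind OPIC can be shown with a toy single pass: each page holds some "cash", and when it is processed its cash is split equally among its outlinks, so well-linked pages accumulate importance. This is a deliberate simplification for illustration, not Nutch's OPICScoringFilter.

```java
import java.util.*;

// Toy sketch of the OPIC idea behind Nutch's default scoring:
// one round of distributing each page's cash over its outlinks.
public class OpicSketch {

    public static Map<String, Double> distribute(Map<String, List<String>> outlinks,
                                                 Map<String, Double> cash) {
        Map<String, Double> next = new HashMap<>();
        for (Map.Entry<String, Double> e : cash.entrySet()) {
            List<String> outs = outlinks.getOrDefault(e.getKey(), List.of());
            if (outs.isEmpty()) continue;         // dangling page: nothing passed on here
            double share = e.getValue() / outs.size();
            for (String to : outs) next.merge(to, share, Double::sum);
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c"));
        // every page starts with cash 1.0, matching Nutch's default score
        Map<String, Double> cash = Map.of("a", 1.0, "b", 1.0);
        System.out.println(distribute(graph, cash)); // c receives 0.5 + 1.0
    }
}
```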
Features (cont.)
Protocols: HTTP, HTTPS, FTP, file
Scheduling: specified or adaptive
URL filters: regex, FSA, TLD, prefix, suffix
URL normalisers: default, regex
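Regex-based normalisation rewrites URLs into a canonical form before they enter the CrawlDB. The sketch below shows two common rules (stripping session ids, collapsing duplicate slashes); in Nutch's urlnormalizer-regex plugin the rules come from a configuration file, whereas here they are hard-coded for illustration.

```java
import java.util.regex.Pattern;

// Sketch of regex-based URL normalisation, in the spirit of Nutch's
// urlnormalizer-regex plugin. The two rules below are examples only.
public class RegexNormalizer {

    // drop session-id parameters such as ;jsessionid=... or sid=...
    private static final Pattern SESSION_ID =
        Pattern.compile("(?i)([;_]?(sid|jsessionid)=[^?&#]*)");
    // collapse runs of slashes in the path ("://" is protected by lookbehind)
    private static final Pattern DUP_SLASHES = Pattern.compile("(?<!:)/{2,}");

    public static String normalize(String url) {
        url = SESSION_ID.matcher(url).replaceAll("");
        url = DUP_SLASHES.matcher(url).replaceAll("/");
        return url;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com//a;jsessionid=ABC?x=1"));
        // -> http://example.com/a?x=1
    }
}
```

Canonicalising like this prevents the same page from entering the CrawlDB under many URL variants.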
Features (cont.)
Other plugins:
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata
Indexing to SOLR:
– bespoke schema
Parsing with Apache Tika:
– but some legacy parsers as well
Data Structures
MapReduce jobs => I/O: Hadoop [Sequence|Map]Files
CrawlDB => status of known pages
MapFile: <Text,CrawlDatum>
  byte status;          // fetched? unfetched? failed? redir?
  long fetchTime;
  byte retries;
  int fetchInterval;
  float score = 1.0f;
  byte[] signature = null;
  long modifiedTime;
  org.apache.hadoop.io.MapWritable metaData;
Input of: generate, index
Output of: inject, update
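A plain-Java mirror of those CrawlDatum fields makes the record concrete and shows how fetchTime and fetchInterval drive re-fetch scheduling. The status codes and the helper method are illustrative, not the real org.apache.nutch.crawl.CrawlDatum.

```java
// Plain-Java mirror of the CrawlDatum fields listed above; the status
// codes and the dueAt() helper are illustrative, not Nutch's own.
public class CrawlDatumSketch {
    public static final byte STATUS_UNFETCHED = 1;   // codes here are arbitrary
    public static final byte STATUS_FETCHED = 2;

    byte status;
    long fetchTime;          // epoch millis of the last fetch
    byte retries;
    int fetchInterval;       // seconds between fetches
    float score = 1.0f;
    byte[] signature = null;
    long modifiedTime;

    // Is this page due for (re-)fetching at time 'now'? This is the kind
    // of decision the Generate step makes per record.
    boolean dueAt(long now) {
        return status == STATUS_UNFETCHED
            || now >= fetchTime + fetchInterval * 1000L;
    }

    public static void main(String[] args) {
        CrawlDatumSketch d = new CrawlDatumSketch();
        d.status = STATUS_FETCHED;
        d.fetchTime = 0L;
        d.fetchInterval = 3600;                  // one hour
        System.out.println(d.dueAt(3_600_000L)); // true: an hour has passed
    }
}
```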
Data Structures 2
Segment => one round of fetching, identified by a timestamp
/crawl_generate/ → SequenceFile<Text,CrawlDatum>
/crawl_fetch/ → MapFile<Text,CrawlDatum>
/content/ → MapFile<Text,Content>
/crawl_parse/ → SequenceFile<Text,CrawlDatum>
/parse_data/ → MapFile<Text,ParseData>
/parse_text/ → MapFile<Text,ParseText>
Can have multiple versions of a page in different segments
Data Structures 3
LinkDB => storage for the web graph
MapFile: <Text,Inlinks>
  Inlinks: HashSet<Inlink>
  Inlink: String fromUrl, String anchor
Output of: invertlinks
Input of: SOLRIndex
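What InvertLinks computes can be sketched directly: outlinks recorded per source page are inverted into inlinks per target page, which is the (target → set of (fromUrl, anchor)) shape the LinkDB stores. The types below are simplified stand-ins for Nutch's Inlinks/Inlink writables, and the data is in-memory rather than MapFiles.

```java
import java.util.*;

// Sketch of link inversion: source -> (target -> anchor) becomes
// target -> set of (fromUrl, anchor), i.e. the LinkDB's shape.
public class InvertLinks {

    // (fromUrl, anchor) pair, like the slide's Inlink
    record Inlink(String fromUrl, String anchor) {}

    public static Map<String, Set<Inlink>> invert(Map<String, Map<String, String>> outlinks) {
        Map<String, Set<Inlink>> linkDb = new HashMap<>();
        for (var src : outlinks.entrySet()) {
            for (var out : src.getValue().entrySet()) {
                linkDb.computeIfAbsent(out.getKey(), k -> new HashSet<>())
                      .add(new Inlink(src.getKey(), out.getValue()));
            }
        }
        return linkDb;
    }

    public static void main(String[] args) {
        var out = Map.of(
            "http://a/", Map.of("http://c/", "see C"),
            "http://b/", Map.of("http://c/", "C here"));
        System.out.println(invert(out).get("http://c/").size()); // 2 inlinks
    }
}
```

At scale this is a natural MapReduce job: map emits (target, inlink), reduce collects each target's inlink set.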
Use cases
Crawl for search systems:
– web-wide or vertical
– single node to large clusters
– legacy Lucene-based search or SOLR
... but not necessarily:
– NLP (e.g. sentiment analysis)
– ML, classification / clustering
– data mining
– MAHOUT / UIMA / GATE
– use Behemoth as glueware (http://github.com/jnioche/behemoth)
SimilarPages.com:
– large cluster on Amazon EC2 (up to 400 nodes)
– fetched & parsed 3 billion pages
– 10+ billion pages in the CrawlDB (~100 TB of data)
– 200+ million lists of similarities
– no indexing / search involved
NUTCH 1.3
Transition between 1.x and 2.0
http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/
1.3-RC3 => imminent
Removed Lucene-based indexing and the search webapp
– delegate indexing / search remotely to SOLR
– change of focus: “Web search application” → “Crawler”
Removed deprecated parse plugins
– delegate most parsing to Tika
Separate local / distributed runtimes
Ivy-based dependency management
NUTCH 2.0
Became trunk in 2010
Same features as 1.3
– delegation to SOLR, Tika, etc.
Moved to a table-based architecture
– wealth of NoSQL projects in the last 2 years
Preliminary version known as NutchBase (Doğacan Güney)
Storage layer moved to a subproject in the Apache Incubator → GORA
GORA
http://incubator.apache.org/gora/
ORM for NoSQL databases
– and limited SQL support
Serialization with Apache Avro
Object-to-datastore mappings (backend-specific)
Backend implementations:
– HBase
– Cassandra
– SQL
– Memory
0.1 released in April 2011
GORA (cont.)
Atomic operations: get, put, delete
Querying: execute, deleteByQuery
Wrappers for Apache Hadoop:
– GORAInput|OutputFormat
– GORAMapper|Reducer
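The shape of those operations can be mimicked with a tiny in-memory store: get, put, delete, plus deleteByQuery (modelled here as a predicate over values). This is an illustration of the API's shape only; the real interface lives in org.apache.gora.store.DataStore and is backed by HBase, Cassandra, etc.

```java
import java.util.*;
import java.util.function.Predicate;

// Minimal in-memory mimic of the GORA operation set listed above.
// Illustrative only: not the real org.apache.gora.store.DataStore API.
public class MemStoreSketch<K, T> {
    private final Map<K, T> rows = new HashMap<>();

    public T get(K key)           { return rows.get(key); }
    public void put(K key, T obj) { rows.put(key, obj); }
    public boolean delete(K key)  { return rows.remove(key) != null; }

    // deleteByQuery: drop every row whose value matches the "query",
    // returning the number of rows removed
    public long deleteByQuery(Predicate<T> query) {
        long before = rows.size();
        rows.values().removeIf(query);
        return before - rows.size();
    }

    public static void main(String[] args) {
        MemStoreSketch<String, String> store = new MemStoreSketch<>();
        store.put("http://a/", "fetched");
        store.put("http://b/", "unfetched");
        System.out.println(store.deleteByQuery(v -> v.equals("unfetched"))); // 1
        System.out.println(store.get("http://a/"));                          // fetched
    }
}
```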
Benefits for Nutch
Storage still distributed and replicated
– but one big table: status, metadata, content, text → one place
Simplified logic in Nutch
– simpler code for updating / merging information
More efficient
– no need to read / write the entire structure to update records
– e.g. the update step in 1.x
Easier interaction with other resources
– third-party code just needs to use GORA and the schema
Status of Nutch 2.0
Beta stage
– debugging / testing required
Compare performance of the GORA backends
Need to update the documentation / wiki
Enthusiasm from the community
GORA – the next great project coming out of Nutch?
Future
Delegate code to crawler-commons (http://code.google.com/p/crawler-commons/)
– fetcher / protocol handling
– robots.txt parsing
– URL normalisation / filtering
New functionality:
– sitemaps
– canonical tag
– more indexers (e.g. ElasticSearch) + pluggable indexers?
Definitive move to 2.0?
– contribute backends and functionality to GORA
Where to find out more?
Project page: http://nutch.apache.org/
Wiki: http://wiki.apache.org/nutch/
Mailing lists:
– [email protected]
– [email protected]
Chapter in 'Hadoop: The Definitive Guide' (T. White)
– understanding Hadoop is essential anyway...
Support / consulting:
– http://wiki.apache.org/nutch/Support
– [email protected]
Questions?