
CS246

Search Engine Scale

Junghoo "John" Cho (UCLA Computer Science)


High-Level Architecture

Major modules for a search engine?
1. Crawler
   - Page download & refresh
2. Indexer
   - Index construction
   - PageRank computation
3. Query Processor
   - Page ranking
   - Query logging


General Architecture

[Diagram: overall system architecture connecting the crawler, indexer, and query processor]


Scale of Web Search

- Number of pages indexed: ~10B pages
- Index refresh interval: once per month → ~1,200 pages/sec
- Number of queries per day: 290M in Dec 2008 → ~3,000 queries/sec
- Services often run on commodity Intel-Linux boxes
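As a quick sanity check, here is a minimal sketch converting the slide's daily/monthly figures into per-second rates (all constants are the slide's):

```python
# Per-second rates implied by the slide's figures.
SECS_PER_DAY = 24 * 3600

queries_per_sec = 290e6 / SECS_PER_DAY         # 290M queries/day (Dec 2008)
print(f"~{queries_per_sec:,.0f} queries/sec")  # ~3,356; the slide rounds to ~3,000

# The deck's working crawl rate of ~1,200 pages/sec corresponds to refreshing
# roughly 1,200 * 30 * 86,400 ~ 3.1B pages per month.
pages_per_month = 1200 * 30 * SECS_PER_DAY
print(f"~{pages_per_month / 1e9:.1f}B pages refreshed per month")
```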


Other Statistics

- Average page size: 15KB
- Average query size: 40B
- Average result size: 5KB
- Average number of links per page: 10


Size of Dataset (1)

- Total raw HTML data size: 10B pages x 15KB = 150 TB!
- Inverted index is roughly the same size as the raw corpus: another 150 TB for the index itself
- With appropriate compression at a 3:1 ratio: (150 TB + 150 TB) / 3 ≈ 100 TB of data residing on disk


Size of Dataset (2)

- Number of disks necessary for one copy: (100 TB) / (1 TB per disk) = 100 disks
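The sizing arithmetic from these two slides, as a small Python sketch (constants are the slides' figures):

```python
# Dataset sizing from the slides' figures.
PAGES     = 10e9    # ~10B pages indexed
PAGE_SIZE = 15e3    # 15KB average page
TB        = 1e12

raw_html = PAGES * PAGE_SIZE        # raw HTML corpus
index    = raw_html                 # inverted index ~ same size as corpus
on_disk  = (raw_html + index) / 3   # 3:1 compression ratio

print(f"raw HTML: {raw_html / TB:.0f} TB")   # 150 TB
print(f"on disk:  {on_disk / TB:.0f} TB")    # 100 TB
print(f"disks:    {on_disk / TB:.0f} x 1TB disks for one copy")  # 100
```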


Data Size and Crawling

Efficient crawling is very important:
- At 1 page/sec, we would need 1,200 machines just for crawling
- Parallelization through threads/an event queue is necessary (see the sketch after this list)
- A complex crawling algorithm -- no, no!
- A well-optimized crawler: ~100 pages/sec (10 ms/page) → ~12 machines for crawling

Bandwidth consumption:
- 1,200 pages/sec x 15KB x 8 bits ≈ 150Mbps
- One dedicated OC3 line (155Mbps) for crawling: ~$400,000 per year
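A minimal sketch of the thread/queue parallelization idea, using only the Python standard library (the seed URL and worker count are hypothetical; a real crawler would add politeness delays, robots.txt handling, link extraction, and so on):

```python
import queue
import threading
import urllib.request

NUM_WORKERS = 100          # enough in-flight fetches to hide network latency
frontier = queue.Queue()   # URLs waiting to be downloaded

def worker():
    # Each worker pulls a URL, fetches it, and (in a real crawler)
    # would parse out new links and push them back onto the frontier.
    while True:
        url = frontier.get()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                page = resp.read()
                print(f"fetched {url}: {len(page)} bytes")
        except Exception as e:
            print(f"failed {url}: {e}")
        finally:
            frontier.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

for url in ["https://example.com/"]:   # hypothetical seed list
    frontier.put(url)
frontier.join()   # wait until every queued URL has been processed
```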


Data Size and Indexing

Efficient indexing is very important: 1,200 pages/sec.

Indexing steps:
- Load page, extract words – network/disk intensive
- Sort words, postings – CPU intensive
- Write sorted postings – disk intensive

Pipeline the indexing steps, so each machine overlaps loading, sorting, and writing (diagram and sketch below):

[Diagram: L(oad) → S(ort) → W(rite) stages staggered in time across partitions P1, P2, P3]
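A minimal sketch of that pipeline, assuming three thread-connected stages on one machine (the toy corpus and the postings file name are hypothetical; a real indexer processes large batches and merges sorted runs):

```python
import queue, threading

loaded   = queue.Queue(maxsize=100)  # pages waiting to be sorted
sorted_q = queue.Queue(maxsize=100)  # postings waiting to be written
DONE = object()                      # end-of-stream marker

def load(pages):
    # Stage 1 (network/disk intensive): read pages, extract words.
    for doc_id, text in pages:
        loaded.put((doc_id, text.split()))
    loaded.put(DONE)

def sort_stage():
    # Stage 2 (CPU intensive): produce sorted (word, doc_id) postings per page.
    while (item := loaded.get()) is not DONE:
        doc_id, words = item
        sorted_q.put(sorted((w, doc_id) for w in words))
    sorted_q.put(DONE)

def write():
    # Stage 3 (disk intensive): append sorted postings to an index file.
    with open("postings.txt", "a", encoding="utf-8") as f:
        while (batch := sorted_q.get()) is not DONE:
            for word, doc_id in batch:
                f.write(f"{word}\t{doc_id}\n")

docs = [(1, "the quick brown fox"), (2, "the lazy dog")]  # toy corpus
threads = [threading.Thread(target=load, args=(docs,)),
           threading.Thread(target=sort_stage),
           threading.Thread(target=write)]
for t in threads: t.start()
for t in threads: t.join()
```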


Simplified Indexing Model

Model: copy 50TB of data from disks in S to disks in D through the network
- S: crawling machines, D: indexing machines
- 50TB crawled data, 50TB index
- Ignore actual processing
- 1TB disk per machine → ~50 machines in S and D each
- 8GB RAM per machine

[Diagram: machines in S connected over the network to machines in D]


Data Flow

Disk → RAM → Network → RAM → Disk

No hardware is error-free:
- Disk (undetected) error rate: ~1 per 10^13 bytes
- Network error rate: ~1 bit error per 10^12 bits
- Memory soft error rate: ~1 bit error per month per 1GB

Such errors typically go unnoticed for small data.


Data Flow

- Assuming a 1Gbit/s link between machines, with 1TB per machine and an effective transfer rate of ~30MB/s:
- 1TB / (30MB/s) ≈ 33,000 sec ≈ 9 hours, i.e. half a day just for data transfer
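The transfer-time arithmetic, with constants from the slide:

```python
TB, MB = 1e12, 1e6
transfer_secs = 1 * TB / (30 * MB)          # each machine copies its 1TB in parallel
print(f"~{transfer_secs / 3600:.1f} hours") # ~9.3 hours, about half a day
```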


Errors from Disk

- Undetected disk error rate: ~1 per 10^13 bytes
- 5 x 10^13 bytes read in total, 5 x 10^13 bytes written in total
- → ~10 byte errors from disk reads/writes


Errors from Memory

- 1 bit error per month per 1GB; 100 machines with 8GB each
- 8 x 100 = 800 bit errors/month → ~15 bit errors per half day
- → ~15 byte errors from memory corruption


Errors from Network

- Network error rate: ~1 bit error per 10^12 bits
- 5 x 8 x 10^13 = 4 x 10^14 bits transferred
- → ~400 bit errors scattered around the stream, i.e. ~400 byte errors
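Pulling the three error estimates together (rates and volumes are the slides' figures; each flipped bit is counted as one corrupted byte):

```python
# Expected corrupted bytes while copying 50TB, using the slides' rates.
bytes_moved = 5e13                        # 50TB through each stage
disk_errs   = (2 * bytes_moved) * 1e-13   # read + write, ~1 error per 10^13 bytes
net_errs    = (8 * bytes_moved) * 1e-12   # ~1 bit error per 10^12 bits
mem_errs    = (8 * 100) / 60              # 800 bit flips/month; copy takes half a day

print(f"disk:    ~{disk_errs:.0f} byte errors")   # ~10
print(f"network: ~{net_errs:.0f} byte errors")    # ~400
print(f"memory:  ~{mem_errs:.0f} byte errors")    # ~13; the slide rounds to 15
```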


Data Size and Errors (1)

- During index construction/copy, something always goes wrong: ~400 byte errors from the network, ~10 from disk, ~15 from memory
- Very difficult to trace and debug, particularly the disk and memory errors: no OS or application anticipates such errors yet
- These are pure hardware errors, but it is very difficult to tell a hardware error from a software bug, since software bugs may cause similar corruption


Data Size and Errors (2)

- Very difficult to trace and debug: data corruption in the middle of, say, sorting completely screws up the sort
- Need a data-verification step after every operation (a checksum sketch follows this list)
- Algorithms and data structures must be resilient to data corruption: checkpoints, etc.
- ECC RAM is a must: it can detect (and typically correct) most 1-bit errors
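A minimal sketch of the verification idea: store a checksum with every block when it is written, and verify it on every read before the next stage touches the data (file names are hypothetical):

```python
import hashlib

def checksum(data: bytes) -> str:
    # Content digest kept alongside each block of data.
    return hashlib.sha256(data).hexdigest()

def write_block(path: str, data: bytes) -> str:
    with open(path, "wb") as f:
        f.write(data)
    return checksum(data)           # keep this with the block's metadata

def read_block(path: str, expected: str) -> bytes:
    with open(path, "rb") as f:
        data = f.read()
    if checksum(data) != expected:  # bit rot, bad transfer, or a software bug
        raise IOError(f"checksum mismatch on {path}")
    return data

# Verify after every hop: write, then re-verify before the next stage runs.
digest = write_block("run_000.dat", b"sorted postings ...")
postings = read_block("run_000.dat", digest)
```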


Data Size and Reliability

- Disk mean time to failure: ~3 years
- (3 x 365 days) / 100 disks ≈ 11 days → one disk failure roughly every 10 days
- Remember, this is just for one copy
- Data organization should be very resilient to disk failure
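The failure-interval arithmetic; the larger disk counts are hypothetical extensions to show how quickly it gets worse:

```python
MTTF_DAYS = 3 * 365   # per-disk mean time to failure
for disks in (100, 1_000, 10_000):   # 100 disks is the slide's one-copy figure
    print(f"{disks:>6} disks: one failure every {MTTF_DAYS / disks:.1f} days")
# 100 disks: roughly every 11 days; at 10,000 disks, several failures per day
```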


Data Size and Query Processing

- Index size 50TB → 50 disks → potentially a 50-machine cluster to answer a single query
- If one machine goes down, the cluster goes down
- A multi-tier index structure can be helpful (see the sketch after this list):
  - Tier 1: popular (high-PageRank) page index
  - Tier 2: less popular page index
  - Most queries can be answered by the tier-1 cluster (with fewer machines)
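A minimal sketch of the tiered lookup policy (the tier contents and the result threshold K are hypothetical; a real system tunes its fallback rule carefully):

```python
# Hypothetical two-tier index: tier 1 holds popular (high-PageRank) pages.
TIER1 = {"ucla": ["ucla.edu"], "news": ["cnn.com", "bbc.com"]}
TIER2 = {"ucla": ["ucla.edu/cs246"], "obscure": ["example.org/page"]}

K = 2  # results we want per query

def answer(query):
    hits = TIER1.get(query, [])
    if len(hits) >= K:        # common case: the small tier-1 cluster suffices
        return hits[:K]
    # Fall back to the larger tier-2 cluster only when tier 1 is not enough.
    return (hits + TIER2.get(query, []))[:K]

print(answer("news"))   # served entirely by tier 1
print(answer("ucla"))   # tier-1 result topped up from tier 2
```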


Implication of Query Load

- 3,000 queries/sec; rule of thumb: 1 query/sec per CPU (depends on the number of disks, memory size, etc.)
- → ~3,000 machines just to answer queries
- 5KB per answer page: 3,000 x 5KB x 8 bits ≈ 120Mbps
- Half a dedicated OC3 line (155Mbps): ~$300,000
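The query-load arithmetic, with constants from the slide:

```python
QPS = 3000
machines   = QPS / 1.0        # rule of thumb: ~1 query/sec per CPU
egress_bps = QPS * 5e3 * 8    # 5KB answer page per query
print(f"~{machines:.0f} machines, ~{egress_bps / 1e6:.0f} Mbps out")
# ~3000 machines, ~120 Mbps: about half an OC3 line
```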


Hardware at Google

- ~10K-machine Intel-Linux cluster
- Assuming 99.9% uptime (~8 hours of downtime per year), ~10 machines are always down: a nightmare for system administrators
- Assuming 3-year hardware replacement: set up, replace, and retire ~10 machines every day; heterogeneity is unavoidable
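The fleet arithmetic behind these two bullets:

```python
FLEET = 10_000
always_down  = FLEET * (1 - 0.999)   # 99.9% uptime per machine
replaced_day = FLEET / (3 * 365)     # 3-year replacement cycle
print(f"~{always_down:.0f} machines always down")   # ~10
print(f"~{replaced_day:.0f} machines replaced/day") # ~9, call it 10
```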

"Position Requirements:
• Able to lift/move 20-30 lbs equipment on a daily basis."

-- from a job posting at Google