I/O-Algorithms

Lars Arge

January 31, 2005

Random Access Machine Model

• Standard theoretical model of computation:

– Infinite memory

– Uniform access cost

Hierarchical Memory

• Modern machines have complicated memory hierarchy

– Levels get larger and slower further away from CPU

– Levels have different associativity and replacement strategies

– Large access time amortized using block transfer between levels

• Bottleneck often transfers between largest memory levels in use

[Figure: memory hierarchy; labeled levels: L1, L2, RAM]

I/O-Bottleneck

• I/O is often bottleneck when handling massive datasets

– Disk access is $10^6$ times slower than main memory access

– Large transfer block size (typically 8-16 Kbytes)

• Important to obtain “locality of reference”

– Need to store and access data to take advantage of blocks

[Figure: disk drive; labels: track, magnetic surface, read/write arm, read/write head]

Massive Data

• Massive datasets are being collected everywhere

• Storage management software is billion-$ industry

Examples:

• Phone: AT&T 20TB phone call database, wireless tracking

• Consumer: WalMart 70TB database, buying patterns

• WEB: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day

• Geography: NASA satellites generate 1.2TB per day

Example: Grid Terrain Data

Appalachian Mountains (800km x 800km)

• 500MB at 100m resolution

• 5.5GB at 30m resolution

– NASA SRTM mission acquired 30m data for 80% of the earth land mass

• 50GB at 10m resolution (some of US available from USGS)

• 5TB at 1m resolution
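
The sizes above are consistent with simple arithmetic on the grid dimensions. The sketch below is not from the slides: it assumes a square 800km x 800km grid and 8 bytes per cell (an assumption chosen because it roughly reproduces the listed figures), and `grid_bytes` is a hypothetical helper name.

```python
# Rough size estimate for a square grid terrain.
# ASSUMPTION (not stated in the slides): each grid cell takes 8 bytes.
BYTES_PER_CELL = 8


def grid_bytes(side_km, resolution_m, bytes_per_cell=BYTES_PER_CELL):
    """Bytes needed for a side_km x side_km grid sampled every resolution_m meters."""
    cells_per_side = side_km * 1000 / resolution_m
    return cells_per_side ** 2 * bytes_per_cell


if __name__ == "__main__":
    for res in (100, 30, 10, 1):
        print(f"{res:>4} m resolution: {grid_bytes(800, res) / 1e9:10.1f} GB")
```

Under that assumption the estimates come out at roughly 0.5 GB, 5.7 GB, 51 GB and 5.1 TB, close to the figures on the slide.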

I/O-Model

• Parameters

N = # elements in problem instance

B = # elements that fit in disk block

M = # elements that fit in main memory

K = output size in searching problem

• We often assume that $M > B^2$

• I/O: Movement of block between memory and disk

[Figure: disk (D), main memory (M) and processor (P); block I/O between disk and memory]
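
The parameters can be captured in a few lines of code. This is a minimal sketch, not part of the slides: the class name `IOModel` and its methods are hypothetical, and the values used in the example are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOModel:
    """Hypothetical container for the I/O-model parameters B and M."""
    B: int  # number of elements that fit in a disk block
    M: int  # number of elements that fit in main memory

    def blocks_in_memory(self):
        """How many blocks main memory can hold at once (M/B)."""
        return self.M // self.B

    def satisfies_common_assumption(self):
        """Check the assumption M > B^2 mentioned in the slides."""
        return self.M > self.B ** 2

    def scan_ios(self, N):
        """I/Os needed to read N contiguous elements: ceil(N / B)."""
        return -(-N // self.B)


if __name__ == "__main__":
    model = IOModel(B=1024, M=2 ** 21)  # illustrative values, not from the slides
    print(model.blocks_in_memory(), model.satisfies_common_assumption())
    print(model.scan_ios(10 ** 6))
```

Counting one I/O per block moved, rather than per element touched, is the whole point of the model: the cost of a scan is N/B, not N.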

List Ranking

• Trivial internal memory algorithm takes O(N) time

– and causes O(N) page faults in external memory

• O(N/B) is the number of I/Os we need to read N elements

– Difference between N and N/B is extremely important in practice

• Can we develop an O(N/B) algorithm?

– Answer is NO

[Figure: linked list with nodes labeled A to H; B = M/B = 2]
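
The gap between N and N/B can be seen in a small simulation of following the list pointers. The sketch below is not from the slides. It assumes main memory is managed as an LRU cache of M/B block frames (the I/O model itself lets the algorithm choose what to evict) and that, in the bad case, list nodes are placed on disk in uniformly random order. `count_ios` and `traversal_ios` are hypothetical helper names.

```python
import random
from collections import OrderedDict


def count_ios(addresses, B, M):
    """Count block loads for a sequence of element addresses.

    ASSUMPTION: memory holds M // B blocks and evicts the least recently used.
    """
    frames = M // B
    cache = OrderedDict()  # block id -> True, ordered from oldest to newest use
    ios = 0
    for address in addresses:
        block = address // B
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            ios += 1                       # miss: one I/O to load the block
            cache[block] = True
            if len(cache) > frames:
                cache.popitem(last=False)  # evict the least recently used block
    return ios


def traversal_ios(N, B, M, shuffled, seed=0):
    """I/Os to follow a list of N nodes from head to tail.

    position[i] is the disk address of the i-th node in list order.
    """
    position = list(range(N))
    if shuffled:
        random.Random(seed).shuffle(position)
    return count_ios(position, B, M)


if __name__ == "__main__":
    N, B, M = 10_000, 100, 1_000
    print("nodes stored in list order:", traversal_ios(N, B, M, shuffled=False))
    print("nodes stored in random order:", traversal_ios(N, B, M, shuffled=True))
```

With the nodes stored in list order the traversal costs exactly N/B block loads. With a random layout almost every pointer hop lands in a block that is not in memory, so the count approaches N.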

Fundamental Bounds [AV88]

                  Internal        External

• Scanning:       $N$             $\frac{N}{B}$

• Sorting:        $N \log N$      $\frac{N}{B} \log_{M/B} \frac{N}{B}$

• Permuting:      $N$             $\min\{N, \frac{N}{B} \log_{M/B} \frac{N}{B}\}$

– List rank

• Searching:      $\log_2 N$      $\log_B N$

• Note:

– Permuting not linear

– Permuting and sorting bounds are equal in all practical cases

– B factor VERY important: $\frac{N}{B} < \frac{N}{B} \log_{M/B} \frac{N}{B} \ll N$

– Cannot sort optimally with search tree
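
To get a feel for the external bounds, it helps to plug in numbers. The sketch below is not from the slides: it evaluates the asymptotic expressions with all hidden constants dropped, so the results are orders of magnitude rather than exact I/O counts. The function names and the example values of N, M and B are illustrative assumptions.

```python
from math import log


def scan_bound(N, B):
    """External scanning bound: N/B."""
    return N / B


def sort_bound(N, M, B):
    """External sorting bound: (N/B) * log_{M/B}(N/B)."""
    return (N / B) * log(N / B, M / B)


def permute_bound(N, M, B):
    """External permuting bound: min(N, sorting bound)."""
    return min(N, sort_bound(N, M, B))


def search_bound(N, B):
    """External searching bound: log_B N."""
    return log(N, B)


if __name__ == "__main__":
    # ASSUMPTION: illustrative sizes, counted in elements.
    N, M, B = 2 ** 30, 2 ** 20, 2 ** 10
    print(f"scan     {scan_bound(N, B):>16,.0f}")
    print(f"sort     {sort_bound(N, M, B):>16,.0f}")
    print(f"permute  {permute_bound(N, M, B):>16,.0f}")
    print(f"N        {N:>16,}")
    print(f"search   {search_bound(N, B):>16.1f}")
```

For these values the sorting bound is only about twice the scanning bound, while N is about 500 times larger than the sorting bound, which is why permuting by sorting beats moving elements one at a time.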

Sorting

• Merge sort:

– Create N/M memory sized sorted runs

– Merge runs together M/B at a time

$O(\log_{M/B} \frac{N}{M})$ phases using $O(\frac{N}{B})$ I/Os each
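
The two steps of the merge sort can be sketched as follows. This is a simplified in-memory simulation, not the author's implementation: Python lists stand in for disk-resident runs, I/Os are counted as one read and one write per block of every run touched, and the merge fan-in is taken as M/B as on the slide (a real implementation also needs buffer space for output, so its fan-in is slightly smaller). `external_merge_sort` is a hypothetical name.

```python
import heapq


def external_merge_sort(data, M, B):
    """Return (sorted list, number of merge phases, simulated I/O count)."""

    def blocks(n):
        return -(-n // B)  # ceil(n / B)

    ios = 0

    # Step 1: create N/M memory-sized sorted runs.
    runs = []
    for start in range(0, len(data), M):
        run = sorted(data[start:start + M])  # fits in memory, sorted internally
        ios += 2 * blocks(len(run))          # read the chunk, write the run
        runs.append(run)

    # Step 2: merge runs together M/B at a time until one run remains.
    fan_in = max(2, M // B)
    phases = 0
    while len(runs) > 1:
        next_runs = []
        for start in range(0, len(runs), fan_in):
            group = runs[start:start + fan_in]
            merged = list(heapq.merge(*group))  # multiway merge of sorted runs
            ios += 2 * blocks(len(merged))      # read all inputs, write output
            next_runs.append(merged)
        runs = next_runs
        phases += 1

    return (runs[0] if runs else []), phases, ios


if __name__ == "__main__":
    import random

    values = list(range(10_000))
    random.Random(1).shuffle(values)
    result, phases, ios = external_merge_sort(values, M=100, B=10)
    print(result == sorted(values), phases, ios)
```

In the example there are N/M = 100 initial runs and the fan-in is M/B = 10, so two merge phases are needed, matching $\log_{M/B} \frac{N}{M} = \log_{10} 100 = 2$.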

Distribution Sort

Finding partition elements
