
Page 1: File Structures Course: 03-60-415

Dr. Joan Morrissey
School of Computer Science
University of Windsor, Windsor, Canada

Page 2:

What are file structures, and why do we need them?

A data structure is a method of organizing data in RAM (main memory).

Examples: arrays, stacks, queues, binary trees, AVL trees, etc.

A file structure is a method of organizing data on secondary storage devices (SSDs, not to be confused with solid-state drives) such as a hard disk, tape or CD-ROM. Examples: a simple index, secondary indexes, the family of B-trees, and hashed files.

NO commercial DB is going to fit into RAM, and RAM is volatile.

File structures are used to minimize accesses to SSDs so that data can be retrieved quickly.

Why SSDs?

RAM capacity is limited for very large databases, which may be held on several hard disks.

RAM is volatile, but we need permanent storage for our DB.

Secondary storage is cheaper for backup, archiving and the distribution of software, games, etc.

Page 3:

Hard Disks.

The most common secondary storage medium for DBs.

However, disks are very slow, since we have physical parts to move, unlike RAM.

We must use buffering techniques to load sector(s) into RAM and write them back to disk, which also slows down access.

We can have zoned sectoring, which stores the maximum amount of data per track, or "hard" sectors, which all contain the same amount of data. More later!

What does a basic disk look like?

Page 4:

A disk….

Page 5:

An example of a disk with 7 cylinders….

Page 6:

Tracks, cylinders and sectors with “hard” sectors.

A cylinder on a hard-sectored disk is the set of all tracks X. For example, cylinder 200 consists of all tracks numbered 200 on each side of each platter.

If you can fit a file on a single cylinder (or on contiguous cylinders), access to the data is faster, as you don't have to move the R/W heads.

Page 7:

Hard and Zoned sectors.

With hard sectors, each sector contains the same amount of data. There is wasted space at the edge of the platters!

With zoned sectoring, each zone holds a different number of sectors per track. With this we get better use of space and still have cylinders, but there is some wasted storage.

The disk rotates at a constant speed in both cases. Different from a CD.

Page 8:

Advantages… and disadvantages of disks.

Fast (though slower than RAM) and cheap. Non-volatile.

With hard sectoring, tracks are organized into concentric circles and divided into sectors, each containing the same amount of data. Each track has the same number of sectors. This wastes space, as sectors on the outside of the platter are not packed maximally with data.

Zone recording is more efficient, as we can store more data on each track.

We always have to consider how to minimize accesses to disk (seeks).

Seek = the time to move to the right cylinder, plus the rotational delay until the required sector is under the R/W head, plus the head settle time.

A sector is the smallest amount of data that can be written or read. We don't want BIG sectors, because they cause problems with buffering to and from RAM.

Random and sequential access are both possible. More later!

Page 9:

Other types of SSDs.

Magnetic tape:

As of 2014, the highest-capacity tape cartridges can store 185 TB of data (Sony).

Price? In 2013, 4 TB was $30K, but prices will get lower.

Only sequential access is possible.

Usually very robust. Good for backup and archiving.

Nine parallel tracks run horizontally along the tape. One bit slice = a byte + a parity bit.

We can have even or odd parity. If using even parity, the parity bit is set so as to make the number of ones even, which helps ensure the correctness of the data.
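As a minimal sketch (an illustration, not from the slides), the even-parity rule for one byte can be computed like this:

```python
def even_parity_bit(byte: int) -> int:
    """Parity bit that makes the total number of 1-bits even (even parity)."""
    ones = bin(byte).count("1")
    return ones % 2  # 1 if the byte has an odd number of ones, else 0

even_parity_bit(0b1011)  # three 1-bits, so the parity bit is 1
even_parity_bit(0b1001)  # two 1-bits, so the parity bit is 0
```

A reader can then verify a bit slice by checking that the byte plus its parity bit together contain an even number of ones.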

Tape format: load point marker, file header, file info, IBG, data, …, next file, …, EOF, end of tape.

Page 10:

CDs – not used much except for s/w delivery. Can be read-only or read/write. About 700 MB for single density.

Cheap, but very slow for random access to data. Sectors are identified by minute:second, so random access is by "trial and error".

Not robust. Easy to scratch. Dust can be a problem. Degrades over time.

A single spiral track consists of pits and lands. Pits are nanometres deep; the space between two pits is a land.

Read by a red laser beam. A change in light intensity, when moving from a pit to a land or from a land to a pit, = 1.

The disk speeds up as you read towards the centre, to ensure that the same amount of data goes past the read head in a given period of time. Important with data, but not with music!

Double density is possible: narrower tracks, or two layers of material on top of one another where the laser can read through the top layer. The latter is basically a DVD!

Page 11:

Flash drives or USB drives.

Usually 32/64 GB of storage, but more is available for a price! 1 TB is already available for $2000, with plans for larger capacities. Prices will get lower.

Fairly robust. Human error often leads to them breaking or being lost.

Very fast, as there are no moving parts.

Limited number of r/w cycles before degrading. But they usually get lost before that!

The term "drive" persists because computers read and write flash-drive data using the same system commands; the device is seen as just another drive.

Non-volatile storage.

Very convenient!

Also: external hard drives, important for safe backup. Cloud storage, which is becoming cheaper! Solid-state disks: non-volatile storage with no moving parts; buffering is faster, as is seek time, but they degrade with time. 250 GB for $100+. Hard drive: 4 TB for $200!

Page 12:

Indexing: important as it gives faster access to data.

A simple index for fixed-length records.

Data file: consists of entry-sequenced records (not sorted). The records are held on disk, but only the PK (name) is shown. Data is always appended at EOF.

RRN   Data file
0     Jones (primary key) record…
1     Burke record…
2     Adams record…
3     Smith record…

Page 13:

Primary index for fixed-length records.

Primary index: consists of the PK and the RRN (which gives the position of a record relative to the beginning of the file; this assumes fixed-length records). It is sorted and fixed length. (For the data on the previous slide.)

PK      RRN
Adams   2
Burke   1
Jones   0
Smith   3

Page 14:

How do we find records in the data file?

Load the primary index into RAM from the disk!

Do a binary search on the PK.

Pick up the corresponding RRN.

Seek to the RRN on disk and buffer the record into RAM.

Note that the first RRN is always 0, and RRNs are only used with fixed-length records. RRN × record length = byte offset.

Note also that data is always added to the end of the data file on the disk; it's faster!
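The RRN arithmetic above can be sketched in Python (the 64-byte record length is an assumed value, not from the slides):

```python
RECORD_LENGTH = 64  # assumed fixed record length in bytes

def byte_offset(rrn: int) -> int:
    """RRN × record length = byte offset; the first RRN is always 0."""
    return rrn * RECORD_LENGTH

byte_offset(0)  # 0: the first record starts at the beginning of the file
byte_offset(3)  # 192: where the fourth record (e.g. Smith, RRN 3) begins
```

The program would then seek to that offset on disk and buffer RECORD_LENGTH bytes into RAM.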

Page 15:

A simple index for variable-length records:

Data file: consists of entry-sequenced records (not sorted).

(0) Jones record…… (200) Adams record……… (350) Smith record…… (500) Burke record…… King record…… Brown record…… Hall record…… Wang record…… Liu record…… Etc…

Page 16:

Primary index for variable-length records, for the data on the last slide.

Primary index: consists of the PK and the byte offset (from the start of the file, byte 0). It is sorted (by PK) and fixed length.

Primary Key   Byte offset
Adams         200
Burke         500
Jones         0
Smith         350

Page 17:

Finding a variable-length record.

For example, how do we retrieve the Smith record?

1. Load the primary index into RAM.

2. Use a binary search to find the PK (Smith) in the primary index; we can do so because the index is fixed length.

3. Retrieve corresponding byte offset.

4. Seek to record in data file using byte offset.

5. Read the Smith record.
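The five steps above can be sketched end to end. This is an illustrative toy, not the course's code: an in-memory byte stream stands in for the data file on disk, and newline-terminated records are an assumed layout.

```python
import io
from bisect import bisect_left

# Toy entry-sequenced "data file" in memory; real records would sit on disk.
data_file = io.BytesIO()
offsets = {}
for name in ["Jones", "Adams", "Smith", "Burke"]:
    offsets[name] = data_file.tell()            # byte offset from byte 0
    data_file.write((name + " record data\n").encode())

# Primary index: (PK, byte offset) pairs, sorted by PK, fixed length.
primary_index = sorted(offsets.items())

def fetch(key):
    """Steps 1-5: binary-search the index, seek to the offset, read the record."""
    i = bisect_left(primary_index, (key,))      # (key,) sorts before (key, offset)
    if i < len(primary_index) and primary_index[i][0] == key:
        data_file.seek(primary_index[i][1])     # step 4: seek using the byte offset
        return data_file.readline().decode().rstrip()  # step 5: read the record
    return None

fetch("Smith")  # → "Smith record data"
```

With a real file, only the index lives in RAM; the seek and read are the single disk access.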

Page 18:

Note:

We assume that the simple index can fit in RAM but not the data file.

A “seek” is an access to a secondary storage device.

Can't use an RRN with variable-length records.

What if the primary index is too large to fit into RAM? First, consider using secondary keys to access records: for example, get all the records where the supplier is in Paris. Note that secondary keys do not have to be unique.

Note: in the examples shown, the data file consists of records where the PK is a name, but there is also other information in each record.

Page 19:

Secondary indexes – work with a primary key file.

The secondary index is fixed length and sorted, so that it can fit into RAM all at once.

The primary key file is fixed length and entry sequenced, i.e. data is added at EOF.

The data file is never sorted and never fits into RAM.

"Next RRN" in the primary key file is simply a linked list.

The primary index, the secondary index and the primary key file are all needed to access the data file. But we don't need them all in RAM at once; only the needed data is buffered into RAM.

Page 20:

Data file for secondary index & primary key file.

Note that the data is not sorted. Data is always added at the end of the file for efficiency. The first record is RRN 0, and so on. The RRN is not part of the file.

Look at constructing a secondary index based on city…

RRN   Name     City     (other data)
0     Burke    Athens   …
1     Jones    Paris    …
2     Smith    Athens   …
3     Hall     London   …
4     Wang     Athens   …
5     Kehoe    London   …
6     Black    Paris    …
7     Liu      Paris    …
8     Frost    Rome     …

Page 21:

Secondary Index based on City & primary key file.

Secondary index:

SK       First RRN
Athens   0
London   3
Paris    1
Rome     8

Primary key file:

RRN   PK      Next RRN
0     Burke   2
1     Jones   6
2     Smith   4
3     Hall    5
4     Wang    -1
5     Kehoe   -1
6     Black   7
7     Liu     -1
8     Frost   -1

Page 22:

Finding Records. How do we retrieve all the records where the supplier is in Paris?

1. We do a binary search of the secondary index to find the SK "Paris".

2. We retrieve the corresponding first RRN, 1.

3. We seek to RRN 1 in the primary key file and get the PK "Jones", the first supplier in Paris.

4. We follow the linked list, using Next RRN, to pick up Black (6) and Liu (7), the other suppliers in Paris.

5. We now have 3 PKs: Jones, Black and Liu.

6. The -1 indicates the end of the linked list.

7. With each of the PKs we do a binary search of the primary index, using the PK, and pick up the corresponding byte offset or RRN for the data in the data file.

8. Finally, we retrieve the records from the data file on disk using the RRN or offset.
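Steps 1-6 above can be sketched with the tables from the previous slides. As a simplification (not from the slides), a dictionary lookup stands in for the binary search of the sorted secondary index, and a Python list indexed by RRN stands in for the primary key file:

```python
def suppliers_in_city(secondary_index, primary_key_file, city):
    """Follow the Next-RRN linked list starting at the city's first RRN.

    secondary_index: SK -> first RRN (simplification of the sorted index)
    primary_key_file: list of (PK, next RRN) rows, indexed by RRN
    """
    pks = []
    rrn = secondary_index.get(city, -1)
    while rrn != -1:                       # -1 marks the end of the linked list
        pk, next_rrn = primary_key_file[rrn]
        pks.append(pk)
        rrn = next_rrn
    return pks

secondary = {"Athens": 0, "London": 3, "Paris": 1, "Rome": 8}
pk_file = [("Burke", 2), ("Jones", 6), ("Smith", 4), ("Hall", 5),
           ("Wang", -1), ("Kehoe", -1), ("Black", 7), ("Liu", -1),
           ("Frost", -1)]

suppliers_in_city(secondary, pk_file, "Paris")  # → ["Jones", "Black", "Liu"]
```

Each PK found this way would then be looked up in the primary index (steps 7-8) to reach the actual records on disk.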

Page 23:

B-Trees.

A B-tree is simply a large index. It will never fit in RAM (only parts of the B-tree will fit), and the aim is to reduce seeks while it is being used. A seek involves the physical movement of a read/write head on the device and is thus very slow in comparison to RAM.

The basic unit of the B-tree is the node: conceptually an ordered sequence of keys, references (RRN or byte offset) and pointers.

For example, an order-7 B-tree node has 6 keys, 6 corresponding references and 7 pointers. The order is the maximum number of pointers that a node can have. (Note: references are left out for clarity.) Each pointer points to another node in the B-tree. Very efficient. For example, take K: the pointer on its left points to the node containing keys that are greater than F but less than K.

↓ A ↓ F ↓ K ↓ O ↓ U ↓ Z ↓

Page 24:

Example of a complete (small) B-tree: order 4

Root (node 7):                      N

Internal nodes:          D H K              Q S W   ← node 6

Leaf nodes (no pointers):
A B C | E F G | I J | L M      O P | R | T U V | X Y Z   ← node 9 is X Y Z

Contains all the letters of the alphabet, loaded in the (random) order:

C, S, D, T, A, M, P, I, B, W, N, G, U, K, E, H, O, L, J, Y, Q, Z, F, X, V

Node numbers come from how the tree is built; they are left out to simplify the diagram.

Page 25:

Finding a Record.

For example, find the record with key “Z”

• Load the root (node 7) into RAM. N is less than Z, so follow the right pointer to node 6.

• The root node is always loaded into RAM first. We may even keep it there while we are using the tree, if we have the space; this improves efficiency.

• Load node 6 into RAM. Do a binary search of the node and follow the pointer to the right of W, to node 9, since Z is greater than W.

• Load node 9 into RAM and do a binary search to find Z. Follow the reference (a pointer to disk) to find the "Z" record.

• Load the Z block into RAM. Note that we use a "modified" binary search.

B-trees are very powerful when the node size is large. For example, with node size 512 we can access more than 134 million records with a maximum of 3 seeks.
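The "134 million records in 3 seeks" figure is just fan-out arithmetic: with 512 pointers per node and one node read (one seek) per level, three levels reach 512³ bottom-level pointers.

```python
order = 512             # maximum pointers per B-tree node
levels = 3              # one node read (one seek) per level
reachable = order ** levels
reachable               # 134,217,728 pointers at the bottom level: > 134 million
```

Doubling the node size to 1024 pointers would push the same 3-seek bound past a billion records, which is why shallow, wide trees are preferred on disk.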

Page 26:

When to use a B-tree.

When you need very fast indexed access to records.

The objective is to keep the B-tree as shallow as possible (the lowest possible number of levels). This is achieved by increasing the node size, which is fixed for a particular tree; for example, 512 (or more) pointers per node.

Node size is limited only by what can fit in RAM and by the time needed to do a binary search of a node in RAM.

Note that nodes need not necessarily be full (and probably won't be); it depends on the order in which records are added. But each node MUST, as a property, be at least half full.

The root (the top of the tree) can have as few as two pointers. The leaves have no pointers.

You must have a primary key and a reference (RRN or byte offset). A pointer will be a cylinder, track and sector number (on a hard disk), pointing to where the next node is located.

Page 27:

Advantages:

You can find the information needed to get a record at any level in the tree. (Not true of B+-trees.)

Can, if needed, be used for sequential access to data, if you do an in-order traversal of the tree and the data is stored in the tree. (A B+-tree is much more efficient for this task.)

The tree is always balanced: all leaves are always on the same level, because the tree is built from the leaf nodes up to the root, not the other way around.

Page 28:

B+-Trees: Indexed sequential access

Many applications need both indexed access (for example, through a B-tree) and sequential (in-order) access.

Example: student records. Indexed: print a transcript for an individual student. Sequential: update the grades of all students registered in 60-415-01.

Therefore, we need a file structure which allows both (a) random (indexed) access to a single record and (b) sequential access to all records by primary key.

Solution: B+-trees.

Problem: keeping the records in physical order by key.

Do we sort the file every time we get a new record? No! Too expensive.

Solution: keep records in sorted blocks connected by a linked list, so that the blocks are logically kept in sorted order. Note that the blocks can be anywhere on the disk, but ideally close together.

Page 29:

The sequence set.

Each block contains records and is sorted by PK. We must be able to fit at least two blocks in RAM together, to merge blocks and move records.

Advantage: we never have to keep the whole file sorted, just the blocks.

Page 30:

Disadvantages of sequence set:

Blocks may not be full, so we get internal fragmentation: space is wasted in the file. However, each block must be at least half full.

We must maintain the linked list as records are inserted and deleted, which may cause the addition or deletion of blocks. Records are moved to keep blocks at least half full.

Blocks are not stored in physical order, so more seeks may be necessary to print the records in sorted order.

What's a good block size?

It should require no more than one seek.

We must be able to fit two blocks in RAM (plus code) so that blocks can be merged or split, as caused by the deletion or insertion of a record.

Page 31:

How do we access the blocks?

We place a B-tree (the index set) on top of the blocks; the result is a B+-tree.

The purpose of the B+-tree is to locate a block of records, which is then loaded into RAM and (a) searched for the required record or (b) processed record by record in order. We do (b) by following the linked list of blocks.

The most common type of B+-tree is the simple prefix B+-tree, but it is only used when the keys (separators) can be shortened.

We don't use all the keys in the B-tree part. We use strings called separators, rather than all the keys, to distinguish between one block (of records) and another.

We use the shortest possible string as the separator. This is what makes it "simple prefix".

It is also height balanced, the same as a B-tree: all leaves are at the same level.

We also want to keep the B+-tree as shallow as possible; this is easier than for a B-tree, since we use separators rather than all keys.

Page 32:

Example of a simple prefix B+-Tree:

Remember that the sequence set is in logical but not physical order.

Sequence set ↑

Page 33:

Properties of and when to use a B+-Tree.

Use when indexed and sequential access is needed to the records.

You always need to go to the leaf level to retrieve a block of records. This is not true of B-trees.

Separators rather than keys are used, giving a more efficient tree.

Usually shallower than a B-tree.

Use a simple prefix B+-tree when the keys will compress and space is a problem. The cost is a more complex structure and code.

However, sometimes we need really fast access to data. An example is reading price-code labels at a supermarket checkout. Any type of B-tree would be too slow. Which brings us to…

Page 34:

Hashing on a disk …..

The best method for really fast access to records stored on an SSD is hashing.

Note that hashing on SSDs is done somewhat differently from hashing in RAM, as the objective is to minimize disk accesses.

Advantages:

Direct access to the record, as no index is used.

Saves space, since we have no index (simple, B-tree or B+-tree).

Fast inserts and deletes in the data file (the file of records).

An average of fewer than 2 seeks to retrieve any record.

Disadvantages:

Can't be used with variable-length records.

Very difficult to sort the data file.

Secondary keys are not possible with a simple hashed file.

Page 35:

What is hashing?

A hash function, h(key), transforms the key into a home address, which is an address on a secondary storage device, for example on a hard disk.

The addresses produced are "random": that is, every address is equally likely to be produced by the hash function.

Two or more keys may hash to the same home address. This is called a collision, and the keys involved are called synonyms. We must have methods for dealing with this problem.
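A toy hash function makes the collision/synonym vocabulary concrete. This is an illustration only (the slides do not prescribe a particular function), and the table size of 11 is an assumed value:

```python
TABLE_SIZE = 11  # assumed number of home addresses

def h(key: str) -> int:
    """Illustrative hash: fold the character codes together, then mod table size."""
    return sum(ord(c) for c in key) % TABLE_SIZE

# "ab" and "ba" have the same sum of character codes, so they collide:
# they are synonyms, sharing one home address.
h("ab") == h("ba")
```

A real hash function would mix the bytes more thoroughly so anagrams like these do not systematically collide.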

Page 36:

Hashing – a simplified diagram.

Hash file

Page 37:

Building the file & retrieving a record.

Assume no collision resolution at this stage.

Set aside a number of addresses, always greater than the number of records (usually about twice as many, for optimal reduction of collisions).

Apply the hash function to all the keys and place the records at those addresses in the data file on the hard disk.

Retrieving a record from the file:

Apply the hash function to the supplied key (from the query) to get the corresponding address.

Seek to the address and move the record into RAM.

If you come to the end of the file, simply start again at the beginning of the file and keep searching until you reach the point where you started. If that happens, the record is not in the file!

Page 38:

Collisions happen!

Collisions must be resolved, since (for the moment) we can't have two records at the same address.

Ideally, we would find a perfect hash function which never produces collisions. Impossible!

Solution: develop algorithms, called collision resolution methods, which will minimize collisions.

Methods include:

Choose a hash function which will distribute the records at least randomly. In this case, every address is just as likely to be produced.

Use extra addresses so that collisions are less likely. The cost is space which may never be used, but we get fewer seeks.

Use progressive overflow or chained progressive overflow.

Put more than one record at an address; the address is then called a bucket.

Page 39:

Progressive overflow.

To place a record: apply the hash function to the PK to produce an address. If the address is already in use (busy), continue searching down the file to find an empty slot in which to place the record corresponding to the PK.

To find a record:

1. Apply the hash function to get the home address.

2. Perform a sequential search from the home address until the record is found.

3. What if you come to the end of the file? Wrap around to the first address in the file and continue the search.
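Both procedures above can be sketched over an in-memory list of slots (a toy stand-in for the hashed file on disk; the helper names are illustrative, not from the slides):

```python
def po_insert(table, key, h):
    """Place key at h(key), or at the next empty slot, wrapping at end of file."""
    n = len(table)
    home = h(key)
    for i in range(n):
        slot = (home + i) % n          # sequential search, with wrap-around
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("file is full")

def po_find(table, key, h):
    """Search from the home address; stop at an empty slot or after a full wrap."""
    n = len(table)
    home = h(key)
    for i in range(n):
        slot = (home + i) % n
        if table[slot] is None:        # the record would have been stored here
            return None
        if table[slot] == key:
            return slot
    return None                        # wrapped all the way around: not in file

table = [None] * 5
hh = lambda key: 2                     # toy hash: every key is a synonym (home 2)
for k in ["A", "B", "C", "D"]:
    po_insert(table, k, hh)
po_find(table, "D", hh)                # "D" wrapped past the end, to slot 0
```

Making every key a synonym is the worst case on purpose: it shows both the wrap-around and the clustering that the next slide warns about.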

Page 40:

Progressive overflow…. continued

How do you know if a record is not in the file?

Stop searching when one of the following happens:

You return to the home address.

You find an empty space; the record would have been stored there if it were in the file. (Another good reason for a low PD, i.e. packing density.)

Advantage of progressive overflow: very simple to implement.

Disadvantages of progressive overflow:

Very slow, because of the sequential search to find an empty slot, and slow means expensive!

It can cause clusters of "overflow" records and thereby increase the number of seeks. This happens because you always use the next available empty slot with this method; it does not spread out the records.

Page 41:

Chained progressive overflow - improvement.

It is a variation where we use a linked list of synonyms to reduce the number of seeks.

Keys and home addresses:

PK       Home
Hossain  5
Kehoe    1
Smith    5
Cole     2
Wang     2
Burke    1
Jones    5
Black    3
Saha     3
Liu      5
Ahmed    1

File (chained progressive overflow):

RRN   Record    Next RRN
0     (empty)
1     Kehoe     4
2     Cole      3
3     Wang      8
4     Burke     11
5     Hossain   6
6     Smith     7
7     Jones     10
8     Black     9
9     Saha      -1
10    Liu       -1
11    Ahmed     -1

Page 42:

Advantages & Disadvantages.

Advantage: a reduced number of seeks. Look at finding Liu!

Disadvantages:

We still get clustering of records.

There is a linked list to maintain, which makes inserts more complicated.

You can't always get into the right linked list (of synonyms) by starting at the home address of a record. The problem occurs when there is already another record at the home address as a result of chained progressive overflow. One solution is to do a sequential search to find the right record and thus get into the right linked list. For an example, see the next slide.

Better solution: have a primary data area (home addresses only) and a separate overflow data area where synonyms are placed and linked by pointers.


Page 43:

Chained progressive overflow - a problem.

Keys and home addresses (Hossain's home address is now 1):

PK       Home
Hossain  1
Kehoe    1
Smith    5
Cole     2
Wang     2
Burke    1
Jones    5
Black    3
Saha     3
Liu      5
Ahmed    1

File:

RRN   Record    Next RRN
0     (empty)
1     Hossain   2
2     Kehoe     6
3     Cole      4
4     Wang      8
5     Smith     7
6     Burke     11
7     Jones     10
8     Black     9
9     Saha      -1
10    Liu       -1
11    Ahmed     -1

We won't be able to find the Wang record by starting at its home address, 2! A huge problem.

Page 44:

Primary data area & overflow area. ("Next" in the primary data area is an RRN in the separate overflow area.)

Primary data area:

RRN   Record    Next
0     (empty)
1     Hossain   0
2     Cole      1
3     Black     4
4     (empty)
5     Smith     3
6     (empty)

Overflow area:

RRN   Record    Next
0     Kehoe     2
1     Wang      -1
2     Burke     6
3     Jones     5
4     Saha      -1
5     Liu       -1
6     Ahmed     -1

Page 45:

Primary data area & overflow data area.

Advantage: you can always find a record by starting at its home address; you will always get into the correct linked list.

Disadvantage: there are now two files to maintain, meaning more overhead and more complicated code. There is also a linked list to maintain in the overflow area, but it is smaller than in chained progressive overflow.

Page 46:

Other collision resolution methods.

Buckets: store more than one record at each address. A bucket is usually one or two sectors, or a block, on the disk. Don't make buckets too big, as there is a trade-off between bucket size and the time required to buffer a bucket into RAM.

The hash function now produces a home bucket address.

We still get some collisions, but far fewer. We can use progressive overflow to deal with them, and clusters of buckets are rare.

Double hashing: if a collision occurs, another hash function is applied to the key to give a number X. X is then added to the home address to give the actual address (if this address is occupied, X is added again).

Advantage: spreads out the records, making collisions (and clusters) less likely, and reduces the average number of seeks needed to find a record.

Disadvantage: removes locality; a record may be placed on a different cylinder, which will cause an extra seek. So try to keep synonyms on the same cylinder.
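The double-hashing probe sequence described above can be sketched as follows (an illustration, not the course's code; the two hash functions are passed in as parameters):

```python
def double_hash_probe(table, key, h1, h2):
    """Home address from h1; on a collision, repeatedly add the step X = h2(key),
    wrapping around the file. Gives up after one full pass."""
    n = len(table)
    addr = h1(key)
    step = h2(key)                    # X: the second hash of the same key
    for _ in range(n):
        if table[addr] is None:
            return addr
        addr = (addr + step) % n      # "X is added again"
    raise RuntimeError("no free address found")

table = [None, "busy", None]
double_hash_probe(table, "k", lambda k: 1, lambda k: 1)  # home 1 is busy → 2
```

One caveat: if the step X shares a factor with the table size, the probe sequence will not visit every address, which is why a prime number of addresses is a common choice.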

Page 47:

How do we handle deletions of records from the file?

Two issues to consider:

1. Want to reuse the slot (space on disk).

2. Don't want deletions to interfere with the search for a record in the file. (Remember: in progressive and chained progressive overflow, we stop searching when we find an empty slot.)

Solution: insert a special marker (called a tombstone) when we delete, to indicate that a record was there but has been deleted.

However, we do not put in a tombstone if the slot after it is empty, as this would only increase the length of the search for a (non-existent) record.
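The effect of a tombstone on searching can be sketched like this (an illustration, not the course's code; the marker value and helper name are assumptions):

```python
TOMBSTONE = "#deleted#"  # assumed special marker value

def find_skipping_tombstones(table, key, h):
    """Sequential search from the home address: a tombstone means 'keep
    probing', while a truly empty slot means the record is not in the file."""
    n = len(table)
    for i in range(n):
        slot = (h(key) + i) % n
        if table[slot] is None:       # empty: the record cannot be further on
            return None
        if table[slot] == key:        # a tombstone never equals a real key
            return slot
    return None

table = ["Adams", TOMBSTONE, "Cole", None]
find_skipping_tombstones(table, "Cole", lambda k: 0)  # probes past the tombstone
```

Without the tombstone at slot 1, the search for "Cole" would stop early at an empty slot and wrongly report the record as missing.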

Page 48:

File degradation.

Problem: performance deteriorates over time as records are added and deleted.

Specifically, tombstone slots may become occupied by overflow records, which makes search lengths longer than they need to be.

Solutions:

Reorganize (move records around) after a delete: expensive, with complicated code.

Use a different collision resolution method. However, problems can still arise after time has elapsed.

Rehash the file when the average search length (the average number of seeks to find a record) becomes unacceptable. The best solution!