Upload: brooke-mosley

Post on 16-Jan-2016

File Processing - Indexing MVNC 1

Indexing

Jim Skon

Indexing

Index structures can greatly speed access.

Consider a library card catalog:
» Allows quick access to books
» Why not just order books by author name?

Actually three indexes:
» Author
» Topic
» Title

Indexing

Simple Index
» Provides a shortcut, based on a key value, to the desired record.
» Each index is based on the value of certain key(s)
» Can have indexes for any key field

[Diagram labels: Index, File]

Indexing

Multiple Indexes
» May have indexes for more than one field

[Diagram labels: Index, File, Index]

Indexing

Example: Record Albums
» Record label
» Record ID
» Title
» Composer(s)
» Artist(s)

Primary key: Record label + Record ID

Indexing

Consider an index file whose records contain:
» Primary Key (Record label + Record ID)
» Byte Offset (of the record in the main file)

Index sorted in primary key order
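
A minimal sketch of such an index, assuming a "|"-delimited, newline-terminated record layout and made-up sample albums (neither comes from the slides). It writes records to an in-memory data file and collects (primary key, byte offset) pairs, sorted by key.

```python
import io

# Hypothetical sample data: (record label, record ID, title).
RECORDS = [
    ("COL", "3345", "Symphony No. 9"),
    ("ANG", "1023", "Violin Concerto"),
    ("DG", "7788", "Symphony No. 5"),
]

def build_data_file(records):
    """Write records as '|'-delimited lines; return the file and a sorted index."""
    data = io.BytesIO()
    index = []
    for label, rec_id, title in records:
        offset = data.tell()                     # byte offset where this record starts
        data.write(f"{label}|{rec_id}|{title}\n".encode("utf-8"))
        index.append((label + rec_id, offset))   # primary key = label + ID
    index.sort()                                 # keep the index in primary key order
    return data, index

data_file, primary_index = build_data_file(RECORDS)
```

Note that the data file stays in arrival order; only the index is sorted.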

Operations in indexed file

Retrieving record
» Search index file (perhaps using binary search)
» Seek in main file to the byte offset specified in index
» Read record from main file
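
The three retrieval steps can be sketched as follows. The record layout (newline-terminated lines) and the sample data are assumptions for illustration; the binary search uses Python's standard `bisect` module.

```python
import bisect
import io

def find_record(data, index, key):
    """Return the record line for key, or None if the key is not indexed."""
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, key)        # binary search on the sorted keys
    if i == len(keys) or keys[i] != key:
        return None
    data.seek(index[i][1])                   # seek to the stored byte offset
    return data.readline().decode("utf-8").rstrip("\n")

# Hypothetical data file and its index, sorted by primary key.
data = io.BytesIO(b"COL|3345|Symphony No. 9\nANG|1023|Violin Concerto\n")
index = [("ANG1023", 24), ("COL3345", 0)]
```

For example, `find_record(data, index, "ANG1023")` seeks to byte 24 and reads one line.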

Operations in indexed file

Create the empty index and data files
Load the index file into memory
Rewrite the index file after index change
Add records to the file and index
Delete records from data file
Update records in data file

Operations in indexed file

Create the empty index and data files
» Create new files
» Write header records indicating number of records
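
One possible header layout, sketched under the assumption that the header is a single little-endian 32-bit record count (the slides do not specify a format). In-memory `io.BytesIO` objects stand in for the real files.

```python
import io
import struct

HEADER = struct.Struct("<I")   # assumed header: one unsigned 32-bit record count

def create_empty(f):
    """Initialise a file-like object with a header saying 'zero records'."""
    f.seek(0)
    f.truncate()
    f.write(HEADER.pack(0))

def record_count(f):
    """Read the record count back from the header."""
    f.seek(0)
    return HEADER.unpack(f.read(HEADER.size))[0]

index_file, data_file = io.BytesIO(), io.BytesIO()
create_empty(index_file)
create_empty(data_file)
```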

Operations in indexed file

Load the index file into memory
» Simply read index in sequential order, placing into an array of (key, offset) structures
» Since the records are small, could read several records at once
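
A sketch of the loading step, assuming fixed-size index entries of a 12-byte null-padded key followed by a 32-bit offset (an illustrative layout, not one given in the slides). Each `read` call fetches a batch of entries rather than one at a time.

```python
import io
import struct

ENTRY = struct.Struct("<12sI")   # assumed layout: 12-byte padded key, 32-bit offset

def load_index(f, batch=64):
    """Read fixed-size entries sequentially, up to `batch` entries per read call."""
    index = []
    f.seek(0)
    while True:
        chunk = f.read(ENTRY.size * batch)
        if not chunk:
            break
        for key, offset in ENTRY.iter_unpack(chunk):
            index.append((key.rstrip(b"\0").decode("ascii"), offset))
    return index

# Hypothetical index file holding two entries.
index_file = io.BytesIO(ENTRY.pack(b"ANG1023", 24) + ENTRY.pack(b"COL3345", 0))
index = load_index(index_file)
```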

Operations in indexed file

Rewrite the index file after index change
» Need only be done after index changes
» Simply iterate through array, writing to index file
» Can be done after EVERY change
» Could wait until files are ready to be closed
  – Need to keep track of whether file version is out of date
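
The "out of date" bookkeeping can be sketched with a dirty flag. The `MemoryIndex` class and the entry layout below are hypothetical names and choices made for this example.

```python
import io
import struct

ENTRY = struct.Struct("<12sI")   # assumed layout: 12-byte padded key, 32-bit offset

class MemoryIndex:
    def __init__(self):
        self.entries = []     # sorted list of (key, offset)
        self.dirty = False    # True when the file copy is out of date

    def add(self, key, offset):
        self.entries.append((key, offset))
        self.entries.sort()
        self.dirty = True

    def flush(self, f):
        """Rewrite the index file, but only if something changed."""
        if not self.dirty:
            return False
        f.seek(0)
        f.truncate()
        for key, offset in self.entries:
            f.write(ENTRY.pack(key.encode("ascii"), offset))
        self.dirty = False
        return True

idx, index_file = MemoryIndex(), io.BytesIO()
idx.add("COL3345", 0)
idx.add("ANG1023", 24)
first = idx.flush(index_file)    # writes two entries
second = idx.flush(index_file)   # nothing changed, so nothing is written
```

Calling `flush` after every change is the safe option; calling it once at close is cheaper but loses changes if the program stops unexpectedly.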

Operations in indexed file

Add records to the file and index
» Add record to main file
  – Next free record
  – Maybe a linked list of “unused” records could be used to keep track of available records.
  – Record order of main file unimportant
» Add record to index
  – Requires moving down later records to keep file sorted
  – Could put at end, sorting occasionally.
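
A sketch of the add operation, assuming newline-terminated records and an in-memory index. For simplicity it always appends to the data file and does not reuse free slots; `bisect.insort` performs the sorted insert, shifting later entries down.

```python
import bisect
import io

def add_record(data, index, key, payload):
    """Append payload to the data file and insert (key, offset) in sorted position."""
    data.seek(0, io.SEEK_END)             # main-file order is unimportant: just append
    offset = data.tell()
    data.write(payload + b"\n")
    bisect.insort(index, (key, offset))   # shifts later entries down by one
    return offset

data, index = io.BytesIO(), []
add_record(data, index, "COL3345", b"COL|3345|Symphony No. 9")
add_record(data, index, "ANG1023", b"ANG|1023|Violin Concerto")
```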

Operations in indexed file

Delete records from data file
» Delete in main file
  – Mark record
  – Perhaps link into list of free records
» Delete in index
  – Perhaps shift every later entry by one to close the gap
  – Perhaps just mark as deleted
    Could still search if key field still intact
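
The "mark as deleted" variant might look like the sketch below. The tombstone byte `*` and the separate `deleted` set are assumptions made for this example; the index entry itself is left in place, so its key remains searchable.

```python
import io

DELETED = b"*"   # assumed tombstone: overwrite the first byte of the record

def delete_record(data, index, deleted, key):
    """Tombstone the record in the data file and flag the index entry."""
    for k, offset in index:             # linear scan, for brevity
        if k == key and k not in deleted:
            data.seek(offset)
            data.write(DELETED)         # mark the record in the main file
            deleted.add(k)              # mark the index entry as deleted
            return True
    return False

data = io.BytesIO(b"COL|3345|Symphony No. 9\nANG|1023|Violin Concerto\n")
index = [("ANG1023", 24), ("COL3345", 0)]
deleted = set()
ok = delete_record(data, index, deleted, "ANG1023")
```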

Operations in indexed file

Update records in data file
» If change involves key field
  – Will need to move entry in index
  – Can be thought of as a delete followed by an insert
» If change does not change key field
  – Case one - record does not move
    Just rewrite record; index unchanged
  – Case two - record changes position
    Perhaps the record is variable size, and it grows
    Index will have to be changed to reflect new position
    Position of reference in index unchanged
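
The two non-key cases can be sketched as below, assuming newline-terminated records and space padding for in-place rewrites (both illustrative choices). When the record grows it is appended at the end of the file; the old copy is simply left behind here, and the index entry keeps its slot but gets a new offset.

```python
import io

def update_record(data, index, i, old_len, new_payload):
    """Rewrite the record for index[i]; relocate it if the new payload is longer."""
    key, offset = index[i]
    if len(new_payload) <= old_len:
        data.seek(offset)                        # case one: rewrite in place
        data.write(new_payload.ljust(old_len))   # pad with spaces to the old length
    else:
        data.seek(0, io.SEEK_END)                # case two: record moves
        offset = data.tell()
        data.write(new_payload + b"\n")
        index[i] = (key, offset)                 # same slot in index, new offset
    return offset

data = io.BytesIO(b"COL|3345|Symphony No. 9\n")
index = [("COL3345", 0)]
new_offset = update_record(data, index, 0, 23, b"COL|3345|Symphony No. 9 in D minor")
```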

Indexes too large to keep in memory

Searching
» Binary searching requires several reads
» Not much better than searching a sorted complete file

Updating
» Index update can require rewriting much of the file
» Orders of magnitude more expensive than in-memory index management

Indexes too large to keep in memory

In such cases consider
» A hash file system
» A tree-structured index (e.g. B-tree)

However, a file-based index still has benefits
» Allows binary searching on unordered file
» Allows binary searching on variable-length records
» Indexes are smaller than main files, so somewhat cheaper to manipulate
» Allows file “rearrangement” without moving actual records. (Consider when pinned)

Indexing with multiple keys

Consider an additional index for access to album file by composer

Secondary index: fields
» Composer
» Offset into main file

Problem
» Every time a record is moved in main file, ALL indexes must change
» The indexes pin the records!

Indexing with multiple keys

Secondary index pinning - solution
» Refer to primary key rather than offset to actual record
» Now secondary key index doesn’t reference actual records, so records are not pinned.
» Main file can be reorganized without changing secondary index

Indexing with multiple keys

Searching by secondary index
» Search secondary index (binary search?)
» If found, use associated primary key to look up record in primary index
» Use offset in primary index to look up actual record

Remember - the secondary key may have multiple matches (e.g. Beethoven)
» A secondary key can be thought of as referring to a subset of records
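
The two-step lookup can be sketched as follows, assuming a "|"-delimited record with the composer as the fourth field and made-up sample albums. The secondary index stores (composer, primary key) pairs, so it never holds a byte offset; one composer may yield several records.

```python
import bisect
import io

def build_indexes(data):
    """Scan the data file once, building primary and secondary (composer) indexes."""
    primary, secondary = [], []
    data.seek(0)
    while True:
        offset = data.tell()
        line = data.readline()
        if not line:
            break
        label, rec_id, _title, composer = line.decode("utf-8").rstrip("\n").split("|")
        key = label + rec_id
        primary.append((key, offset))
        secondary.append((composer, key))   # refers to the key, not the offset
    primary.sort()
    secondary.sort()                        # duplicates end up in primary key order
    return primary, secondary

def lookup(index, key):
    """Return every value stored under key in a sorted list of (key, value) pairs."""
    i = bisect.bisect_left(index, (key,))   # first entry whose key is >= key
    out = []
    while i < len(index) and index[i][0] == key:
        out.append(index[i][1])
        i += 1
    return out

def find_by_composer(data, primary, secondary, composer):
    records = []
    for pkey in lookup(secondary, composer):    # step 1: secondary index
        for offset in lookup(primary, pkey):    # step 2: primary index
            data.seek(offset)                   # step 3: read the record
            records.append(data.readline().decode("utf-8").rstrip("\n"))
    return records

# Hypothetical data file.
data = io.BytesIO(b"COL|3345|Symphony No. 9|Beethoven\n"
                  b"ANG|1023|Violin Concerto|Beethoven\n"
                  b"DG|7788|Requiem|Mozart\n")
primary, secondary = build_indexes(data)
```

If a primary key found in the secondary index is missing from the primary index, the inner loop simply yields nothing.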

Indexing with multiple keys

Adding new records
» Add record in main file as before
» Add entry in primary index
» Add entry in secondary index
  – As before, shift data as needed.
  – Index entries with duplicate keys are stored together.
  – Duplicates should be stored in primary key order

Indexing with multiple keys

Deleting records
» Remove entry from all secondary indexes
  – Costly if many secondary indexes
» Or simply leave in secondary indexes
  – Search in primary index will fail, indicating record not available
  – Failed searches take longer, but file management is simpler (faster)

Indexing with multiple keys

Updating records
» The fact that secondary indexes refer to primary key insulates secondary indexes from most updates
  – Records can move in main file without affecting secondary index
» Change in secondary key
  – If a secondary key value changes, then we must change the key value in secondary index, requiring secondary index reordering
  – Other secondary indexes unchanged

Indexing with multiple keys

Updating records
» Change of primary key value
  – All secondary indexes must be updated to refer to the new key value
  – Since the secondary key is unchanged, no reorganization required in secondary indexes - just rewrite index entries in same spot
  – Usually one index entry needs updating per secondary index.
  – The main record itself will simplify looking up associated reference in secondary index!

Retrieval using combinations of secondary keys

Consider:
» Find all records with ID COL3345
» Find all records of Beethoven’s work
» Find all records of “Violin Concerto”

Each requires only a single index!

Retrieval using combinations of secondary keys

Now consider:
» Find all records with composer = “Beethoven” and title = “Symphony No. 9”.

Method one:
» Search composer index for those matching Beethoven. This yields a list of primary keys.
» Next search title index for those matching “Symphony No. 9”. This also yields a list of primary keys.
» Now intersect the two primary key lists. This is a list of primary keys for records which match the query.

Retrieval using combinations of secondary keys

General Strategies
» “and” queries: Intersect primary key lists
» “or” queries: Union primary key lists

Point: Complex queries can be performed accessing only the matching records!
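
A small sketch of both strategies using Python sets. The index contents are made up for illustration (including a second "Symphony No. 9" by a different composer); only primary keys are combined, so no album record is read until the final key list is known.

```python
def keys_for(index, value):
    """Collect the set of primary keys stored under a secondary key value."""
    return {pkey for skey, pkey in index if skey == value}

# Hypothetical secondary indexes: (secondary key, primary key) pairs.
COMPOSER_INDEX = [("Beethoven", "ANG1023"), ("Beethoven", "COL3345"),
                  ("Dvorak", "RCA2626")]
TITLE_INDEX = [("Symphony No. 9", "COL3345"), ("Symphony No. 9", "RCA2626"),
               ("Violin Concerto", "ANG1023")]

by_composer = keys_for(COMPOSER_INDEX, "Beethoven")
by_title = keys_for(TITLE_INDEX, "Symphony No. 9")

and_result = sorted(by_composer & by_title)   # composer AND title: intersection
or_result = sorted(by_composer | by_title)    # composer OR title: union
```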

Secondary index problems

Consider problems with this secondary index structure:
» We have to rearrange the index file every time a new record is added!
  – If we add a new version of Beethoven’s Symphony No. 9, we would have to add a new element to both the composer and the title indexes
» If there are duplicate secondary keys, the secondary key value is stored in the secondary index once for every record with the secondary key!
  – Beethoven is stored in secondary index once for every Beethoven record in the main file.
  – Waste of space!

Inverted lists

Solution one:
» Increase secondary index record size to include a list of all primary keys with matching values.
» Solves the two problems
» Introduces problems:
  – Records must be large enough for maximum size list
  – Wastes space!

This is an Inverted List

Inverted lists

Solution Two:
» The Bible Index is a type of Inverted List
  – Works OK since never updated
  – If updates needed, MANY records would have to be moved

Inverted lists

Solution Three:
» Secondary index has:
  – A list of secondary keys (all unique)
  – Each entry contains a pointer to a list of primary key references
» Now each key value stored exactly once
» But how do we maintain the lists of primary key references?

Solution - linked lists!

Inverted lists

Inverted lists with linked lists of references

Two data structures
» A list of secondary keys, with pointers into a list of references
» A list of references, each with a (next) pointer, which refers to another reference in list, or null

Inverted lists

The secondary key list is no bigger than the number of distinct secondary key values
» Can often be stored in RAM
» Lookups - binary search

The reference list can be stored in a file
» Unused records are maintained as a linked list of free records
» Records are added by being delinked from the free list, and linked into the appropriate secondary key’s list.
» A record can be deleted by removing it from the key’s linked list and linking it into the free list.
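
The two structures and the free list can be sketched in memory as below. The `InvertedList` class is a hypothetical illustration with several simplifications: a Python dict stands in for the sorted secondary key list, a Python list of [primary key, next] slots stands in for the reference file, and new references are linked at the front of a key's list rather than kept in primary key order.

```python
NIL = -1   # marks "no next reference"

class InvertedList:
    def __init__(self):
        self.heads = {}    # secondary key -> slot number of its first reference
        self.refs = []     # reference "file": slots of [primary_key, next_slot]
        self.free = NIL    # head of the linked list of free slots

    def add(self, skey, pkey):
        if self.free != NIL:                  # delink a slot from the free list
            slot = self.free
            self.free = self.refs[slot][1]
            self.refs[slot] = [pkey, NIL]
        else:                                 # no free slot: grow the file
            slot = len(self.refs)
            self.refs.append([pkey, NIL])
        self.refs[slot][1] = self.heads.get(skey, NIL)   # link at the front
        self.heads[skey] = slot

    def remove(self, skey, pkey):
        prev, cur = NIL, self.heads.get(skey, NIL)
        while cur != NIL and self.refs[cur][0] != pkey:
            prev, cur = cur, self.refs[cur][1]
        if cur == NIL:
            return False
        nxt = self.refs[cur][1]
        if prev != NIL:
            self.refs[prev][1] = nxt          # unlink from the middle or end
        elif nxt != NIL:
            self.heads[skey] = nxt            # unlink the first reference
        else:
            del self.heads[skey]              # list is now empty
        self.refs[cur] = [None, self.free]    # link the slot into the free list
        self.free = cur
        return True

    def lookup(self, skey):
        out, cur = [], self.heads.get(skey, NIL)
        while cur != NIL:
            out.append(self.refs[cur][0])
            cur = self.refs[cur][1]
        return out

inv = InvertedList()
inv.add("Beethoven", "ANG1023")
inv.add("Beethoven", "COL3345")
inv.add("Mozart", "DG7788")
```

Each secondary key value is stored once, and a freed slot is reused by the next `add` instead of growing the reference file.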

Selective indexes

Consider a “special” index for Christian music

The index(es) would only contain references to albums which are considered Christian.
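
A selective index can be sketched as an index built through a filter. The album records and the "genre" field below are assumptions invented for this example; the slides do not say how an album is classified.

```python
# Hypothetical album records; the "genre" field is an assumption for illustration.
ALBUMS = [
    {"key": "COL3345", "title": "Symphony No. 9", "genre": "classical"},
    {"key": "WRD1001", "title": "Hymns", "genre": "christian"},
    {"key": "SPR2002", "title": "Psalms", "genre": "christian"},
]

def selective_index(records, predicate):
    """Return sorted primary keys of only those records that satisfy predicate."""
    return sorted(r["key"] for r in records if predicate(r))

christian_index = selective_index(ALBUMS, lambda r: r["genre"] == "christian")
```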