LECTURE 21: INDEXED FILES CSC 213 – Large Scale Programming

Upload: leslie-harris

Post on 18-Dec-2015


Today’s Goals

Look at how Dictionaries are used in the real world

Where this would occur & why they are used there

In a real-world setting, what problems can & do occur

Indexed file usage presented and shown

How & why we split index & data files

Formatting of each file and how they get used

Describe what problems are solved using indexed files

Java coding techniques that simplify using these files

Idea needed when using multiple indexes shown

Dictionaries in Real World

Often need large database on many machines

Split search terms across machines

Updating & searching work split between machines

Database way too large for any single machine

If you think about it, this is incredibly common

Where?

Split Dictionaries

Splitting Keys From Values

In the real world, we often have many indices

Simple units measure where we can find values

Values could be searched for in multiple ways

Index & Data Files

Split information into two (or more) files

Data file uses fixed-size records to store data

Index files contain search terms & data locations

Fixed-size records usually used in data file

Each record will use exactly that much space

Extra space is wasted if the value is smaller

But this limits data size: a record cannot get more space

Makes it far easier to reuse space & rebuild index
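The fixed-size idea can be sketched in a few lines of Java. This is only an illustration: the class name `FixedRecordSketch`, its method names, and the use of zero bytes as padding are assumptions, and the 60-byte record size is borrowed from the field sizes used in the Coding slides later in this lecture.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FixedRecordSketch {
    // Assumed layout: 50-byte name + 4-byte price + 6-byte ticker symbol
    static final int RECORD_SIZE = 60;

    // Record n always starts at n * RECORD_SIZE, so no scanning is needed
    static long positionOf(int recordNumber) {
        return (long) recordNumber * RECORD_SIZE;
    }

    // Pads a value with zero bytes up to the field size (the wasted space);
    // refuses a value that cannot fit, since a record cannot get more space
    static byte[] toFixedField(String value, int fieldSize) {
        byte[] raw = value.getBytes(StandardCharsets.US_ASCII);
        if (raw.length > fieldSize) {
            throw new IllegalArgumentException(
                "value needs " + raw.length + " bytes, field holds " + fieldSize);
        }
        return Arrays.copyOf(raw, fieldSize);
    }
}
```

Because every record has the same size, a freed slot can hold any later record, which is what makes reusing space straightforward.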

Index File Format

No standard format: depends on type of data

Often variable sized, but this is not a specific requirement

Each entry in index file begins with exact search term

Followed by position containing matching data

As a result, often find indexes smushed together

Can read indexes at start of program execution

Reasonably assumes index file is smaller than data file

Changes written immediately, however

When program starts, do NOT read data file
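Since the slides say there is no standard index format, the sketch below simply assumes one: each entry is the search term written with `writeUTF`, followed by the position written with `writeLong`. The class `IndexLoaderSketch` and its methods are hypothetical names, and an in-memory byte array stands in for the index file so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

class IndexLoaderSketch {
    // Assumed entry format: search term (writeUTF), then position (writeLong)
    static void writeIndex(Map<String, Long> index, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        for (Map.Entry<String, Long> entry : index.entrySet()) {
            data.writeUTF(entry.getKey());
            data.writeLong(entry.getValue());
        }
        data.flush();
    }

    // Reads the whole index at startup; the data file is never touched here
    static TreeMap<String, Long> readIndex(InputStream in) throws IOException {
        TreeMap<String, Long> index = new TreeMap<>();
        DataInputStream data = new DataInputStream(in);
        while (true) {
            String term;
            try {
                term = data.readUTF();
            } catch (EOFException endOfIndex) {
                break; // no more entries
            }
            index.put(term, data.readLong());
        }
        return index;
    }

    // Writes a small ticker index and reads it back
    static TreeMap<String, Long> roundTrip() {
        try {
            TreeMap<String, Long> original = new TreeMap<>();
            original.put("F", 224L);
            original.put("IBM", 0L);
            original.put("T", 112L);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeIndex(original, bytes);
            return readIndex(new ByteArrayInputStream(bytes.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```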

Never Read Entire Data File

Indexed Files

Enables splitting search terms across computers

Alphabetical split makes searches faster on many servers

Server ranges: A-C, D-E, F-H, I-P, Q-R, S-T, U-X, Y-Z
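One way to picture the alphabetical split is a lookup from a term's first letter to the server holding that range. The sketch below is an assumption about how such routing could be coded, not something shown on the slide; `ShardPickerSketch` and `serverFor` are hypothetical names, and each server is represented only by the label of its range.

```java
import java.util.Map;
import java.util.TreeMap;

class ShardPickerSketch {
    // Key = first letter of a range, value = label of the server holding that range
    private static final TreeMap<Character, String> RANGE_STARTS = new TreeMap<>();
    static {
        RANGE_STARTS.put('A', "A-C");
        RANGE_STARTS.put('D', "D-E");
        RANGE_STARTS.put('F', "F-H");
        RANGE_STARTS.put('I', "I-P");
        RANGE_STARTS.put('Q', "Q-R");
        RANGE_STARTS.put('S', "S-T");
        RANGE_STARTS.put('U', "U-X");
        RANGE_STARTS.put('Y', "Y-Z");
    }

    static String serverFor(String searchTerm) {
        char first = Character.toUpperCase(searchTerm.charAt(0));
        if (first < 'A' || first > 'Z') {
            throw new IllegalArgumentException("term must start with a letter A-Z: " + searchTerm);
        }
        // floorEntry finds the closest range start at or before this letter
        Map.Entry<Character, String> range = RANGE_STARTS.floorEntry(first);
        return range.getValue();
    }
}
```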

Indexed Files

Enables splitting search terms across computers

Create indexes for different types of searching

Example indexes: song name, song length

How Does This Work?

Using index files is simplified by using positions

Look in index structure to find position of data in file

With this position, can then seek to specific record

Create instance & initialize by reading data from file
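The lookup, seek, and read steps above can be sketched with `RandomAccessFile`. The record layout (50-byte name, 4-byte price, 6-byte ticker) follows the constants in the Coding slides; the class `SeekLookupSketch`, its helper methods, the zero-byte padding, and the temporary file are assumptions made so the example runs on its own.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class SeekLookupSketch {
    static final int NAME_SZ = 50, PRC_SZ = 4, TICK_SZ = 6;
    static final int SIZE = NAME_SZ + PRC_SZ + TICK_SZ;

    // Writes one fixed-size record; Arrays.copyOf pads short values with zero
    // bytes and silently truncates values longer than the field
    static void writeRecord(RandomAccessFile file, long pos, String name,
                            int price, String ticker) throws IOException {
        file.seek(pos);
        file.write(Arrays.copyOf(name.getBytes(StandardCharsets.US_ASCII), NAME_SZ));
        file.writeInt(price);
        file.write(Arrays.copyOf(ticker.getBytes(StandardCharsets.US_ASCII), TICK_SZ));
    }

    // Seeks straight to the price field of the record starting at pos
    static int readPrice(RandomAccessFile file, long pos) throws IOException {
        file.seek(pos + NAME_SZ);
        return file.readInt();
    }

    static int demo() {
        try {
            File tmp = File.createTempFile("stocks", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                String[] names = {"IBM", "AT & T", "Ford"};
                int[] prices = {106, 23, 2};
                String[] tickers = {"IBM", "T", "F"};
                Map<String, Long> tickerIndex = new HashMap<>();
                for (int i = 0; i < names.length; i++) {
                    long pos = (long) i * SIZE;
                    writeRecord(file, pos, names[i], prices[i], tickers[i]);
                    tickerIndex.put(tickers[i], pos);
                }
                // Step 1: index gives the position; step 2: seek and read
                long pos = tickerIndex.get("T");
                return readPrice(file, pos);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only the one record that was asked for is read; the rest of the data file stays on disk.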

Starting with Indexed Files

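Name index (name → position): American Telephone & Telegraph → 112; International Business Machines → 0; Ford Motorcars, Inc. → 224

Ticker index (symbol → position): F → 224; IBM → 0; T → 112

Data file (name, price, ticker): at 0: IBM, 106, IBM; at 112: AT & T, 23, T; at 224: Ford, 2, F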

Where Was "Searching" Used?

Indexed files used in Maps and Dictionaries

Read data into searchable object after opening file

For each record, Entry uses indexed data as its key

Single data file has multiple indexes to search it

Not a problem, each index has own Collection

Cannot have multiple instances for each data item

Cannot have single instance for each data item

Then how can we construct each Entry's value?

Proxy Pattern For The Win!

Create proxy instances to use as Entry's value

Proxy pretends it has data by defining getters & setters

Data's position & file are the only fields these objects have

Whenever a method is called, it looks up & returns data

Other classes will think proxy has fields declared

Simplifies using class & ensures up-to-date data used

But little memory needed, since data resides on disk!

Starting with Indexed Files

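Name index (name → position): American Telephone & Telegraph → 112; International Business Machines → 0; Ford Motorcars, Inc. → 224

Ticker index (symbol → position): F → 224; IBM → 0; T → 112

Data file (name, price, ticker): at 0: IBM, 106, IBM; at 112: AT & T, 23, T; at 224: Ford, 12, F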

Coding

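    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class Stock {
        // Offset within a record to the start of each field (_OFF)
        // and the fixed maximum size of each field in bytes (_SZ)
        private static final int NAME_OFF = 0;
        private static final int NAME_SZ = 50;
        private static final int PRC_OFF = NAME_OFF + NAME_SZ;
        private static final int PRC_SZ = 4;
        private static final int TICK_OFF = PRC_OFF + PRC_SZ;
        private static final int TICK_SZ = 6;
        // Fixed size of one record in the data file
        private static final int SIZE = TICK_OFF + TICK_SZ;

        // The proxy's only fields: where its record starts & which file holds it
        private long position;
        private RandomAccessFile theFile;

        public Stock(long pos, RandomAccessFile file) {
            position = pos;
            theFile = file;
        }
        // Getters & setters follow
    }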

NAME_SZ, PRC_SZ & TICK_SZ: fixed max. size of each field

SIZE: fixed size of a record in data file

NAME_OFF, PRC_OFF & TICK_OFF: offset in record to field start

Coding

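    public class Stock { // Continues from last time

        // Each call seeks to the field inside this record, so the value
        // always comes from (or goes to) the file
        public int getStockPrice() throws IOException {
            theFile.seek(position + PRC_OFF);
            return theFile.readInt();
        }

        public void setStockPrice(int price) throws IOException {
            theFile.seek(position + PRC_OFF);
            theFile.writeInt(price);
        }

        public void setTickerSymbol(String sym) throws IOException {
            theFile.seek(position + TICK_OFF);
            // writeUTF stores a 2-byte length before the encoded characters,
            // so sym must encode to at most TICK_SZ - 2 bytes
            theFile.writeUTF(sym);
        }
        // More getters & setters from here…
    }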
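To see why a proxy always reports up-to-date data, the sketch below creates two proxy instances for the same record and changes the price through one of them. `StockProxySketch` is a hypothetical, cut-down stand-in for the `Stock` class on the slides (price field only), and the temporary file and single 60-byte record are assumptions for the demonstration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

class StockProxySketch {
    static final int PRC_OFF = 50;      // price follows the 50-byte name field
    static final int SIZE = 60;         // one fixed-size record

    private final long position;        // where this record starts
    private final RandomAccessFile theFile;

    StockProxySketch(long pos, RandomAccessFile file) {
        position = pos;
        theFile = file;
    }

    int getStockPrice() throws IOException {
        theFile.seek(position + PRC_OFF);
        return theFile.readInt();
    }

    void setStockPrice(int price) throws IOException {
        theFile.seek(position + PRC_OFF);
        theFile.writeInt(price);
    }

    // Returns the price seen by a second proxy before and after the first one writes
    static int[] demo() {
        try {
            File tmp = File.createTempFile("proxy", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                file.setLength(SIZE);
                file.seek(PRC_OFF);
                file.writeInt(2);                       // starting price on disk
                StockProxySketch first = new StockProxySketch(0, file);
                StockProxySketch second = new StockProxySketch(0, file);
                int before = second.getStockPrice();
                first.setStockPrice(12);                // change goes straight to the file
                int after = second.getStockPrice();     // second proxy sees it at once
                return new int[] {before, after};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Neither proxy caches the price, so there is nothing to fall out of date; the cost is a disk access on every call.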

Visualizing Indexed Files

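Name index (name → position): American Telephone & Telegraph → 112; International Business Machines → 0; Ford Motorcars, Inc. → 224

Ticker index (symbol → position): F → 224; IBM → 0; T → 112

Data file (name, price, ticker): at 0: IBM, 106, IBM; at 112: AT & T, 23, T; at 224: Ford, 12, F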

How Do We Add Data?

Adding new records takes only a few steps

Add space for record with setLength on data file

Update index structure(s) to include new record

Records in data file updated at each change
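The steps above can be sketched as follows. `AppendRecordSketch` and `addRecord` are hypothetical names, the 60-byte record size comes from the Coding slides, and a single in-memory `Map` stands in for the index structure.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class AppendRecordSketch {
    static final int SIZE = 60; // fixed record size, as in the Coding slides

    static long addRecord(RandomAccessFile dataFile, Map<String, Long> index,
                          String key) throws IOException {
        long pos = dataFile.length();        // new record starts at the old end of file
        dataFile.setLength(pos + SIZE);      // step 1: add space for one record
        index.put(key, pos);                 // step 2: update the index structure
        return pos;                          // caller then fills in fields at pos
    }

    // Returns {position of new record, new file length, position stored in index}
    static long[] demo() {
        try {
            File tmp = File.createTempFile("append", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                file.setLength(3L * SIZE);   // pretend three records already exist
                Map<String, Long> tickerIndex = new HashMap<>();
                tickerIndex.put("IBM", 0L);
                tickerIndex.put("T", 60L);
                tickerIndex.put("F", 120L);
                long pos = addRecord(file, tickerIndex, "C");
                return new long[] {pos, file.length(), tickerIndex.get("C")};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The contents of the space added by `setLength` are not defined by the Java API, so the new record's fields should be written (for example through a proxy's setters) before anything reads them.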

Adding New Data To The Files

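Name index (name → position): American Telephone & Telegraph → 112; Citibank → 336; International Business Machines → 0; Ford Motorcars, Inc. → 224

Ticker index (symbol → position): C → 336; F → 224; IBM → 0; T → 112

Data file (name, price, ticker): at 0: IBM, 106, IBM; at 112: AT & T, 23, T; at 224: Ford, 12, F; at 336: 0 Ø (new space, not yet filled in)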

Adding New Data To The Files

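Name index (name → position): American Telephone & Telegraph → 112; Citibank → 336; International Business Machines → 0; Ford Motorcars, Inc. → 224

Ticker index (symbol → position): C → 336; F → 224; IBM → 0; T → 112

Data file (name, price, ticker): at 0: IBM, 106, IBM; at 112: AT & T, 23, T; at 224: Ford, 12, F; at 336: Citibank, -2, C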

How Does This Work?

Removing records is even easier

To prevent using record, remove items from indexes

Do NOT update index file(s) until program completes

Use impossible magic numbers for record in data file
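A minimal sketch of removal, under stated assumptions: the slides do not say which magic number is used or which field holds it, so `Integer.MIN_VALUE` in the price field is purely illustrative, as are the names `RemoveRecordSketch`, `removeRecord`, and `isRemoved`. Only one in-memory index is shown; with several indexes the entry would be removed from each.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class RemoveRecordSketch {
    static final int PRC_OFF = 50;
    static final int SIZE = 60;
    static final int REMOVED = Integer.MIN_VALUE; // hypothetical "impossible" price

    static void removeRecord(RandomAccessFile dataFile, Map<String, Long> index,
                             String key) throws IOException {
        Long pos = index.remove(key);        // in-memory index only; the index file waits
        if (pos == null) {
            return;                          // nothing indexed under this key
        }
        dataFile.seek(pos + PRC_OFF);
        dataFile.writeInt(REMOVED);          // mark the record as unused in the data file
    }

    static boolean isRemoved(RandomAccessFile dataFile, long pos) throws IOException {
        dataFile.seek(pos + PRC_OFF);
        return dataFile.readInt() == REMOVED;
    }

    static boolean demo() {
        try {
            File tmp = File.createTempFile("remove", ".dat");
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                file.setLength(3L * SIZE);
                int[] prices = {106, 23, 12};
                String[] tickers = {"IBM", "T", "F"};
                Map<String, Long> tickerIndex = new HashMap<>();
                for (int i = 0; i < prices.length; i++) {
                    long pos = (long) i * SIZE;
                    file.seek(pos + PRC_OFF);
                    file.writeInt(prices[i]);
                    tickerIndex.put(tickers[i], pos);
                }
                removeRecord(file, tickerIndex, "F");
                return !tickerIndex.containsKey("F")
                        && isRemoved(file, 2L * SIZE)
                        && !isRemoved(file, 0L);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Marking the record in the data file means the slot can be recognized as free later, even if the index has to be rebuilt from the data file.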

Removing Data As We Go

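Name index (name → position): American Telephone & Telegraph → 112; Citibank → 336; International Business Machines → 0; Ford Motorcars, Inc. → 224

Ticker index (symbol → position): C → 336; F → 224; IBM → 0; T → 112

Data file (name, price, ticker): at 0: IBM, 106, IBM; at 112: AT & T, 23, T; at 224: Ford, 12, F; at 336: Citibank, -2, C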

Removing Data As We Go

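Name index (name → position): American Telephone & Telegraph → 112; Citibank → 336; International Business Machines → 0

Ticker index (symbol → position): C → 336; IBM → 0; T → 112

Data file (name, price, ticker): at 0: IBM, 106, IBM; at 112: AT & T, 23, T; at 224: 0 Ø (record marked as removed); at 336: Citibank, -2, C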

Using Multiple Indexes

Multiple indexes for data file very often needed

Provides many ways of searching for important data

Since each file is read individually, this could also create a problem

Multiple proxy instances for data could be created

Duplicates of instance are created for each index

Makes removing them all difficult, since not linked

Very easy to solve: use Map while loading index

Converts positions in file to proxy instances to solve this

Linking Multiple Indexes

Use one Map instance while reading all indexes

For each position in file, check if already in Map

Use existing proxy instance, if position already in Map

If a search in Map returns null, create new instance

Make sure to call put() when we must create proxy
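These steps can be sketched as below. The nested `Proxy` class, `SharedProxySketch`, and `loadIndex` are hypothetical names, and the raw indexes are plain in-memory maps from search term to position rather than files, to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class SharedProxySketch {
    // Stand-in for a proxy such as Stock: it only remembers its record's position
    static final class Proxy {
        final long position;
        Proxy(long position) { this.position = position; }
    }

    // Turns one raw index (term -> position) into term -> proxy, reusing
    // proxies already created for the same position by an earlier index
    static Map<String, Proxy> loadIndex(Map<String, Long> rawIndex,
                                        Map<Long, Proxy> byPosition) {
        Map<String, Proxy> result = new TreeMap<>();
        for (Map.Entry<String, Long> entry : rawIndex.entrySet()) {
            Proxy proxy = byPosition.get(entry.getValue());
            if (proxy == null) {                       // first time this position is seen
                proxy = new Proxy(entry.getValue());
                byPosition.put(entry.getValue(), proxy);
            }
            result.put(entry.getKey(), proxy);
        }
        return result;
    }

    static boolean demo() {
        Map<String, Long> names = new HashMap<>();
        names.put("American Telephone & Telegraph", 112L);
        names.put("International Business Machines", 0L);
        names.put("Ford Motorcars, Inc.", 224L);
        Map<String, Long> tickers = new HashMap<>();
        tickers.put("T", 112L);
        tickers.put("IBM", 0L);
        tickers.put("F", 224L);

        Map<Long, Proxy> byPosition = new HashMap<>(); // ONE map shared by all indexes
        Map<String, Proxy> byName = loadIndex(names, byPosition);
        Map<String, Proxy> byTicker = loadIndex(tickers, byPosition);

        // Both indexes must hand back the very same proxy object for a record
        return byPosition.size() == 3
                && byName.get("International Business Machines") == byTicker.get("IBM")
                && byName.get("Ford Motorcars, Inc.") == byTicker.get("F");
    }
}
```

With one proxy per record, removing a record means removing that single instance from each index, with no hidden duplicates left behind.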

For Next Lecture

Angel now has week #9 assignment (due 3/20)

This is after break, but might want to get started now

Angel will also have project #2 available

Has staggered submissions like previous project

Based upon index files, so can start working now!

Will discuss implementing space efficient BST

Start coloring nodes red & black

Keeps balanced, but limits amount of movement