managing data resources


Upload: efuru

Post on 22-Feb-2016

Page 1: Managing Data Resources

1

Managing Data Resources

Page 2: Managing Data Resources

2

The Name of the Game

• Information is a valuable resource.
• It is expensive to collect, maintain, and use.
• The goal of database management is to
– maximize the benefits gained from information
  • maximize the accuracy of information
– minimize the costs associated with information

Page 3: Managing Data Resources

3

Keeping Track of Things

• Entity - person, place, thing or event on which we maintain information.

• Attribute - A single piece of information describing a particular entity.

Page 4: Managing Data Resources

4

Data Hierarchy

• Database - a collection of related files
• File - a collection of uniform records
• Record - a collection of related fields
• Field - a collection of bytes
• Byte (& words)
• Bit

Page 5: Managing Data Resources

5

Terminology

Generic      Database   Spreadsheet
-----        Table      Table/Sheet
Entity       Record     Row
Attribute    Field      Column

Page 6: Managing Data Resources

6

Key Field(Attribute)

• A key field is an attribute that uniquely identifies a record in a file.
– Examples: SSN, NAID
• The values in the key field MUST be unique.
• It is possible to use several fields to form a composite key.
– Example: Lastname + firstname + middlename

Page 7: Managing Data Resources

7

Natural Keys

• It is convenient and desirable to use attributes which “naturally occur” with an entity as a key.

• Example - most students have an SSN by the time they enroll at NDSU, so the SSN would be a natural key.

Page 8: Managing Data Resources

8

Accessing Information

• Look up items (records) by the value of their key.
• Methods of access:
– Sequential Access
– Direct Access
– Indexed Sequential Access

Page 9: Managing Data Resources

9

Ordered vs. Unordered

• A database file (collection of records) may be:

• ordered - physically arranged in the file so that the key field increases (or decreases) in a sequential fashion.

• unordered - physically arranged in the file so the key field has no ordered relation with the preceding or succeeding key.

Page 10: Managing Data Resources

10

Costs & Benefits of Ordering

• “In general” a record can be found faster in an ordered list than in an unordered list.
– I’ll use the terms file & list interchangeably.
• “In general” you can turn an unordered list into an ordered list by sorting.
• Sorting is a cost of keeping a list ordered.
• In this course we will generally be dealing with ordered lists.

Page 11: Managing Data Resources

11

Sequential Access

• Look at the key of the first record in the file,
• if it is not the target then look at the next record,
• if it is not the target then look at the next record, …
• If the file has N records, on average you will have to look at about N/2 records to find a random target.
• Question - Why not just “skip over” some of the records?
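The N/2 estimate can be checked with a short sketch. This is only an illustration: the in-memory list `records`, the field name `key`, and the function `sequential_search` are assumptions made for the example, not part of the slides.

```python
def sequential_search(records, target_key):
    """Scan records front to back; return (record, number of keys examined)."""
    comparisons = 0
    for record in records:
        comparisons += 1
        if record["key"] == target_key:
            return record, comparisons
    return None, comparisons


# Hypothetical file of N = 1000 records with keys 0..999.
records = [{"key": k, "name": f"record-{k}"} for k in range(1000)]

# Average number of keys examined when every key is equally likely to be the target.
total = sum(sequential_search(records, k)[1] for k in range(1000))
average = total / len(records)
print(average)  # 500.5, i.e. (N + 1) / 2, which is roughly N/2
```

A target that is not in the file costs the full N comparisons, which is the worst case for this method.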

Page 12: Managing Data Resources

12

Sequential Access

• An employee database might use SSN as the key field.

• If the target SSN is 540-12-3763, and
• the first record SSN is 120-11-0007, then
• how many records should you skip?
• This is why sequential access has to look at every record: the key values alone do not tell you how far away the target is.

Page 13: Managing Data Resources

13

Sequential Access

• Historically data was stored on tapes.
• Tapes store information sequentially and “only” allow for sequential access.
• DASD (disk drives) can also store files sequentially. Files are written to the disk track-by-track, cylinder-by-cylinder in a “physically contiguous” fashion.

Page 14: Managing Data Resources

14

Direct Access

• Direct access means that, given a value for the key attribute, the system can move “directly” to the corresponding record without having to look at any intervening records in the file.

• Direct access requires that the system “know” the physical location of the target record on the disk.

Page 15: Managing Data Resources

15

Hashing Algorithms

• To find the physical location on the disk a computation is performed on the key value which yields a “unique” physical address for the corresponding record.

• Perfect hashing algorithms get you to a unique address.

• Imperfect algorithms may hash several keys to the same address.

Page 16: Managing Data Resources

16

Hashing Example

• Suppose that I were using SSN as the key and wanted to keep track of 100 entities.

• Select 101 (the prime number closest to the number of records) and divide the SSN by it.

• Remainder will always be a number between 0 and 100.

Page 17: Managing Data Resources

17

Hashing Example

• The remainder represents the disk address.
– A remainder of 52 could represent cylinder 5, surface 2
• If two or more SSNs have the same remainder (hash to the same address) this is called a collision. Essentially these records are then searched sequentially.
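A minimal sketch of the remainder method, assuming SSNs are given as strings with dashes and that a Python dictionary of lists stands in for the disk addresses. The names `hash_address`, `insert`, and `lookup` and the employee labels are hypothetical.

```python
TABLE_SIZE = 101  # the prime divisor used on the slide


def hash_address(ssn):
    """Map an SSN such as '540-12-3763' to an address in 0..100."""
    return int(ssn.replace("-", "")) % TABLE_SIZE


def insert(table, ssn, record):
    """Store the record at its hashed address; colliding records share a list."""
    table.setdefault(hash_address(ssn), []).append((ssn, record))


def lookup(table, ssn):
    """Go directly to the hashed address, then search colliding records sequentially."""
    for stored_ssn, record in table.get(hash_address(ssn), []):
        if stored_ssn == ssn:
            return record
    return None


table = {}
insert(table, "540-12-3763", "Employee A")
insert(table, "120-11-0007", "Employee B")
# 540123864 = 540123763 + 101, so this SSN collides with Employee A.
insert(table, "540-12-3864", "Employee C")

print(lookup(table, "540-12-3864"))  # Employee C
```

Keeping colliding records in a short list at the same address is one common way to handle collisions; the slides do not say which scheme they have in mind.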

Page 18: Managing Data Resources

18

Direct Access Note

• The physical addresses in Direct Access have no relation to the sequential “order” of the keys.

• For any two adjacent sequential keys there is no guarantee about the relationship between their physical locations on the disk; they may not be “physically contiguous”.

Page 19: Managing Data Resources

19

Sequential vs. Direct Access

• Sequential Access
– good when you want to process all records in key order, next record is always ready to be read/written.
• Direct Access
– good when you want to process records in a random order, next record can be found directly.

Page 20: Managing Data Resources

20

Indexed Sequential Access Method (ISAM)

• Combines a sequential file with one or more levels of indexes.

• Each index relates a physical location to the highest key value stored in that location.

• You find physical location by looking in each level of the index and then sequentially searching the last physical location.

Page 21: Managing Data Resources

21

ISAM

• In the library the books are laid out sequentially by call number (the key).

• Look at floor index to determine the floor
• Look at shelf index to determine the shelf
• Sequentially search the shelf

Page 22: Managing Data Resources

22

ISAM

• ISAM tries to give the best of both worlds.
• When you want to process items sequentially you have an underlying sequential file.
• When you want direct access you go through the indexes to get close, then a “small” sequential search at end.
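The index-then-scan idea can be sketched with a single index level. The block contents, the variable names `blocks` and `index`, and the functions `isam_lookup` and `scan_all` are assumptions for illustration; a real ISAM file would keep the blocks on disk and might use several index levels.

```python
from bisect import bisect_left

# The sequential file, split into blocks ("shelves"); keys ascend across the whole file.
blocks = [
    [(3, "c"), (8, "h"), (12, "l")],
    [(15, "o"), (21, "u"), (27, "aa")],
    [(30, "dd"), (44, "rr"), (59, "ggg")],
]

# The index relates each block to the highest key stored in it.
index = [block[-1][0] for block in blocks]


def isam_lookup(key):
    """Use the index to pick a block, then search that block sequentially."""
    pos = bisect_left(index, key)  # first block whose highest key is >= key
    if pos == len(blocks):
        return None  # key is larger than every key in the file
    for k, value in blocks[pos]:
        if k == key:
            return value
    return None


def scan_all():
    """Sequential processing: read the blocks in order, ignoring the index."""
    return [k for block in blocks for k, _ in block]


print(isam_lookup(21))  # u
print(scan_all())       # [3, 8, 12, 15, 21, 27, 30, 44, 59]
```

The same file serves both access patterns: `scan_all` never touches the index, and `isam_lookup` examines at most one block.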

Page 23: Managing Data Resources

23

Traditional File Systems

• Also called:
– flat file organization
– data file approach
• Typically an organization or a department within an organization would develop their applications and associated data files in an independent fashion.

Page 24: Managing Data Resources

24

Problems with Traditional Files

• Data Redundancy
– conflicting data
• Program-Data Dependence
– lack of flexibility
• Lack of Data Sharing
– no common names for attributes & entities
• Poor Security

Page 25: Managing Data Resources

25

DBMS Approach

• The Database Management System (DBMS) approach places a common interface between the users of data (the application programs) and the data files.

Page 26: Managing Data Resources

26

DBMS Components

• Data Definition Language, DDL
• Data Manipulation Language, DML
– Structured Query Language, SQL
• Data Dictionary, DD
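To make the three components concrete, here is a small sketch using Python's standard `sqlite3` module. The `employee` table and its columns are made up for the example, and SQLite's `sqlite_master` catalog is used as a stand-in for a data dictionary; other DBMS products expose this information differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# DDL: define the structure of the data.
conn.execute("CREATE TABLE employee (ssn TEXT PRIMARY KEY, name TEXT NOT NULL)")

# DML: store and retrieve data (SQL covers both roles here).
conn.execute("INSERT INTO employee VALUES (?, ?)", ("540-12-3763", "Employee A"))
rows = conn.execute(
    "SELECT name FROM employee WHERE ssn = ?", ("540-12-3763",)
).fetchall()

# Data dictionary: ask the DBMS which tables it knows about.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(rows)    # [('Employee A',)]
print(tables)  # [('employee',)]
```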

Page 27: Managing Data Resources

27

Logical & Physical Views

• Logical View
– how the user sees the data
• Physical View
– how the data is physically saved on the storage media
• The DBMS gives each user their own logical view while storing the data using a single physical view.
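One way a relational DBMS can offer several logical views over one stored table is the SQL `CREATE VIEW` statement. The sketch below uses Python's `sqlite3` module; the `employee` table, its columns, and the view names `directory` and `payroll` are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One stored table holds all of the data.
CREATE TABLE employee (ssn TEXT PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER);
INSERT INTO employee VALUES ('540-12-3763', 'Employee A', 'Sales', 50000);
INSERT INTO employee VALUES ('120-11-0007', 'Employee B', 'Payroll', 60000);

-- Two logical views of the same data for two kinds of users.
CREATE VIEW directory AS SELECT name, dept FROM employee;
CREATE VIEW payroll AS SELECT ssn, salary FROM employee;
""")

directory_rows = conn.execute("SELECT * FROM directory ORDER BY name").fetchall()
payroll_rows = conn.execute("SELECT * FROM payroll ORDER BY ssn").fetchall()

print(directory_rows)  # [('Employee A', 'Sales'), ('Employee B', 'Payroll')]
print(payroll_rows)    # [('120-11-0007', 60000), ('540-12-3763', 50000)]
```

Neither view stores its own copy of the data; both are computed from the single `employee` table when queried.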

Page 28: Managing Data Resources

28

Advantages of DBMS

• Complexity & Confusion reduced
– all data stored in single centralized physical view
• Data redundancy & inconsistency reduced
– data dictionary shows what data elements are available, data element only present “once”
• Program-data dependence reduced
– each user can get desired logical view

Page 29: Managing Data Resources

29

Advantages of DBMS

• Security
– single point of access to data
• Reduced cost
– initial costs of purchasing a DBMS and hiring related staff are high, but savings in future development and maintenance usually offset these costs
• Access & Flexibility
– DML usually provides easier access to data

Page 30: Managing Data Resources

30

Designing Databases

• Hierarchical Data Model
• Network Data Model
• Relational Data Model

Page 31: Managing Data Resources

31

Hierarchical Data Model

Author 1

Book 1 Book 2 Book 3

Publisher A Publisher B Publisher A

Page 32: Managing Data Resources

32

Hierarchical Data Model

• Data records are broken into segments
• Each segment contains some attributes
• Segments are arranged into a hierarchical “tree-like” structure
• Physical location pointers join related segments into records
• Child segments can only have one parent

Page 33: Managing Data Resources

33

Network Data Model

Author 1

Book 1 Book 2 Book 3

Publisher A Publisher B

Page 34: Managing Data Resources

34

Network Data Model

• Same organization as hierarchical data model

• Except that a child segment can have multiple parents
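The one rule that separates the two models can be sketched as follows. The `Segment` class and the `link` function are hypothetical, and in-memory object references stand in for the physical location pointers a real hierarchical or network DBMS would use.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Segment:
    name: str
    parents: list = field(default_factory=list, repr=False)
    children: list = field(default_factory=list, repr=False)


def link(parent, child, allow_multiple_parents):
    """Join two segments; the hierarchical model rejects a second parent."""
    if child.parents and not allow_multiple_parents:
        raise ValueError(f"{child.name} already has a parent")
    parent.children.append(child)
    child.parents.append(parent)


book1, book3 = Segment("Book 1"), Segment("Book 3")
publisher_a = Segment("Publisher A")

# Hierarchical model: Publisher A can hang under only one book.
link(book1, publisher_a, allow_multiple_parents=False)
try:
    link(book3, publisher_a, allow_multiple_parents=False)
    second_parent_allowed = True
except ValueError:
    second_parent_allowed = False  # a duplicate Publisher A segment would be needed

# Network model: the same segment can be shared by both books.
link(book3, publisher_a, allow_multiple_parents=True)
print([p.name for p in publisher_a.parents])  # ['Book 1', 'Book 3']
```

This mirrors the two diagrams: the hierarchical tree repeats Publisher A under two books, while the network version keeps a single Publisher A segment with two parents.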

Page 35: Managing Data Resources

35

Relational Data Model

Author 1

Author 2

Author 3

Book 1

Book 2

Book 3

Book 4

Book 5

Publisher 1

Publisher 2

Page 36: Managing Data Resources

36

Relating Fields

A1 Author 1

A2 Author 2

A3 Author 3

Book 1 A1 P1

Book 2 A3 P2

Book 3 A2 P2

Book 4 A1 P2

Book 5 A1 P1

P1 Publisher 1

P2 Publisher 2

Page 38: Managing Data Resources

38

Relational Data Model

Publisher-table

ID  Publisher
P1  Publisher 1
P2  Publisher 2

Author-table

ID  Author
A1  Author 1
A2  Author 2
A3  Author 3

Book-table

Title   AID  PID
Book 1  A1   P1
Book 2  A3   P2
Book 3  A2   P2
Book 4  A1   P2
Book 5  A1   P1

Page 39: Managing Data Resources

39

Relational Data Model

• Data Records are broken into segments
• Each segment contains some attributes
• Segments are arranged in tables
• There are NO “physical” location pointers between tables
• Relations between tables are “implied” by relating fields

Page 40: Managing Data Resources

40

Relations Generated When Asked

• Relationships between segments are not predefined by pointers in the relational model.

• Tables are JOINed together to display relationships.

• JOINs occur at query time.
• Tables must have a common data element to be joined.

Page 41: Managing Data Resources

41

Example JOIN

SELECT Author, Title, Publisher
FROM "Author-table", "Book-table", "Publisher-table"
WHERE "Author-table".ID = "Book-table".AID
  AND "Book-table".PID = "Publisher-table".ID

Page 42: Managing Data Resources

42

Results of Join

Answer-table

Author    Title   Publisher
Author 1  Book 1  Publisher 1
Author 1  Book 4  Publisher 2
Author 1  Book 5  Publisher 1
Author 2  Book 3  Publisher 2
Author 3  Book 2  Publisher 2
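The join can be reproduced with Python's standard `sqlite3` module. This is a sketch under a few assumptions: the tables are named `author`, `book`, and `publisher` (hyphens are not valid in unquoted SQL identifiers), the column names are lowercased, and the rows are the ones shown on the Relational Data Model slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id TEXT PRIMARY KEY, author TEXT);
CREATE TABLE publisher (id TEXT PRIMARY KEY, publisher TEXT);
CREATE TABLE book (title TEXT, aid TEXT, pid TEXT);

INSERT INTO author VALUES ('A1', 'Author 1'), ('A2', 'Author 2'), ('A3', 'Author 3');
INSERT INTO publisher VALUES ('P1', 'Publisher 1'), ('P2', 'Publisher 2');
INSERT INTO book VALUES
    ('Book 1', 'A1', 'P1'),
    ('Book 2', 'A3', 'P2'),
    ('Book 3', 'A2', 'P2'),
    ('Book 4', 'A1', 'P2'),
    ('Book 5', 'A1', 'P1');
""")

# The relationships exist only in the matching field values; the join is computed at query time.
rows = conn.execute("""
    SELECT author.author, book.title, publisher.publisher
    FROM author, book, publisher
    WHERE author.id = book.aid
      AND book.pid = publisher.id
    ORDER BY author.author, book.title
""").fetchall()

for row in rows:
    print(row)
# ('Author 1', 'Book 1', 'Publisher 1')
# ('Author 1', 'Book 4', 'Publisher 2')
# ('Author 1', 'Book 5', 'Publisher 1')
# ('Author 2', 'Book 3', 'Publisher 2')
# ('Author 3', 'Book 2', 'Publisher 2')
```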

Page 43: Managing Data Resources

43

Relational Model Operations

• Selection
– select which rows to display
• Projection
– select which columns to display
• Join
– combine two or more tables
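The three operations can be sketched on plain Python lists of dictionaries, one dictionary per row. The helper names `select`, `project`, and `join` are hypothetical, and the data is a subset of the author and book tables from the earlier slides.

```python
authors = [
    {"ID": "A1", "Author": "Author 1"},
    {"ID": "A2", "Author": "Author 2"},
    {"ID": "A3", "Author": "Author 3"},
]
books = [
    {"Title": "Book 1", "AID": "A1"},
    {"Title": "Book 2", "AID": "A3"},
    {"Title": "Book 3", "AID": "A2"},
    {"Title": "Book 4", "AID": "A1"},
    {"Title": "Book 5", "AID": "A1"},
]


def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named columns of every row."""
    return [{col: row[col] for col in columns} for row in rows]


def join(left, right, left_col, right_col):
    """Join: pair up rows whose relating fields hold the same value."""
    return [{**l, **r} for l in left for r in right if l[left_col] == r[right_col]]


joined = join(authors, books, "ID", "AID")
by_author_1 = select(joined, lambda row: row["Author"] == "Author 1")
titles = project(by_author_1, ["Title"])
print(titles)  # [{'Title': 'Book 1'}, {'Title': 'Book 4'}, {'Title': 'Book 5'}]
```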

Page 44: Managing Data Resources

44

Types of Relations

• 1-1
– 1-to-1
• 1-n
– 1-to-many
• n-n
– many-to-many

Page 45: Managing Data Resources

45

Name of the game

• Using the relational model,
• Represent each type of relationship
– as simply as possible (using the fewest tables),
– with a minimum of duplicated data, and
– with a minimum of wasted space (empty fields)

Page 46: Managing Data Resources

46

Tables needed for 1-1

Book

Author   Title
Author1  Book1
Author2  Book2
Author3  Book3

Page 47: Managing Data Resources

47

Tables needed for 1-n

Author

ID  Name
1   Author1
2   Author2
3   Author3

Book

AuthorID  Title
1         Book1
1         Book2
2         Book3
3         Book4
2         Book5

Page 48: Managing Data Resources

48

Tables needed for n-n

Author

ID  Name
1   Author1
2   Author2
3   Author3

Writes

AID  BID
1    1
1    2
2    1
2    2
3    1
3    5
3    4
1    5
2    5

Book

ID  Title
1   Book1
2   Book2
3   Book3
4   Book4
5   Book5
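A sketch of the three-table design using Python's `sqlite3` module and the rows from this slide. The lowercase table and column names and the helper functions `authors_of` and `books_by` are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
-- The linking table holds one row per (author, book) pair.
CREATE TABLE writes (
    aid INTEGER REFERENCES author(id),
    bid INTEGER REFERENCES book(id),
    PRIMARY KEY (aid, bid)
);

INSERT INTO author VALUES (1, 'Author1'), (2, 'Author2'), (3, 'Author3');
INSERT INTO book VALUES (1, 'Book1'), (2, 'Book2'), (3, 'Book3'), (4, 'Book4'), (5, 'Book5');
INSERT INTO writes VALUES (1,1), (1,2), (2,1), (2,2), (3,1), (3,5), (3,4), (1,5), (2,5);
""")


def authors_of(title):
    """All authors of one book: follow writes from book to author."""
    rows = conn.execute("""
        SELECT author.name
        FROM author
        JOIN writes ON author.id = writes.aid
        JOIN book ON book.id = writes.bid
        WHERE book.title = ?
        ORDER BY author.name
    """, (title,)).fetchall()
    return [name for (name,) in rows]


def books_by(name):
    """All books by one author: follow writes from author to book."""
    rows = conn.execute("""
        SELECT book.title
        FROM book
        JOIN writes ON book.id = writes.bid
        JOIN author ON author.id = writes.aid
        WHERE author.name = ?
        ORDER BY book.title
    """, (name,)).fetchall()
    return [title for (title,) in rows]


print(authors_of("Book5"))  # ['Author1', 'Author2', 'Author3']
print(books_by("Author3"))  # ['Book1', 'Book4', 'Book5']
```

Because each pairing is a single row in `writes`, neither the author names nor the book titles are duplicated, and no table has empty fields.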

Page 49: Managing Data Resources

49

Advantages & Disadvantages

• Hierarchical & Network Data Models
– faster for “pre-defined” queries
– slower for ad-hoc queries
– inflexible, more expensive to maintain
• Relational Data Models
– flexible, less expensive to maintain
– most queries require joins and are slower than “pre-defined” queries mentioned above

Page 50: Managing Data Resources

50

Entity-relationship diagram

• A conceptual model useful in database design.
• Illustrates the relationships between various entities in the database.
• Entities are represented by rectangles.
• Relationships represented by diamonds.
• Attributes can be assigned to both entities and relationships.

Page 51: Managing Data Resources

51

ER-Diagram

[ER diagram: Authors (ID, Last_Name, First_Name, Middle_Name, DOB, DOD) write Books (Title, Date, Edition) in an n-to-n relationship; Publishers (Name, Address, Phone) publish Books in a 1-to-n relationship.]

Page 52: Managing Data Resources

52

Centralized Database

• All database files are stored on a central computer.

• All database processing is performed by the central computer.

• Problems
– can overload central system
– not very fault tolerant
– communications costs can be high

Page 53: Managing Data Resources

53

Distributed Databases

• Distributed Processing
– processing is performed locally by processors connected by a communications network.
• Distributed Databases
– the physical files that make up the database are stored in more than one location

Page 54: Managing Data Resources

54

Distributed Databases

• Duplicate Database
– each location has its own copy of the entire database.
• Partitioned Database
– each location has a copy of the portion of the database that it needs.

Page 55: Managing Data Resources

55

Distributed Databases

• Central Index
– Records are stored locally, but a centralized index is maintained to quickly locate any record.
• Ask-the-network
– Records are stored locally and the network must be polled each time a record is needed.

Page 56: Managing Data Resources

56

Data Warehousing

• A database with associated reporting and query tools,

• that stores current and historical data extracted from various operational systems

• and consolidated for management reporting and analysis.

Page 57: Managing Data Resources

57

A Data Warehouse...

• Sits on top of existing isolated legacy systems, “islands of information”, to provide an enterprise-wide database.

• Provides single platform, standardized access to current operational data and historical data (not normally maintained on legacy systems).

Page 58: Managing Data Resources

58

Obstacles to Database Implementation

• Organizational
– structural changes
– political changes
• Cost/benefit considerations
• Placement of Data Management Function
– need data administration and planning at highest possible organizational level