database management manual 2010

DATABASE MANAGEMENT

CSYS2404

LECTURE NOTES

© Mrs. Gaye Campbell 2010

Database Management

© Copyright G. Campbell 2010 2

TABLE OF CONTENTS

SYLLABUS/COURSE OUTLINE
  UNIT I – Introduction of Database Concepts
  UNIT II – Database Design
  UNIT III – Introduction to Relational Algebra and SQL
  UNIT IV – Distributed Databases
  UNIT V – Security Issues

UNIT I: INTRODUCTION TO DATABASE CONCEPTS
  The need for File Systems and Databases
  Basic Concepts
    Sample Payroll Database Structure
  The traditional/file oriented approach
    Problems with the Traditional approach
  The database approach
    DBMS (Database management systems)
    Functions common to most databases
  Advantages of databases
  Disadvantages of databases
  Components of a DBMS
  The different types of databases/Database Models
    Hierarchical
    Network
    Relational
    Object-Oriented
    Object-Relational
    Multidimensional

UNIT II: DATABASE DESIGN
  Introduction to the Database System Life Cycle (DBLC)
    Analysis and design phase
    Database implementation and operation phase
  Roles of database personnel
    Data modellers
    Business Analysts


    Database Designers
    Systems Analysts [see Business Systems course]
    Programmers
    Database Administrators
  Database Design – Conceptual, Logical, Physical
    Conceptual design
    Logical Design
    Physical Design
  Database Schema or Levels of abstraction in specifying a database structure
    Definition of database schema
    Explanation of the four database schema
  Entity-Relationship Diagrams
    Types of relationships
    The symbols used in an ERD
    Sample ERDs
    Example of Creating the ERD
    Entity and Referential Integrity
    ERD Exercises
  Functional Dependencies
  Computation of Closures
    Algorithm for finding the closure of a set of attributes
    Closure Exercises
  Armstrong’s Axioms
    Reflexivity
    Augmentation
    Transitivity
    Examples
    EXERCISE
  Covers and their role in determining redundant FDs
    Algorithm to find redundant FDs.
    Exercises - Find the redundant FDs in the following sets:
  1st, 2nd, 3rd Normal Forms
    Definition - A relation is in first normal form (1NF) if:
    Definition - A relation is in second normal form (2NF) if:


    Definition - A relation is in 3rd normal form (3NF) if:
    Comprehensive example (1NF to 3NF)
    Another example of the process.
    Normalization Exercises to 3NF.
  Assessment of file layouts as they affect the functioning of a database.
  Physical and logical data organization.

UNIT III: INTRODUCTION TO RELATIONAL ALGEBRA AND SQL
  The languages used in database systems
  The role of Relational DMLs and DDLs.
  The difference between relational algebra and relational calculus.
  Relational algebra
    Simple projection
    Selection
    Difference (or Set Difference)
    Renaming
    Union
    Intersection
    Division
    Join (natural, equi, inner, outer)
    Cartesian product.
  Relational Algebra Exercises
  SQL Commands – LAB PORTION
    Brief Summary of Commands
    CREATE TABLE (using constraints – primary key, foreign key)
    ALTER TABLE
    INSERT
    SELECT (using WHERE, GROUP BY, ORDER BY, HAVING, aggregate functions, logical operators, comparison operators)
    SELECT sub queries
    Operations on Result Sets
    UPDATE
    DELETE
    CREATE VIEW
    CREATE INDEX


    DROP TABLE
    DROP VIEW
    DROP INDEX
    GRANT and REVOKE
    COMMIT and ROLLBACK
  SQL EXERCISES
    EXERCISE 1 – CREATE TABLE AND ALTER TABLE STATEMENTS
    EXERCISE 2 – INSERT, UPDATE, DELETE, SELECT USING UNION
    EXERCISE 3 - SELECT STATEMENT
    EXERCISE 4 - SELECT STATEMENT USING MORE THAN ONE TABLE
    EXERCISE 5 – DISTINCT, WILDCARD cont’d, SUB QUERY, CREATE INDEX, DROP TABLE, DROP INDEX
    EXERCISE 6 – REVIEW OF ALL COMMANDS

UNIT IV: DISTRIBUTED DATABASES
  Characteristics of a distributed database
  Definition of logical database, local and global application, global intelligence
  Assessment of a distributed database versus a loose connection of independent site
  Terms and concepts used in distributed databases
  Advantages and disadvantages of a distributed database
    Advantages
    Disadvantages
  Practice Questions
  Data warehouse
  Differences between data warehouse and operational database
  Data mart
  On-line analytical processing
  Data mining
  Transactions – Atomic, Consistent, Isolated, Durable (ACID)
  Concurrency control

UNIT V: SECURITY ISSUES
  The role of the Data Dictionary
  What is data security?
  What are Security Risks?
    Security risks and their effects
  Database protection methods - backup and restore methods
  Integrity Preservation – keys (primary and foreign), data validation, authority levels
    Keys
    Data Validation


    Authority Levels
  Security Control – unauthorized access and use, encryption, anti-virus, firewall, SQL views

SAMPLE SQL CODE FOR RECREATING DATABASE

REFERENCES


SYLLABUS/COURSE OUTLINE

THE COUNCIL OF COMMUNITY COLLEGES OF JAMAICA

COURSE NAME: Database Management
COURSE CODE: CSYS2404
CREDITS: 3
CONTACT HOURS: 45 (45 hours theory)
PRE-REQUISITE(S): None
CO-REQUISITE(S): None
SEMESTER:

COURSE DESCRIPTION: This course is designed to ensure that the student completes a study of Database Management Systems. Students will be exposed to database concepts including functional dependencies, SQL and normalization. Emphasis will be placed on the creation and manipulation of databases using Oracle, but this can be extended to any available DBMS.

GENERAL OBJECTIVES: Upon successful completion of this course, students should:

1. understand various terms used in Database Management
2. appreciate the advantages of the database approach
3. understand key components of a database management system
4. appreciate the historical transformation of database models and DBMS
5. know the steps in the Database System Life Cycle
6. appreciate the differences between Logical and Physical Database Design and organization
7. understand functional dependencies
8. understand how to normalize up to 3NF
9. use SQL commands
10. understand how to create reports using ad-hoc SQL commands
11. understand how to solve relational Algebra problems
12. understand distributed database concepts
13. appreciate the importance of maintaining data integrity and security
14. understand the application of Entity Relationship Diagrams


UNIT I – Introduction of Database Concepts

Specific Objectives:

Upon successful completion of this unit, students should be able to:

1. define key terms associated with database management
2. discuss the file oriented versus the database approach
3. discuss advantages associated with database approach as opposed to file-oriented approach
4. identify hardware, software and DBMS components
5. describe features of hierarchical, network, relational, object-oriented and object-relational models

Content:

1. Basic Concepts – character, field, record, table/file, database, Database Management System, primary key, foreign key, secondary key, composite key, super key, candidate key
2. The traditional/file oriented approach
3. The database approach
4. Advantages of databases
5. Components of a DBMS – DDL, DML, Query Language, Report Generator
6. The different types of databases – hierarchical, network, relational, object-oriented, object-relational

UNIT II – Database Design

Specific Objectives:

Upon successful completion of this unit, students should be able to:

1. define the Database System Life Cycle
2. identify the Phases in the Database System Life Cycle
3. identify the roles of database personnel
4. discuss conceptual, logical and physical data design
5. discuss the concept of database schema
6. utilize ERDs to capture data requirements
7. discuss concepts of entity and referential integrity
8. discuss Functional Dependencies (FDs)
9. find redundant FDs in a set
10. normalize to 3NF
11. assess file layouts as they affect the functioning of databases
12. discuss the differences between physical and logical data organization

Content:


1. The Database Management System Life Cycle - Database Analysis, Database Design, Database Implementation, Database Testing and Evaluation, Operation, Database Maintenance
2. Roles of database personnel - Data modelers, Business Analysts, Database Designers, Systems Analysts, Programmers and Database Administrators.
3. Database Design – Conceptual, Logical, Physical
4. Database Schema
5. Entity-Relationship Diagrams
6. Entity and Referential Integrity
7. Functional Dependencies
8. Computation of Closures
9. Armstrong’s Axioms
10. Covers and their role in determining redundant FDs
11. 1st, 2nd, 3rd Normal Forms
12. Assessment of file layouts as they affect the functioning of a database.
13. Physical and logical data organization.

UNIT III – Introduction to Relational Algebra and SQL

Specific Objectives:

Upon successful completion of this unit, students should be able to:

1. discuss and identify the role of Relational DMLs and DDLs
2. differentiate between relational algebra and relational calculus
3. solve Relational Algebra problems
4. utilize SQL commands

Content:

1. The role of Relational DMLs and DDLs.
2. The difference between relational algebra and relational calculus.
3. Introduction to Relational algebra – Simple projection, selection, difference, renaming, union, intersection, division, join (natural, equi, inner, outer) and Cartesian product.
4. SQL Commands - CREATE TABLE (using constraints – primary key, foreign key), ALTER TABLE, INSERT, SELECT (using WHERE, GROUP BY, ORDER BY, HAVING, aggregate functions, logical operators, comparison operators), SELECT sub queries, UPDATE, DELETE, CREATE VIEW, CREATE INDEX, DROP TABLE, DROP VIEW, DROP INDEX, GRANT and REVOKE, COMMIT and ROLLBACK.

UNIT IV – Distributed Databases

Specific Objectives:

Upon successful completion of this unit, students should be able to:


1. define characteristics of Distributed Databases
2. assess a distributed database versus a loose connection of independent sites
3. define terms and concepts used in the distributed database environment
4. identify advantages and disadvantages of distributed databases
5. discuss data warehousing
6. differentiate between a data warehouse and a data mart
7. differentiate between a data warehouse and an operational database
8. discuss On-line analytical processing (OLAP)
9. discuss the concept of data mining
10. discuss the concept of transactions and concurrency control

Content:

1. Characteristics of a distributed database
2. Definition of logical database, local and global application, global intelligence
3. Assessment of a distributed database versus a loose connection of independent sites
4. Terms and concepts used in distributed databases – transparency, homogeneous versus heterogeneous distribution, fragmentation – vertical/horizontal, replication, and allocation
5. Advantages and disadvantages of a distributed database
6. Data mart
7. Data warehouse
8. Differences between data warehouse and operational database
9. On-line analytical processing
10. Data mining
11. Transactions – Atomic, Consistent, Isolated, Durable (ACID)
12. Concurrency control

UNIT V – Security Issues

Specific Objectives:

Upon successful completion of this unit, students should be able to:

1. identify the role of the Data Dictionary/Directory
2. identify methods used in database protection
3. discuss methods used in integrity preservation
4. identify and discuss security control techniques

Content:

1. The role of the Data Dictionary
2. Database protection methods - backup and restore methods
3. Integrity Preservation – keys (primary and foreign), data validation, authority levels
4. Security Control – unauthorized access and use, encryption, anti-virus, firewall, SQL views


METHODS OF DELIVERY:

1. Lectures
2. Discussions
3. Lab

METHODS OF ASSESSMENT AND EVALUATION:

1. Common Coursework 20%
2. Internal Tests 20%
3. Final Examination 60%

RESOURCE MATERIAL:

Prescribed:

Hoffer, J.A., Prescott, M. & Topi, H. (2008) Modern database management. (9th ed.). NJ: Prentice Hall.

Recommended:

Date, C. J. (2003) An introduction to database systems. (8th ed.). NJ: Addison Wesley.

Shah, N. (2004) Database systems using Oracle. (2nd ed.). NJ: Prentice Hall.


UNIT I: INTRODUCTION TO DATABASE CONCEPTS

The need for File Systems and Databases

In order to be competitive in today’s data-driven environment, business organizations have to manage their data well. Data management is the process of identifying effective and efficient methods of collecting, storing and retrieving data. Over the years, this need has given rise to two distinct data management approaches: the file approach and the database approach.

Before we look at the differences between the file approach and the database approach we need to be aware of some basic file/database concepts.

Basic Concepts

Term/Concept: Definition

Data: Raw facts which are important to an organization.

Information: Organized data. What is information for one person may be data for another.

Character: One of a set of symbols, such as letters or digits, that are arranged to express information. Each character belongs to a character set (e.g. ASCII, a 7-bit code that is usually stored in one 8-bit byte).

Field/Attribute/Column: A single unit of data in its simplest form. A field contains a specific piece of information within a record, and a field name uniquely identifies each field. In the example employee table below, the “lastname” field would contain all of the last names of the employees in the table. A field is an attribute or characteristic of an entity.

Data type/Field type: The physical representation of a data value. A data type is a set of data values together with the set of operations that can be applied to those values. The data type determines what kind of data may be stored in the field and which operations may be performed on the stored value.

Record/Row/Tuple: A group of related fields. A record is a collection of related data items (values), each stored in a field. Each field holds an atomic value, that is, a value that cannot be decomposed further. In order to store information, each field has to be associated with a data type. A record contains information about a given person, place, event or thing. A record in an employee table would contain specific information about a particular employee.

Table/File/Relation: A group of records having the same structure. A table is a collection of similar records, which means that all the records within a table must have the same structure (physical and logical). It captures all of the records of a particular type of entity, e.g. the employee table has all of the employee records. The structure of the table is described by its fields, that is, the type of data that will be held in the table.

Entity: An object or event about which someone chooses to collect data. It may be a person, place, event or thing, e.g. student, car, library book, employee or bank account.

Database (db)/Information Repository: A collection of tables which collectively store and provide the information needed by an organization. Common types of databases in society include: payroll, employee data, inventory management/stock, sales, customer data, supplier data, library book management, banking and student registration.

Database Management Systems (DBMS): Complex system software which constructs and maintains the database in a controlled way. It allows creation, access, and management of a database. A database system is essentially nothing more than a computerized record-keeping system. It gives users the following facilities: add new files, insert new data, retrieve data, update data, delete data, and delete files.
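The facilities listed in the DBMS definition above can be tried out in a few lines. The following is a minimal sketch, assuming Python's built-in sqlite3 module stands in for the Oracle DBMS used in this course; the book table and its columns are hypothetical names chosen only for illustration.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Add a new file (table).
conn.execute("CREATE TABLE book (isbn TEXT, title TEXT, copies INTEGER)")

# Insert new data: each tuple is one record.
conn.execute("INSERT INTO book VALUES (?, ?, ?)", ("111", "Databases", 3))
conn.execute("INSERT INTO book VALUES (?, ?, ?)", ("222", "Networks", 1))

# Retrieve data.
titles = [row[0] for row in conn.execute("SELECT title FROM book ORDER BY isbn")]

# Update data.
conn.execute("UPDATE book SET copies = copies + 2 WHERE isbn = ?", ("222",))
copies_222 = conn.execute(
    "SELECT copies FROM book WHERE isbn = ?", ("222",)
).fetchone()[0]

# Delete data.
conn.execute("DELETE FROM book WHERE isbn = ?", ("111",))
remaining = conn.execute("SELECT COUNT(*) FROM book").fetchone()[0]

# Delete the file (table).
conn.execute("DROP TABLE book")
tables_left = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(titles, copies_222, remaining, tables_left)
```

Each statement corresponds to one facility in the list, in the same order. The SQL commands themselves are covered in Unit III.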

Key Attribute(s) used to identify an entity.

Primary key The primary key is one or more fields whose values uniquely identify each record in a table. A primary key cannot allow Null values and must always have a unique index. (A Null value indicates that the field is empty.) A primary key is used to relate a table to foreign keys in other tables. Fields that could be used as primary keys include: TRN, Student id number, Employee id number, License plate number, Passport number, NIS number, Chassis number, Engine number, Part number, Reference number, ISBN on books, Bar code, Department id etc.

Secondary key A set of attributes used for identifying records but not uniquely (e.g. Name)

Candidate key (minimal superkey)

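An attribute (or minimal set of attributes) that can serve as a primary key. A candidate key that is not chosen as the primary key is called an alternate key, and in many DBMSs it can allow null values. E.g. on an employee table the TRN may be used as the key but the NIS No. is also unique.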

Composite key A primary key that consists of two or more attributes.

Foreign key The primary key of one entity that is placed in a second entity for the purpose of accessing the first entity.
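The key types defined above can be tried out with a small sketch. It uses Python's built-in sqlite3 module; the table names, column names and sample values (department, employee, timesheet, 'D01' and so on) are hypothetical and chosen only for illustration. The employee table has a primary key (trn), an alternate key (nis_no) and a foreign key (dept_id), while timesheet has a composite key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when this is on

# Hypothetical tables used only to illustrate the key types.
conn.executescript("""
CREATE TABLE department (
    dept_id   TEXT PRIMARY KEY NOT NULL,
    dept_name TEXT NOT NULL
);
CREATE TABLE employee (
    trn     TEXT PRIMARY KEY NOT NULL,            -- primary key
    nis_no  TEXT UNIQUE,                          -- alternate (candidate) key
    name    TEXT NOT NULL,                        -- secondary key: not unique
    dept_id TEXT REFERENCES department(dept_id)   -- foreign key
);
CREATE TABLE timesheet (
    trn       TEXT NOT NULL REFERENCES employee(trn),
    work_date TEXT NOT NULL,
    hours     REAL,
    PRIMARY KEY (trn, work_date)                  -- composite key
);
""")

conn.execute("INSERT INTO department VALUES ('D01', 'Payroll')")
conn.execute("INSERT INTO employee VALUES ('100-200-300', 'A123456', 'Ann Brown', 'D01')")


def try_insert(sql):
    """Return True if the insert succeeds, False if a key constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Same TRN as an existing employee: rejected by the primary key.
duplicate_pk = try_insert(
    "INSERT INTO employee VALUES ('100-200-300', 'B999999', 'Bob Gray', 'D01')")
# Department 'D99' does not exist: rejected by the foreign key.
bad_fk = try_insert(
    "INSERT INTO employee VALUES ('400-500-600', 'C111111', 'Cy Dunn', 'D99')")
# Same name as an existing employee: accepted, because name is only a secondary key.
same_name = try_insert(
    "INSERT INTO employee VALUES ('700-800-900', 'D222222', 'Ann Brown', 'D01')")

employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
```

Only the first and the last employee rows end up in the table. The two rejected inserts show the DBMS, rather than the application program, enforcing the key rules.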

Superkey A superkey is a collection of one or more fields whose combined values are unique for each record. All keys are superkeys, but not all superkeys are keys. The importance of a superkey is that it allows us to distinguish between the records stored in a table.

Index key This is a field or a collection of fields whose collective value is used to order the information in a database table. The main purpose of an index is to speed up data retrieval
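As a rough illustration of an index being used for retrieval, the sketch below creates an index on a hypothetical employee table in SQLite (via Python's sqlite3 module) and asks the DBMS how it plans to run a lookup by name. The table, the index name idx_employee_name and the data are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee (name, salary) VALUES (?, ?)",
    [(f"Employee {i:04d}", 20000 + i) for i in range(1000)],
)

# Index key: the name field is used to order an auxiliary structure for fast lookup.
conn.execute("CREATE INDEX idx_employee_name ON employee (name)")

# Ask SQLite how it intends to execute the query (the last column holds the description).
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT salary FROM employee WHERE name = ?",
    ("Employee 0042",),
).fetchall()
plan = " ".join(row[-1] for row in plan_rows)

salary = conn.execute(
    "SELECT salary FROM employee WHERE name = ?", ("Employee 0042",)
).fetchone()[0]
```

The plan text names idx_employee_name, which indicates that the lookup goes through the index instead of scanning every record.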

Query A question about the data stored in your tables, or a request to perform an action on the data. It can bring together data from multiple tables to serve as the source of data for a form, report, or data access page.

Form A database object on which you place controls for taking actions or for entering, displaying, and editing data in fields.

Report A database object that prints information that is formatted and organized according to your specifications.

Non prime attribute An attribute that is not a part of the primary key

Sample Payroll Database Structure


The traditional/file oriented approach

The file processing approach is an approach to storing and managing data where each department within an organization typically has its own set of files. The focus is on procedures. Data flows from program to program. Files are designed to meet the needs of a given program. The file approach is often called the traditional approach. In this methodology, the process of data management is handled in an unstructured and ad hoc (unplanned) manner. This means that the data files and the programs which manipulate these files are created on a departmental basis without due consideration of the needs of the other departments.

Can you use the above approach to do this query? Find the employees making < $23000 who a) work in warehouse with floor area larger than 30000 square feet. b) have issued an order to supplier “S6”.

Problems with the Traditional approach

The problems created by this approach may be divided into 2 categories: data problems and programming problems.

Data problems

These problems were brought about by the differences in the format of the duplicated data. These differences were typically seen in 3 areas:

• Typographical errors in the duplicated data

• Data type differences in the duplicated data

• Differences in the logical representation of the duplicated data

Programming problems

The programming languages that were available during this period of time were all 3rd Generation Languages (3GLs), which are also known as procedural languages. Procedural languages suffer from two deficiencies, which make it difficult to write programming routines that manipulate data within the data files. These 2 deficiencies are known as structural dependence and data dependence.


Structural dependence This is the situation in which a programmer needs to have knowledge of the logical structure of a file in order to write programming routines to manipulate the data within the file. The logical structure of a file is concerned with the order in which data occurs within the file.

Data dependence This is the situation in which a programmer needs to have knowledge of the physical representation of the data within the file in order to write programming routines to manipulate the data.
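A small sketch may make structural dependence concrete. It assumes a hypothetical fixed-width file layout (name in character positions 0-9, salary in positions 10-15). The routine below reads the salary by position, so it stops working as soon as a new field is inserted into the record.

```python
# Hypothetical fixed-width layout: name in positions 0-9, salary in positions 10-15.
def read_salary_v1(line):
    """Read the salary by position, as a file-oriented program would."""
    return int(line[10:16])


old_record = "Ann Brown 023000"

# The file structure changes: a 3-character department code is inserted after the name.
new_record = "Ann Brown D01023000"

old_salary = read_salary_v1(old_record)  # works with the layout the program was written for

try:
    read_salary_v1(new_record)
    broke = False
except ValueError:
    # Positions 10-15 now hold "D01023", which is not a number.
    broke = True
```

Every program that reads this file would need the same correction, which is the maintenance problem described above.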

The problems are as follows:

• Application program dependent. E.g. Prog 1 cannot directly access those files designed for Prog 2 (files are often designed specifically for their particular application)

• Separated and isolated data – resulting in difficulty accessing data stored in different files

• Incompatible files

• Files must be pre-sorted

• Redundant data can arise as new programs are written (The same fields are stored in multiple places, the chance for errors is increased. There are also typographical errors in the duplicated data)

• Inconsistent data arises when one program does an update and another does not.

• File structure changes severely impact existing programs.

• Poor data control – with no centralized control at the data element level it is common for the same data element to have multiple names

• Often difficult to understand

The database approach


In the database approach many programs and users share the data in the database. Users access data using software called a Database Management System (DBMS). The focus is on the data and not on procedures. The data resource is separate from the programs.

In the database approach data management is handled in a structured and planned manner. The first step in the database approach is to perform a data requirements analysis of the organization as a whole. In other words, we are concerned with identifying the data needs of the organization, not just the data needs of a specific department. This results in a pool of centralized data, which is then shared among the various organizational departments. The first step in the database approach is geared towards solving the data problems that were present in the file approach. We have now eliminated the duplication of data by ensuring that there is a centralized pool of data, which is accessed by the entire organization. The result is that all the problems that were generated because of duplicated data are now eliminated.

The second step in the database approach is the use of a 4th Generation Language (4GL), which is also known as a nonprocedural language. A nonprocedural language does not suffer from the deficiencies of a 3rd Generation Language. In fact, a 4th Generation Language supports structural independence and data independence.

• Structural independence – the situation in which the logical representation of a file structure is not needed in order to write programming routines for manipulating the file contents.

• Data independence – the situation in which the physical representation of data is not needed in order to write programming routines for manipulating the data.
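The sketch below illustrates structural independence with SQL, run through Python's sqlite3 module. The employee table and its columns are hypothetical. The query refers to fields by name and not by position, so adding a field to the table does not require the query to change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.execute("INSERT INTO employee VALUES ('Ann Brown', 23000)")

# The query names the fields it needs; it says nothing about how they are laid out.
query = "SELECT salary FROM employee WHERE name = 'Ann Brown'"
before = conn.execute(query).fetchone()[0]

# The structure of the table changes: a new field is added.
conn.execute("ALTER TABLE employee ADD COLUMN dept_code TEXT")

# The same query text still runs and returns the same answer.
after = conn.execute(query).fetchone()[0]
```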

Database - An organized collection of data. A set of related files.

Formal Definition: A database is a single organized collection of structured data, stored with a minimum of duplication of data items so as to provide a consistent and controlled pool of data. This data is common to all users of the system but is independent of the programs which use the data.

DBMS (Database management systems)

• The DBMS is an item of complex system software which constructs and maintains the database in a controlled way. It allows creation, access, and management of a database.

• It consists of a collection of interrelated data and a collection of programs to access that data. The data describe one particular enterprise. A DBMS is usually purchased from a software vendor and is the means by which an application program or end-user views and manipulates data in a database.


• It also provides the interface between the user and the data. (The user is unaware of the structure of the database. The DBMS provides the user with the services needed and handles the technicalities of maintaining and using the data.)

• The DBMS also allocates the storage to data.

• It maintains indices so that any required data can be retrieved, and so that separate items of data can be cross referenced. (Research: Look up hashing)

• The DBMS also has the function of providing security for the data. The main aspects of this are:- protecting data against unauthorized access, safeguarding data against corruption, providing recovery and restart facilities after a hardware/software failure.

• The DBMS keeps statistics of the use made of the data. This allows redundant data to be removed.

• It also allows data which is frequently used to be kept in a readily accessible form so that time is saved.

Functions common to most databases

• Data Dictionary (DD)
o Is sometimes called a repository
o Contains data about each file in the DB and each field within the files
o Should only be updated by skilled personnel
o Is used to perform validation checks
o Allows users to specify a default value for a field

• File retrieval and maintenance
o Many tools provided
o Involves adding new records, updating existing records and deleting unwanted records

• Query Language
o Allows users to specify data to be displayed, printed or stored
o Consists of simple English-like statements
o Each has its own grammar and vocabulary
o Usually quickly learned by non-programmers

• Form
o A window used to enter and change data
o When well designed, validates data as entered, reducing data entry errors

• Report Generator
o Also called report writer
o Allows users to design a report on the screen
o Normally used only to retrieve data

• Data Security
o A DBMS provides means to ensure that only authorized users access data at permitted times
o Most DBMSs allow different levels of access privileges

• Backup and Recovery
o A DBMS provides a variety of techniques to restore a damaged or destroyed database to usable form
o A backup or copy of the entire database should be made on a regular basis
o Some DBMSs maintain a log of activities
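To give a feel for the data dictionary function listed above, the sketch below asks SQLite (through Python's sqlite3 module) to describe a table, then shows a default value and a validation check in action. The member table, its fields and the default value 'Kingston' are assumptions made for this example; other DBMSs expose their data dictionary in different ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with a default value and a validation rule.
conn.execute("""
CREATE TABLE member (
    member_id INTEGER PRIMARY KEY,
    last_name TEXT NOT NULL,
    parish    TEXT DEFAULT 'Kingston',
    balance   REAL CHECK (balance >= 0)
)
""")

# PRAGMA table_info returns one row per field: (cid, name, type, notnull, default, pk).
dictionary = {
    row[1]: {
        "type": row[2],
        "not_null": bool(row[3]),
        "default": row[4],
        "primary_key": bool(row[5]),
    }
    for row in conn.execute("PRAGMA table_info(member)")
}

# No parish is supplied, so the default recorded in the dictionary is used.
conn.execute("INSERT INTO member (last_name, balance) VALUES ('Brown', 10.0)")
parish = conn.execute("SELECT parish FROM member").fetchone()[0]

# A negative balance fails the validation check and is rejected.
try:
    conn.execute("INSERT INTO member (last_name, balance) VALUES ('Gray', -5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```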

Advantages of databases

• Data is managed by the DBMS

• Program independent

• Information supplied to managers is more valuable because it is based on a comprehensive collection of data instead of files which contain only the data needed for one application. (Total availability). [Data is centralized and integrated]

• Shared data – data belongs to and is shared, usually over a network, by the entire organization.

• Security settings are usually used to define who has access and at what level.

• As well as routine reports, it is possible to obtain ad hoc reports to meet particular requirements.

• Easier Access – non-technical users can access and maintain data if afforded the necessary privileges. [Better service to the users]

• There is an economic advantage in not duplicating data. In addition, errors due to discrepancies between 2 files are eliminated.

• The amount of input preparation needed is minimized by the single input principle. (This means that there is little duplication of data, one transaction will cause the necessary changes to be made to the data). (Reduced data redundancy – most data items are stored in only one file which greatly reduces duplicate data)

• Improved data integrity – data modification is accomplished by changing only one file, reducing the probability of introducing inconsistencies and redundancies

• A great deal of programming time is saved because the DBMS handles the construction and processing of the files and the retrieval of data. (Reduced development time)

• The integration of different business systems is greatly facilitated.

• Data definition and documentation are standardized.

Disadvantages of databases

• Requires more memory, storage and processing power

• Data are more vulnerable than in file processing systems

[Research – The History of databases, 1970 – E.F. Codd]


Components of a DBMS

The components of a Database System are as follows:

• Database

• Software – DDL, DML, Query Language, Report Generator/Writer (see unit III for more details)

• Hardware

• Users


The different types of databases/Database Models

Every database and DBMS is based on a specific data model. The data model consists of the rules that define how the database organizes data and how users view the organization of data. Databases are classified according to the approaches taken to database organization. The classes are:

• Relational

• Network

• Hierarchical

• Object Oriented

• Multidimensional

A data model is a representation of data and its interrelationships which describe ideas about the real world.

The hierarchical and network database models store their data in a series of records, each of which has a set of field values attached to it. They collect all the instances of a specific record together as a record type. These record types are the equivalent of tables in the relational model, with the individual records being the equivalent of rows. Links between the record types are created using parent-child relationships.

Hierarchical

A hierarchical system is one that is organized in the shape of a pyramid, with each row of objects linked to objects directly beneath it. Hierarchical systems pervade everyday life. Examples of hierarchical systems in society are:

• The army which has generals at the top and privates at the bottom

• The classification of plants and animals according to species, family, genus etc.

Examples of hierarchical systems in computers are:

• File system – a hierarchy of folders and sub-folders in which files are placed.

• Menu driven system – systems of main menus and sub-menus below. (E.g. when you click on File another menu comes up under it).

The hierarchical model is the oldest of the database models, and unlike the network, relational and object oriented models, does not have a well documented history of its conception and initial release. It is derived from the Information Management Systems of the 1950's and 60's. It was adopted by many banks and insurance companies who are still running it as a legacy system to this day. Hierarchical database systems can also be found in inventory and accounting systems used by government departments and hospitals.

The hierarchical model is a tree structured model and consists of many record types with one being the root. The root record type exists at the top of the tree. All data must be accessed through the root. One-to-many relationships exist between records in the hierarchy with one being the parent and the other the child. Each child has a unique parent and a parent can have many children. This child/parent rule assures that data is systematically accessible. To get to a low-level table, you start at the root and work your way down through the tree until you reach your target. Of course, as you might imagine, one problem with this system is that the user must know how the tree is structured in order to find anything.

For example, in the diagram below, the root record type is customer, the parent of order is customer, the parent of parts is order. In order to access an order, you must first access the customer (e.g. by knowing the customer#). Order has two children which are parts and salesman. In order to access the parts, you must first access the customer then the order. The path to the parts record type is therefore Customer, Order, Parts. Hierarchical structures were widely used in the first mainframe database management systems. However, due to their restrictions, they often cannot be used to relate structures that exist in the real world. Hierarchical relationships between different types of data can make it very easy to answer some questions, but very difficult to answer others. If a one-to-many relationship is violated (e.g., a patient can have more than one physician) then the hierarchy becomes a network.
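The access path Customer, Order, Parts can be imitated with nested Python dictionaries. This is only a sketch of the idea, not how a hierarchical DBMS such as IMS is actually programmed; the customer numbers, order numbers and part numbers are made up.

```python
# Root record type: customer. Each customer owns its orders; each order owns its parts.
database = {
    "C001": {
        "name": "Ann Brown",
        "orders": {
            "O100": {"parts": ["P10", "P22"], "salesman": "S01"},
        },
    },
    "C002": {
        "name": "Bob Gray",
        "orders": {
            "O101": {"parts": ["P10"], "salesman": "S01"},
        },
    },
}


def get_parts(db, customer_no, order_no):
    """Follow the only available path: Customer -> Order -> Parts."""
    customer = db[customer_no]
    order = customer["orders"][order_no]
    return order["parts"]


def find_customer_of_order(db, order_no):
    """Without the customer#, every customer under the root has to be searched."""
    for customer_no, customer in db.items():
        if order_no in customer["orders"]:
            return customer_no
    return None


parts = get_parts(database, "C001", "O100")
owner = find_customer_of_order(database, "O101")
```

Finding an order when only the order number is known requires a search from the root, which is one reason some questions are awkward to answer in this model.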

The hierarchical model is no longer used as the basis for current commercially produced systems; however, there are a large number of legacy (old) installations. These legacy systems are likely to be phased out over time, as the number of qualified staff declines due to retirement and retraining.

Examples of hierarchical databases include:

• IMS - Information Management Systems by IBM

• System 2000 by MRI systems corp.

• Adabas

• GT.M

• Caché

• Multidimensional hierarchical toolkit

• Mumps compiler


Advantages of the Hierarchical Model

• Data is unified since all records stem from the root

• Easier to secure the database since you can access data through only one path

• Good for large volumes of one-to-many relationships

• Adding, updating, and deleting records is more efficient and accurate than in the network model

Disadvantages of the Hierarchical Model

• Software dependence (changes to the database structure require modification to all programs which access the database)

• You cannot add a record to a child table until it has already been incorporated into the parent table. This might be troublesome if, for example, you wanted to add a student who had not yet signed up for any courses. In the diagram above, you cannot add a new salesperson until there is a customer and an order.

• Difficult to show many-to-many relationships

• One-to-many relationship can result in redundant data

• Not flexible enough to support ad-hoc queries

• Data can only be accessed through the right path

• It is not user friendly as users have to know the structure in order to access data through the right path

Network

The network model is a database model conceived as a flexible way of representing objects and their relationships. Its original inventor was Charles Bachman, and it was developed into a standard specification published in 1969 by the Conference on Data Systems Languages (CODASYL) Consortium. In many ways, the Network Database model was designed to solve some of the problems with the Hierarchical Database Model.

Where the hierarchical model structures data as a tree of record types, with each record type having one parent record and many children, the network model allows each record type to have multiple parent and child records, forming a lattice structure. This allows the model to support many-to-many relationships. There is no root record type. Data can therefore be accessed through more than one path. For example, in the diagram below, an order can be accessed through either the salesperson or the customer as order has salesperson and customer as its parents. Another way of saying it is that the child of salesperson and customer is order. The path to Parts is either Salesperson, Order, Parts or Customer, Order, Parts. You can therefore access parts by either knowing who the salesperson is or through the order by knowing for example, the order #.
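The two access paths to Parts can be sketched in Python by giving each order two parents. The record numbers are invented, and the dictionaries only stand in for the owner-to-member links that a network DBMS would maintain.

```python
# Each order record has two parents: a customer and a salesperson.
orders = {
    "O100": {"customer": "C001", "salesperson": "S01", "parts": ["P10", "P22"]},
    "O101": {"customer": "C002", "salesperson": "S01", "parts": ["P10"]},
}


def build_links(order_records, parent_field):
    """Build parent -> [order numbers] links for one kind of parent."""
    links = {}
    for order_no, order in order_records.items():
        links.setdefault(order[parent_field], []).append(order_no)
    return links


customer_orders = build_links(orders, "customer")
salesperson_orders = build_links(orders, "salesperson")


def parts_via(links, parent_no):
    """Follow the path Parent -> Order -> Parts for the given parent."""
    return [
        part
        for order_no in links.get(parent_no, [])
        for part in orders[order_no]["parts"]
    ]


# Path 1: Customer, Order, Parts
via_customer = parts_via(customer_orders, "C001")
# Path 2: Salesperson, Order, Parts
via_salesperson = parts_via(salesperson_orders, "S01")
```

Both paths reach the parts of order O100, so there is no single root through which all access must pass.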

The chief argument in favour of the network model, in comparison to the hierarchical model, was that it allowed a more natural modeling of relationships between entities. Although the model was widely implemented and used, it failed to become dominant for two main reasons. Firstly, IBM chose to stick to the hierarchical model in their established products such as IMS and DL/I. Secondly, it was eventually displaced by the relational model, which offered a higher-level, more declarative interface.

Examples of network databases include:

• Codasyl

• Total

• VAX-DBMS

• IMAGE of Hewlett Packard

• DMS-1100 of UNIVAC

• SUPRA of Cincom


Advantages of the Network Model

• Many-to-many relationships are easily represented

• It is more flexible as you can access data through more than 1 path

• Represents redundancy more efficiently than the hierarchical model

Disadvantages of the Network Model

• Software dependence. (Changes to the database structure require modification to all programs which access the database)

• Uses more processing time than the hierarchical structure

• Users must have knowledge of the structure of the database in order to navigate

• Hard to design, use and maintain

Relational

Relational databases consist of tables called relations. Relations are made up of tuples and attributes. The rows/records are called tuples. The columns/fields are called attributes. Relationships between relations are implicit in the overlapping attributes. All tuples in a relation have the same simple format, making them easy to set out under column headings. Each row normally has a unique identifying key. Most relational databases include Structured Query Language (SQL), a query language that allows users to manage, update and retrieve data (e.g. Oracle, MySQL, Ingres, DB2, Sybase, Access, Visual FoxPro).

• Relational DB developer calls file a relation, record a tuple, and field an attribute

• Relational DB user calls file a table, record a row, and field a column

[Diagram: relational tables with the column headings Cust-Name, Salesperson, Order-No and Part-No]
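The idea that relationships are implicit in overlapping attributes can be sketched with two tables that share an order number. The table layout below (orders and order_parts) and the sample rows are assumptions for illustration and are not taken from the diagram. The example uses SQL through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Both tables contain order_no; that shared attribute is what relates them.
conn.executescript("""
CREATE TABLE orders (
    order_no    TEXT PRIMARY KEY,
    cust_name   TEXT,
    salesperson TEXT
);
CREATE TABLE order_parts (
    order_no TEXT,
    part_no  TEXT,
    PRIMARY KEY (order_no, part_no)
);
INSERT INTO orders VALUES ('O100', 'Ann Brown', 'S01'), ('O101', 'Bob Gray', 'S01');
INSERT INTO order_parts VALUES ('O100', 'P10'), ('O100', 'P22'), ('O101', 'P10');
""")

# An ad-hoc query: which parts were ordered through salesperson S01, and by whom?
rows = conn.execute("""
    SELECT o.cust_name, p.part_no
    FROM orders AS o
    JOIN order_parts AS p ON p.order_no = o.order_no
    WHERE o.salesperson = 'S01'
    ORDER BY o.cust_name, p.part_no
""").fetchall()
```

No access path is stored in the database; the join on order_no is stated in the query itself, which is why ad-hoc questions are easier to ask than in the hierarchical and network models.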


Advantages of the Relational Model

• Structural independence (i.e. changes to the database structure DO NOT require modification to all programs which access the database)

• Powerful and flexible query mechanism that makes ad-hoc queries possible

• Easy representation of all types of relationships

• Unification of data that minimizes redundancy and maximizes security

Disadvantages of the Relational Model

• Requires more space and processing power

• Requires more planning if the database structure is to be designed properly

Entity Table/Relationship Table

An entity table is a table structure which allows us to store a set of similar entities. A relationship table, on the other hand, is a table structure that enables us to show the associations which exist among elements in related entity tables.

Standard Notation

Standard Notation is a format for writing database tables so that their logical structure may be understood. In standard notation, each table is given a unique name. The name of the table is written in capital letters. Following the table name is a list of all the fields which are found in the table. These fields are enclosed in brackets. The primary key field for the table is then underlined. For example, let us assume that we want to store the following information about a student: id, Fname, Lname and sex. Let us also assume that we want to store the following information about a subject: subid, subName, sublength. Finally, we want to show the relationship between each student and the subjects taken in another table called takes. If we assume that id is the primary key for the student table and that subid is the primary key for the subject table, we will end up with the following table structures in standard notation.

STUDENT(id, Fname, Lname, sex)
SUBJECT(subid, subName, sublength)
TAKES(id, subid)
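One possible translation of these three tables into SQL is sketched below, using Python's sqlite3 module. The data types, the sample rows and the choice of (id, subid) as the composite primary key of TAKES are assumptions; the notation above gives only the table and field names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when this is on

conn.executescript("""
-- Entity tables
CREATE TABLE student (
    id    INTEGER PRIMARY KEY,
    Fname TEXT,
    Lname TEXT,
    sex   TEXT
);
CREATE TABLE subject (
    subid     TEXT PRIMARY KEY,
    subName   TEXT,
    sublength INTEGER
);
-- Relationship table: each row associates one student with one subject
CREATE TABLE takes (
    id    INTEGER REFERENCES student(id),
    subid TEXT    REFERENCES subject(subid),
    PRIMARY KEY (id, subid)  -- assumed composite key
);

INSERT INTO student VALUES (1, 'Ann', 'Brown', 'F'), (2, 'Bob', 'Gray', 'M');
INSERT INTO subject VALUES ('CSYS2404', 'Database Management', 15),
                           ('MATH1001', 'Calculus', 15);
INSERT INTO takes VALUES (1, 'CSYS2404'), (1, 'MATH1001'), (2, 'CSYS2404');
""")

# Which student takes which subject?
rows = conn.execute("""
    SELECT s.Fname, sub.subName
    FROM takes AS t
    JOIN student AS s   ON s.id = t.id
    JOIN subject AS sub ON sub.subid = t.subid
    ORDER BY s.id, sub.subid
""").fetchall()
```

STUDENT and SUBJECT act as entity tables, while TAKES is a relationship table holding only the keys of the two entities it links.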

Object-Oriented

Summary

o Stores data in objects (an object contains data plus the actions that process the data)
o Can usually store more types of data than Relational databases
o Can usually access data faster than the Relational DB
o Stores unstructured data more efficiently than the Relational DB
o Examples: FastObjects, GemStone

What is an Object?

An object generally is any item that can be individually selected and manipulated. This can include shapes and pictures that appear on a screen as well as less tangible software entities. In object-oriented programming an object is a self-contained entity that consists of both data and procedures to manipulate the data. In other words, an object is an item that contains data, as well as the actions that read or process the data. Real-world objects share two characteristics: they all have state and behavior. For example, dogs have state (name, color, breed, hungry) and behavior (barking, fetching, wagging tail). Bicycles have state (current gear, current pedal cadence, two wheels, number of gears) and behavior (braking, accelerating, slowing down, changing gears). Software objects are modeled after real-world objects in that they too have state and behavior. You might want to represent real-world dogs as software objects in an animation program or a real-world bicycle as a software object in the program that controls an electronic exercise bike. You can also use software objects to model abstract concepts.

What is a Class?

A class is a category of objects. For example, there might be a class called shape that contains objects which are circles, rectangles, and triangles. The class defines all the common properties (characteristics) of the different objects that belong to it. A class is a special programming construct that allows us to create objects. In other words, a class provides the blueprint for the creation of an object. The class must specify a description of the data that is stored and a description of the operations that the object can provide.

As indicated above, each object must have a state and a set of methods, which are encapsulated (contained) inside the object. The state refers to the data that is stored inside the object, while the methods/behaviours refer to the set of operations/functions, which the object can perform. For example, a user can click on a button, put the mouse over the button, right click or double click on the button. Click, double click, right click, mouse over etc are therefore examples of methods. When the user clicks on the button, the relevant code for the particular user action is executed. Each object must have a set of well-defined public interfaces, which a client may use to get the object to perform a specific operation.
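The bicycle example can be written as a short Python class to show the difference between a class (the blueprint) and its objects, and between state and methods. The attribute and method names are illustrative choices.

```python
class Bicycle:
    """Blueprint (class) from which individual bicycle objects are created."""

    def __init__(self, number_of_gears):
        # State: the data stored inside each object.
        self.number_of_gears = number_of_gears
        self.current_gear = 1
        self.speed = 0

    # Methods/behaviours: the operations the object can perform.
    def change_gear(self, gear):
        if not 1 <= gear <= self.number_of_gears:
            raise ValueError("no such gear")
        self.current_gear = gear

    def accelerate(self, amount):
        self.speed += amount

    def brake(self, amount):
        self.speed = max(0, self.speed - amount)


# Two objects created from the same class; each keeps its own state.
road_bike = Bicycle(number_of_gears=18)
city_bike = Bicycle(number_of_gears=3)

road_bike.change_gear(5)
road_bike.accelerate(20)
road_bike.brake(5)
```

After these calls road_bike is in gear 5 with a speed of 15, while city_bike is unchanged, since the state belongs to the object and not to the class.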

Examples of objects. An object oriented database can contain many classes of objects. These include:

• Command buttons

• List boxes

• Data windows


• Windows

• Menus

• Text boxes

• Pictures

• Audio clips

• Video clips (animation)

• Students

• Courses

• Employees

What is an object-oriented database (OODB)?

Object-oriented databases or object database management systems grew out of research during the early to mid-1980s into having intrinsic database management support for graph-structured objects. The term "object-oriented database system" first appeared around 1985. An object-oriented database stores data in objects. The most significant characteristic of object-oriented database technology is that it combines object-oriented programming with database technology to provide an integrated application development system. Object-oriented databases are designed to work well with object-oriented programming languages such as Java, C#, and C++. An object contains data, as well as actions that read or process the data. A Member object, for example, might contain data about a member such as Member ID, First Name, Last Name, Address, and so on. It also could contain instructions on how to print the member record or the formula required to calculate a member's balance due. A record in a relational database, by contrast, would contain only data about a member.

Object-oriented databases have several advantages compared with relational databases. They can store more types of data, access this data faster, and allow programmers to reuse objects. An object-oriented database stores unstructured data more efficiently than a relational database. Unstructured data includes photographs, video clips, audio clips, and documents. When users query an object-oriented database, the results often display more quickly than the same query of a relational database. If an object already exists, programmers can reuse it instead of creating a new object, saving on program development time. For example, if a Close button exists on each screen, the programmer only needs to write the code once, then place the same button on each screen. This is called inheritance as discussed below. The following are features of an object-oriented database:

• Inheritance – the ability to create new objects by allowing them to automatically obtain the data members and the data operations of an existing class without rewriting the code that is present in the existing class.


• Polymorphism (many forms) – the ability to have multiple classes of objects using the same interfaces although the implementation details may vary from object to object. For example, you can have a function/subroutine that calculates the area of an object. The way it calculates area depends on the type of object that called the function. This is because the formula for area is different for circle, rectangle, triangle etc. In other words, there is one function called CALCULATE_AREA and multiple objects will call this function, but the function behaves differently from object to object.

• Encapsulation – the ability of an object to hide its internal representation from the program that uses it. This is accomplished by defining public interfaces and by specifying that these public interfaces must be used when accessing the internal data.

• Information-hiding - an object has a public interface that other objects can use to communicate with it. The object can maintain private information and methods that can be changed at any time without affecting other objects that depend on it. You don't need to understand a bike's gear mechanism to use it.
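The features listed above can be seen together in a short Python sketch built around the area example. The class names and the method name calculate_area are chosen to mirror the text. Note that Python marks internal data with a leading underscore by convention only; it does not strictly prevent access.

```python
import math


class Shape:
    """Base class: holds what all shapes have in common."""

    def __init__(self, label):
        self._label = label  # internal data, hidden by convention

    def calculate_area(self):
        raise NotImplementedError("each kind of shape supplies its own formula")

    def describe(self):
        # Public interface; works for any shape without knowing its internals.
        return f"{self._label} with area {self.calculate_area():.2f}"


class Circle(Shape):  # inheritance: Circle obtains describe() from Shape
    def __init__(self, radius):
        super().__init__("circle")
        self._radius = radius

    def calculate_area(self):
        return math.pi * self._radius ** 2


class Rectangle(Shape):
    def __init__(self, width, height):
        super().__init__("rectangle")
        self._width = width
        self._height = height

    def calculate_area(self):
        return self._width * self._height


class Triangle(Shape):
    def __init__(self, base, height):
        super().__init__("triangle")
        self._base = base
        self._height = height

    def calculate_area(self):
        return 0.5 * self._base * self._height


# Polymorphism: the same call behaves differently for each type of object.
shapes = [Circle(1.0), Rectangle(2.0, 3.0), Triangle(4.0, 5.0)]
areas = [shape.calculate_area() for shape in shapes]
descriptions = [shape.describe() for shape in shapes]
```

The loop never checks what kind of shape it is dealing with; each object answers calculate_area with its own formula.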

Examples of object oriented databases include:

• FastObjects

• GemStone

• KE Texpress

• ObjectStore

• Versant

Examples of applications appropriate for an object-oriented database include the following:

• A multimedia database stores images, audio clips, and/or video clips. For example, a geographic information system (GIS) database stores maps. A voice mail system database stores audio messages. A television news station database stores audio and video clips.

• A groupware database stores documents such as schedules, calendars, manuals, memos, and reports. Users perform queries to search the document contents. For example, you can search people's schedules for available meeting times.

• A computer-aided design (CAD) database stores data about engineering, architectural, and scientific designs. Data in the database includes a list of components of the item being designed, the relationship among the components, and previous versions of the design drafts.

• A hypertext database contains text links to other types of documents. A hypermedia database contains text, graphics, video, and sound. The Web contains a variety of hypertext and hypermedia databases. You can search these databases for items such as documents, graphics, audio and video clips, and links to Web pages.

• A Web database links to an e-form on a Web page. The Web browser sends and receives data between the form and the database.


OODBs add database functionality to object programming languages. A major benefit is the unification of the application and database development into a seamless data model and language environment. As a result, applications require less code, use more natural data modeling, and code bases are easier to maintain. Object developers can write complete database applications with a modest amount of additional effort.

According to Rao (1994), "The object-oriented database (OODB) paradigm is the combination of object-oriented programming language (OOPL) systems and persistent systems. The power of the OODB comes from the seamless treatment of both persistent data, as found in databases, and transient data, as found in executing programs." Data in a database is said to be persistent because you can read a record at one point in time, read it again at a later point in time, and the record is still there. In other words, the record is not transient (temporary).

In a relational DBMS a complex data structure must be flattened out to fit into tables, or joined together from those tables to form the in-memory structure. An OODB avoids this translation overhead when it stores or retrieves a web or hierarchy of interrelated objects. This one-to-one mapping of object programming language objects to database objects has two benefits over other storage approaches: it provides higher performance management of objects, and it enables better management of the complex interrelationships between objects. This makes object DBMSs better suited to support applications which have complex relationships between data, such as financial portfolio risk analysis systems, telecommunications service applications, world wide web document structures, design and manufacturing systems, and hospital patient record systems.


Representation of an object oriented database.

In the sample website below, the object-oriented database contains buttons and a map. When the user clicks on a particular area of the map, information on that area will appear. When the user clicks on a button, there is a link to another web page. When the user puts their mouse over a button, a description of the button appears.

Object-Relational

What is a hybrid object-relational database (ORD)?

An object-relational database (ORD) or object-relational database management system (ORDBMS) combines features of the relational and object-oriented data models. It is a relational database management system that allows developers to integrate the database with their own custom data types and methods. The term object-relational database is sometimes used to describe external software products running over traditional DBMSs to provide similar features; these systems are more correctly referred to as object-relational mapping systems.

Whereas RDBMS or SQL-DBMS products focused on the efficient management of data drawn from a limited set of data types (defined by the relevant language standards), an object-relational DBMS allows software developers to integrate their own types and the methods that apply to them into the DBMS. The goal of ORDBMS technology is to allow developers to raise the level of abstraction at which they view the problem domain.

Object-relational database management systems (ORDBMSs) add new object storage capabilities to the relational systems at the core of modern information systems. These new facilities integrate management of traditional fielded data, complex objects such as time-series and geospatial data and diverse binary media such as audio, video, images, and applets. An applet is an application that has limited features, requires limited memory resources, and is usually portable between operating systems. By encapsulating methods with data structures, an ORDBMS server can execute complex analytical and data manipulation operations to search and transform multimedia and other complex objects. As an evolutionary technology, the object-relational (OR) approach has inherited the robust transaction- and performance-management features of its relational ancestor and the flexibility of its object-oriented cousin. Database designers can work with familiar tabular structures while assimilating new object-management possibilities.
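The idea of storing a developer-defined type in a table column can be illustrated, very loosely, with Python's built-in sqlite3 module. SQLite is not an ORDBMS; the conversion below happens in the Python driver rather than inside the database server, so it is closer to the mapping products mentioned earlier. The `Point` class, the `site` table and the sample values are assumptions made for this sketch.

```python
import sqlite3

class Point:
    """A user-defined type (hypothetical) to be stored in a column."""
    def __init__(self, x, y):
        self.x, self.y = x, y

def adapt_point(point):
    # Python object -> value the database can store
    return f"{point.x};{point.y}"

def convert_point(raw):
    # stored bytes -> Python object
    x, y = map(float, raw.split(b";"))
    return Point(x, y)

sqlite3.register_adapter(Point, adapt_point)
sqlite3.register_converter("point", convert_point)

# PARSE_DECLTYPES makes the driver apply the converter to columns declared as "point".
con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
con.execute("CREATE TABLE site (name TEXT, location point)")
con.execute("INSERT INTO site VALUES (?, ?)", ("Depot", Point(4.0, -3.2)))
row = con.execute("SELECT name, location FROM site").fetchone()
```

After the query, `row[1]` is a `Point` object again rather than a plain string. A true ORDBMS goes further by letting the server itself understand the type and run methods on it.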

Examples of Object-relational databases include:

• DB2

• JDataStore

• Oracle

• Polyhedra

• PostgreSQL

What is Object Definition Language (ODL)?

Object-oriented and object-relational databases often use a query language called object query language (OQL) to manipulate and retrieve data. These databases also have an object definition language (ODL), which is used to define the objects in the database. ODL must specify a description of the data that is stored in objects as well as a description of the operations that the objects can provide. For example, an object could be defined as being a command button, with operations that manipulate the button in various ways such as: raise the button, move its location, bring it into focus, enlarge it etc.

Multidimensional

o Stores data in dimensions.

o The number of dimensions varies.

o Most have a time dimension.

o Examples: D3, Oracle Express

The following shows the difference between the relational view of sales data and the multidimensional view of sales data.


Relational View

INVOICE Table

Number  Date     Customer  Amount
2034    15/5/96  Dartonik  $3500
2035    15/5/96  INC       $1800
2036    16/5/96  Dartonik  $2000
2037    16/5/96  INC       $800

LINE Table

Number  Product   Price  Quantity
2034    Mouse     $150   20
2034    Diskette  $50    10

Multidimensional View

                    Time Dimension
Customer Dimension  15/5/96  16/5/96  Totals
Dartonik            $3500    $2000    $5500
INC                 $1800    $800     $2600
Totals              $5300    $2800    $8100

Sales figures occur at the intersection of a customer row and a time column.
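To see how the multidimensional view is derived from the relational one, the INVOICE rows can be regrouped by customer and date. The sketch below uses plain Python rather than a real multidimensional DBMS; the variable names are assumptions, and the data is taken from the INVOICE table above.

```python
from collections import defaultdict

# (number, date, customer, amount) rows from the INVOICE table
invoices = [
    (2034, "15/5/96", "Dartonik", 3500),
    (2035, "15/5/96", "INC", 1800),
    (2036, "16/5/96", "Dartonik", 2000),
    (2037, "16/5/96", "INC", 800),
]

# cube[customer][date] holds the sales figure at that intersection
cube = defaultdict(lambda: defaultdict(int))
for number, date, customer, amount in invoices:
    cube[customer][date] += amount

# Row totals (per customer), column totals (per date) and the grand total
customer_totals = {c: sum(by_date.values()) for c, by_date in cube.items()}
date_totals = defaultdict(int)
for by_date in cube.values():
    for date, amount in by_date.items():
        date_totals[date] += amount
grand_total = sum(customer_totals.values())
```

Each cell, row total and column total computed here matches the corresponding figure in the Multidimensional View.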

[Extra Research: semi-structured model, associative model]


UNIT II: DATABASE DESIGN

Introduction to the Database System Life Cycle (DBLC)

The DBLC is made up of the following phases:

• Database Analysis

• Database Design

• Database Implementation

• Database Testing and Evaluation

• Database Operation

• Database Maintenance

A database passes through each phase of this cycle as it is designed and built. The steps in the cycle are further broken down as follows:

Analysis and design phase

• Requirements formulation and analysis

• Logical design

• Implementation design

• Physical design

Database implementation and operation phase

• Database implementation

• Operation and monitoring

• Modification and adaptation

Database Analysis

This phase is carried out during the analysis phase of the systems development life cycle (SDLC). The main aim of database analysis is to perform the following functions:

• Analyse the current situation of the company (initial study)

• Define the problems being experienced

• Define organizational objectives and business rules (for validation rules etc.)

• Define the scope and the boundaries of the project

Database Design

This phase is concerned with performing the following functions:

• Conceptual Design – how data relates to each other (models the real world) (e.g. ERD)

• Logical Design – the information content of the database (tables/objects and links)

• Physical Design – layout on secondary storage (indexing, data types, access methods etc.)


[Research the various access methods: Indexed, Sequential, Random/Direct]

Database Implementation

This phase is concerned with the actual creation of the database according to the database design constructed above. It is also concerned with the implementation of security routines, business rules, concurrency control etc., and with the population of the database. [In other words this is where we create the database structure using SQL commands]

Database Testing and Evaluation

This phase is concerned with running tests to ensure that the database will meet the needs of the organization. This involves verifying that the appropriate business rules are being applied, that the security of the database is indeed intact etc. A failure to meet the evaluation criteria may signal that changes are needed in the conceptual, logical or physical design. This phase also involves testing the programs that will use the database to ensure that the interface works.

Database Operation

In this step users are actually using the database (e.g. adding records) through the relevant application software.

Database Maintenance

This phase is concerned with ensuring that the database is functional and reliable. This includes making modifications (e.g. adding new fields, increasing field sizes etc.). Maintenance is often attained by performing the following activities:

• Preventive maintenance (e.g. backup)

• Corrective maintenance (recovery from failure)

• Adaptive maintenance (adding new entities, enhancing performance etc)

• Performing periodic security audit checks


Roles of database personnel

Data modellers

Database design seeks to design the logical and physical structure of one or more databases to accommodate the information needs of the users in an organization for a defined set of applications. The design process roughly follows five steps:

1. planning and analysis
2. conceptual design
3. logical design
4. physical design
5. implementation

The data model is one part of the conceptual design process. The most widely used form of data modelling is the Entity-Relationship (ER) approach. The role of the data modeller therefore is to create the data model or to carry out conceptual database design.

Business Analysts

This person has both business and computer knowledge. The term Business Analyst (BA) is used to describe a person who practices the discipline of business analysis. A business analyst or "BA" is responsible for analyzing the business needs of clients to help identify business problems and propose solutions. Within the systems development life cycle domain, the business analyst typically performs a liaison function between the business side of an enterprise and the providers of services to the enterprise. Common alternative titles are systems analyst, and functional analyst, although some organizations may differentiate between these titles and corresponding responsibilities.

The International Institute of Business Analysis has the following definition of the role: "A business analyst works as a liaison among stakeholders in order to elicit, analyze, communicate and validate requirements for changes to business processes, policies and information systems. The business analyst understands business problems and opportunities in the context of the requirements and recommends solutions that enable the organization to achieve its goals."

The British Computer Society proposes the following definition of a business analyst: "An internal consultancy role that has responsibility for investigating business systems, identifying options for improving business systems and bridging the needs of the business with the use of IT."

This person critically evaluates the information gathered. He/She should have strong analytical skills in order to translate business needs into requirements, as well as good communication skills and the ability to challenge business units.

Database Designers

The process of designing a database generally consists of a number of steps which will be carried out by the database designer. Usually, the database designer must:

• Determine the data to be stored in the database

• Determine the relationships between the different data elements

• Superimpose a logical structure upon the data on the basis of these relationships.

[See the Database Design section for more details]

Systems Analysts [see Business Systems course]

The systems analyst analyses and designs systems that meet the computer requirements of an organization. He/She uses computer technology to solve problems. A systems analyst is responsible for researching, planning, coordinating and recommending software and system choices to meet an organization's business requirements. The systems analyst plays a vital role in the systems development process.

A successful systems analyst must acquire four skills: analytical, technical, managerial, and interpersonal. Analytical skills enable systems analysts to understand the organization and its functions, which helps them to identify opportunities and to analyze and solve problems. Technical skills help systems analysts understand the potential and the limitations of information technology. The systems analyst must be able to work with various programming languages, operating systems, and computer hardware platforms. Managerial skills help systems analysts manage projects, resources, risk, and change. Interpersonal skills help systems analysts work with end users as well as with analysts, programmers, and other systems professionals.

Because they must write user requests into technical specifications, the systems analysts are the liaisons between vendors and the IT professionals of the organization they represent. They may be responsible for developing cost analysis, design considerations, and implementation time-lines. They may also be responsible for feasibility studies of a computer system before making recommendations to senior management.

Systems analysts are called Systems Architects in some companies.

Basically, a systems analyst performs the following tasks:

• Interact with the customers to know their requirements

• Interact with designers to convey the possible interface of the software

• Interact with/guide the coders/developers to keep track of system development

• Perform system testing with sample/live data with the help of testers

• Implement the new system

• Prepare documentation

Many systems analysts have morphed into business analysts.

Programmers

A programmer writes, tests and modifies computer programs. This person must be able to communicate effectively, write documentation (including user manuals), conduct training, and consult with users, engineers etc. [Please note that programming languages are numerous and change from time to time. Programmers should therefore have the ability to learn new languages on their own as technology changes.]


Database Administrators

A database administrator (DBA) is a person who is responsible for the environmental aspects of a database. Managing a company’s database requires a great deal of coordination, and the role of coordinating the use of the database belongs to the DBA. The duties of a database administrator vary depending on job description, corporate and IT policies, and the technical features and capabilities of the database management systems (DBMSes) being administered. They nearly always include disaster recovery (backups and testing of backups), performance analysis and tuning, and some database design or assistance with it.

Database administrators work with database management systems software and determine ways to organize and store data. They identify user requirements, set up computer databases, and test and coordinate modifications to the computer database systems. An organization’s database administrator ensures the performance of the system, understands the platform on which the database runs, and adds new users to the system. Because they also may design and implement system security, database administrators often plan and coordinate security measures. With the volume of sensitive data generated every second growing rapidly, data integrity, backup systems, and database security have become increasingly important aspects of the job of database administrators. Their salaries range from US$65,000 to US$86,000 depending on qualifications and experience.

The administrative controls carried out by the DBA therefore include the following:

• Selects and implements the DBMS

• Develops database models (e.g. entity relationship diagrams)

• Creates and maintains the data dictionary (see footnote 1). This includes documentation of the data dictionary.

• Ensures that the database structure is documented

• Provides manuals describing the facilities the database offers and how to make use of these facilities. Ensures that the facilities for retrieving data and for structuring reports are appropriate to the needs of the organization

• Manages and evaluates security of the database (includes backup and recovery)

• Verifies database integrity

• Monitors performance of the database

• Recoverability – checks backup and recovery/restore procedures

• Performs archiving (backs up and removes historical data from current files)

• Appraises the performance of the database and takes corrective action if performance degrades

• Periodically appraises the data to ensure it is complete, accurate and not duplicated

• Availability – ensures that the database is running when necessary

• Uses query languages to obtain reports of the information in the database

1 A data dictionary (also called repository) is a DBMS element that contains data about each table in a database and each field within those tables.


Although not strictly part of a database administrator's duties, logical and physical design of databases is sometimes part of the job. These functions are traditionally thought of as being the duties of a database analyst or database designer.

[Research – Salaries of the above job titles in various companies]


Database Design – Conceptual, Logical, Physical

Database Design is the process of developing a database structure from user requirements. It is the process of producing a detailed data model of a database. This logical data model contains all the needed logical and physical design choices and physical storage parameters needed to generate a design in a Data Definition Language, which can then be used to create a database. A fully attributed data model contains detailed attributes for each entity. The term database design can be used to describe many different parts of the design of an overall database system. Principally, and most correctly, it can be thought of as the logical design of the base data structures used to store the data. In the relational model these are the tables and views. In an object database the entities and relationships map directly to object classes and named relationships. However, the term database design could also be used to apply to the overall process of designing, not just the base data structures, but also the forms and queries used as part of the overall database application within the database management system (DBMS).

The Database Design Process

The process of designing a database generally consists of a number of steps which will be carried out by the database designer. Not all of these steps will be necessary in all cases. Usually, the designer must:

• Determine the data to be stored in the database

• Determine the relationships between the different data elements

• Superimpose a logical structure upon the data on the basis of these relationships.

Determining data to be stored

In a majority of cases, the person who is doing the design of a database is a person with expertise in the area of database design, rather than expertise in the domain from which the data to be stored is drawn e.g. financial information, biological information etc. Therefore the data to be stored in the database must be determined in cooperation with a person who does have expertise in that domain, and who is aware of what data must be stored within the system.

This process is one which is generally considered part of requirements analysis, and requires skill on the part of the database designer to elicit the needed information from those with the domain knowledge. This is because those with the necessary domain knowledge frequently cannot express clearly what their system requirements for the database are as they are unaccustomed to thinking in terms of the discrete data elements which must be stored. Data to be stored can be determined by Requirement Specification.


Conceptual design

Once a database designer is aware of the data which is to be stored within the database, they must then determine how the various pieces of that data relate to one another. When performing this step, the designer is generally looking out for the dependencies in the data, where one piece of information is dependent upon another, i.e. once the value of one piece of information is known, the value of the other is fixed. For example, in a list of names and addresses, assume the normal situation where two people can have the same address, but one person cannot have two addresses. The address is then dependent upon the name: given a name, the associated address can be determined uniquely, so two entries with different addresses must belong to different names. However, the inverse is not necessarily true: given an address, the name cannot be determined uniquely, because several people may share that address.
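The name-and-address situation can be checked mechanically: each name should lead to exactly one address, while one address may lead to several names. The following is a minimal sketch; the function name `determines` and the sample addresses are assumptions made for illustration.

```python
def determines(rows, lhs, rhs):
    """Return True if every value of column lhs is paired with only one value of column rhs."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        if key in seen and seen[key] != value:
            return False  # same lhs value, two different rhs values
        seen[key] = value
    return True

# Two people share an address; nobody has two addresses.
people = [
    {"name": "Alice", "address": "12 Hope Road"},
    {"name": "Boris", "address": "12 Hope Road"},
    {"name": "Jane", "address": "3 King Street"},
]

name_determines_address = determines(people, "name", "address")   # True
address_determines_name = determines(people, "address", "name")   # False
```

A check like this only tests the sample data; whether the dependency really holds is a business rule that must be confirmed with the domain expert.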

Logical Design

This involves the design of the entire information content of the database. It is the consolidation of all user requirements into a DBMS-independent information structure (conceptual schema). The conceptual schema accurately models the real world organization and its important data elements and relationships. The conceptual schema normally used is the ERD. Once the relationships and dependencies amongst the various pieces of information have been determined, it is possible to arrange the data into a logical structure which can then be mapped into the storage objects supported by the database management system.

In the case of relational databases the storage objects are normalized tables which store data in rows and columns. Each table may represent an implementation of either a logical object or a relationship joining one or more instances of one or more logical objects. Relationships between tables may then be stored as links connecting child tables with parents. Since complex logical relationships are themselves tables they will probably have links to more than one parent.
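As a small illustration of tables linked to their parents, the sketch below uses Python's built-in sqlite3 module. The `student`, `course` and `enrolment` tables are hypothetical; `enrolment` is a relationship stored as a table, so it has links to two parents.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces links only when this is on
con.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE course  (course_id  TEXT PRIMARY KEY, title TEXT NOT NULL);
-- Relationship table: a child of both student and course
CREATE TABLE enrolment (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  TEXT    NOT NULL REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
""")

con.execute("INSERT INTO student VALUES (1, 'Alice')")
con.execute("INSERT INTO course VALUES ('CSYS2404', 'Database Management')")
con.execute("INSERT INTO enrolment VALUES (1, 'CSYS2404')")

# A child row whose parent does not exist is rejected by the DBMS.
try:
    con.execute("INSERT INTO enrolment VALUES (99, 'CSYS2404')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The second enrolment fails because there is no student 99, which shows the link between child and parent being enforced by the database rather than by the application.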

In an Object database the storage objects correspond directly to the objects used by the Object-oriented programming language used to write the applications that will manage and access the data. The relationships may be defined as attributes of the object classes involved or as methods that operate on the object classes.

Logical design results in the logical database structure.

Physical Design

This results in a physical database structure which is developed from the logical structure. This determines the layout or configuration on secondary storage.

In other words the physical design of the database specifies the physical configuration of the database on the storage media. This includes detailed specification of data elements, data types, indexing options, and other parameters residing in the DBMS data dictionary. It is the detailed design of the system, including its modules and the hardware and software specifications of the database.


Physical design can be roughly divided into 3 steps:

• Stored record format design - concerned with formatting stored data, based on analysis of the characteristics of data item types, the distribution of data item values, and their usage by various applications.

• Stored record clustering - physical allocation of stored records. Record clustering places the same or different record types together in blocks on the storage device.

• Access method design - provide storage and retrieval capabilities for data stored on physical devices.
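One access method decision, adding an index, can be observed with Python's built-in sqlite3 module. This is a sketch only: the `customer` table, the cities and the index name are assumptions, and the exact wording of the query plan differs between SQLite versions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("Dartonik", "Kingston"), ("INC", "Montego Bay")],
)

query = "EXPLAIN QUERY PLAN SELECT * FROM customer WHERE name = 'INC'"

# Without an index the DBMS has to scan the whole table.
plan_before = " ".join(row[-1] for row in con.execute(query))

# Physical design decision: add an index on the column used for lookups.
con.execute("CREATE INDEX idx_customer_name ON customer(name)")
plan_after = " ".join(row[-1] for row in con.execute(query))
```

Printing `plan_before` and `plan_after` shows the change from a full table scan to a search that uses `idx_customer_name`. The query itself is unchanged, which is why access method design belongs to the physical level.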


Database Schema or Levels of abstraction in specifying a database structure

Look at the diagram above. Which is easier to see, the rabbit or the duck?

• Just as different persons perceive the illusion in different ways, different users will view the data in different ways.

• Database schema is therefore based on how one views the data. E.g. data can be viewed as entities with attributes or it can be viewed as groups of bits.

Definition of database schema

Database schema defines a database’s structure, its tables, relationships, domains, and business rules. Database schema is a design, the foundation on which the database and the application are built.

Explanation of the four database schemas

• Conceptual schema - consists of attributes, entities, relationships

The conceptual schema is also called the logical model, and is the basic database model, which deals with organizational structures that are used to define database structures such as tables and constraints. It represents a global view of the data: an enterprise-wide representation of data as viewed by high-level managers. This model is the basis for the identification and description of the main data objects, avoiding details. The most widely used conceptual model is the entity relationship (E-R) model. Using the E-R model yields the conceptual schema, which is, in effect, the basic database blueprint. In other words, this schema is used to design the database structure. The conceptual schema provides a relatively easily understood bird’s eye view of the data environment.

The conceptual schema is independent of both software and hardware. Software independence means that the model does not depend on the DBMS software used to implement the model. Hardware independence means that the model does not depend on the hardware used in the implementation of the model. Therefore, changes in either the hardware or the DBMS software will have no effect on the database design at the conceptual level.

• Internal schema - physical view - what analyst/programmer sees

Once a specific DBMS has been selected, the internal model adapts the conceptual model to that DBMS. In other words, the internal model requires the database designer to match the conceptual model’s characteristics and constraints to those of the selected database model. The database designer will, for example, see the specific tables in the database and know which fields are on which table. Because the internal model depends on the existence of specific database software, it is said to be software-dependent. Therefore a change in the DBMS requires that the internal model be changed to fit the DBMS’s characteristics and requirements. [e.g. currency datatype is not on all DBMSes]

The development of a detailed internal model is especially important to database designers who work with certain database models that require very precise specification of data storage location and data access paths. In contrast, the relational database model requires less detail in its internal model because most RDBMSes handle data access path definition transparently; that is, the designer need not be aware of the data access path details. Nevertheless, even relational database software usually requires data storage location specification, especially in a mainframe environment. For example, DB2 requires that the data storage group, the location of the database within the group and the location of the tables within the database be specified.

The internal model is still hardware-independent because it is unaffected by the choice of the computer in which the software is installed. Therefore a change in storage devices or even a change in operating system will not affect the internal model’s design requirements.

• External schema - applications programmer or end user view

This is based on the internal model. It is the end user’s view of the data environment or the applications interface. It deals with methods through which users may access the data (e.g. through the use of a data input form). By end users we mean the people who use the application programs as well as those who designed and implemented them. Whereas the database designer will know that fields are located on different tables, an end user may see every field on one screen (form). This user therefore views the fields as if they were on one table. The end user will not need to know that the data is separated into different tables. Some fields may also be missing from the user’s screen. The user does not necessarily need to know about these fields in order to perform his tasks.
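An SQL view is one common way to give a user such a one-table picture of data that is really spread over several tables. The sketch below uses Python's built-in sqlite3 module; the `employee` and `department` tables, the `staff_directory` view and the sample row are all assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- What the designer sees: two tables, including a salary field
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT,
    dept_id INTEGER REFERENCES department(dept_id),
    salary  REAL
);
INSERT INTO department VALUES (10, 'Accounts');
INSERT INTO employee VALUES (1, 'Mary', 10, 52000.0);

-- What the end user sees: one "table", with the salary field left out
CREATE VIEW staff_directory AS
    SELECT e.name AS name, d.dept_name AS department
    FROM employee e JOIN department d ON d.dept_id = e.dept_id;
""")

cursor = con.execute("SELECT * FROM staff_directory")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
```

The user of `staff_directory` sees only the `name` and `department` columns and has no need to know that they come from two tables, or that a salary field exists.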

• Physical schema – way data is stored on secondary storage

The lowest level of abstraction describes the way data is saved on storage media such as disks or tapes. This model requires the definition of both the physical storage devices and the physical access methods required to reach the data within those storage devices. It is both software and hardware dependent. A change to either the DBMS software or hardware would require a change to the database model.

Attributes of storage media

o Tracks

Bits (0s and 1s) are the smallest unit of data. The bits are commonly stored on tracks. On magnetic storage devices a bit is represented by a magnetized or non-magnetized spot; on optical storage devices it is represented by the presence or absence of a pit (hole) burnt in the surface.

o Sectors

Data can be grouped in blocks called sectors. A sector on a magnetic disk, for example, is in the shape of a pizza slice/wedge. A whole block is read or written at once.


o File Organization and Access Methods

When data are stored on secondary storage devices, the method of organization chosen will determine how the data can be accessed. In turn, this will affect the types of applications that can use the data, as well as the time and cost necessary to do so.

a) Sequential

With sequential file organization records are stored physically in order in the file. This can be in alphabetical or numerical order. Processing begins at the first logical record and proceeds through each record in the file until the final record has been read or written. Records cannot be inserted in the middle of the file. In order to modify a file, the changes to the original (master) file are recorded as transactions in a transaction file. The transaction file is processed and a new master file is created based on the transactions. Any type of storage device can access sequential files. Magnetic tape is a sequential access device and can only use sequential files.

Advantages

• This method can use magnetic tape which is the least expensive method of storage.

• It is the most efficient form of organization when the entire file, or most of it, must be processed at once.

• Transaction and old master files act as a backup, should the new master file be damaged or destroyed.

Disadvantages

• This method can be slow when trying to locate a record near the end of the file.

• The entire file must be processed and a new master file created even if only one record requires maintenance or updating.
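The master/transaction update cycle described for sequential files can be sketched as a single pass over two sorted lists. This is an illustration under stated assumptions: records are (key, data) pairs, transactions are (key, action, data) triples sorted by key with at most one transaction per key, and the function name and action names are hypothetical.

```python
def update_master(old_master, transactions):
    """Read the old master and the transactions in key order and write a new master."""
    new_master = []
    i = 0
    for key, action, data in transactions:
        # Copy unchanged records that come before this transaction's key.
        while i < len(old_master) and old_master[i][0] < key:
            new_master.append(old_master[i])
            i += 1
        exists = i < len(old_master) and old_master[i][0] == key
        if action == "add" and not exists:
            new_master.append((key, data))
        elif action == "change" and exists:
            new_master.append((key, data))
            i += 1
        elif action == "delete" and exists:
            i += 1  # skip the record so it is not copied
        else:
            raise ValueError(f"invalid {action} for key {key}")
    new_master.extend(old_master[i:])  # copy the rest of the file
    return new_master

old_master = [(1, "Alice"), (3, "Jane"), (4, "Mary")]
transactions = [(2, "add", "Boris"), (3, "change", "Janet"), (4, "delete", None)]
new_master = update_master(old_master, transactions)
```

Every record of the old master is read even though only three changes are made, and the old master and transaction lists are left untouched, which is why they can serve as a backup.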

b) Indexed

Each record has a unique key, and a pointer is used to locate the record in order to access it. The pointers exist in an index file (separate from the data file) and direct you to the next logical record. The records are not physically in logical order. In order to access the file sequentially, you follow the sequence of the pointers. Files can be ordered in many ways by using more than one set of pointers.

For example, Alice is the first logical record, the next record logically (alphabetically) is Boris. The pointer after Alice therefore says 4. In other words, if you want to know what the record is after Alice, go to record number 4 to find it.


Physical Rec#   Data    Index/Pointer
1               Mary    5
2               Alice   4
3               Jane    1
4               Boris   3
5               Peter   2
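The pointer chain in the table can be followed in a few lines of Python. This is only a sketch: the dictionary layout and the function name `read_in_logical_order` are illustrative, and it assumes that the first logical record (Alice, physical record 2) is known in advance and that Peter's pointer (2) simply wraps back to it.

```python
# Physical record number -> (data, pointer to the next logical record),
# copied from the table above.
records = {
    1: ("Mary", 5),
    2: ("Alice", 4),
    3: ("Jane", 1),
    4: ("Boris", 3),
    5: ("Peter", 2),
}


def read_in_logical_order(records, first):
    """Follow the pointers from `first` until the chain returns to the start."""
    order = []
    current = first
    while True:
        data, next_record = records[current]
        order.append(data)
        current = next_record
        if current == first:  # assumed wrap-around: last pointer leads back to the start
            break
    return order


names = read_in_logical_order(records, first=2)
print(names)  # ['Alice', 'Boris', 'Jane', 'Mary', 'Peter']
```

The physical order of the records never changes; a second set of pointers could present the same records in a different logical order.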

Advantages

• Data can be accessed sequentially or directly.

• No transaction files are maintained.

• If an index is lost, the data still exists.

Disadvantages

• Indexes lower efficiency.

• Indexes can be damaged, in which case the sequencing is lost.

• There is no backup of the master file. Procedures must be established to ensure the regular creation of backup files.

c) Direct (Random)

The data in this method may be organized in such a way that they are scattered throughout the disk in what may appear to be a random order. Direct access permits access to any record without the need to read other records in the file. To accomplish this, each record is uniquely identified by a key, and the key is used to calculate an address for the record. This method is known as hashing.

Hashing is a method used for determining the physical location of a record. The primary key is processed mathematically and another number is computed that represents the location where the record will be stored. When a user retrieves the record, its key is entered, and the hashing routine is used to determine where the record can be found.

The problem with hashing, however, is that different keys can sometimes produce the same number and therefore the same storage location, leading to “collisions”. The second record must then be stored in an overflow area. This reduces the efficiency of retrieval, because the search for the right record must also cover the overflow area and so becomes more complex and time-consuming.

Once accessed, a record can be read or updated. This method requires the use of direct access devices such as magnetic disk.
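A small Python sketch may make hashing and overflow concrete. Everything here is an assumption chosen for illustration: the notes do not say which hashing routine is used, so the sketch uses division-remainder hashing on a numeric key, a primary area of 7 slots, and a simple list as the overflow area.

```python
NUM_SLOTS = 7  # assumed size of the primary storage area


def hash_address(key):
    """Division-remainder hashing: one common way to turn a key into an address."""
    return key % NUM_SLOTS


primary = [None] * NUM_SLOTS  # one record per computed address
overflow = []                 # records whose computed address was already taken


def store(key, data):
    address = hash_address(key)
    if primary[address] is None:
        primary[address] = (key, data)
    else:
        overflow.append((key, data))  # collision: fall back to the overflow area


def retrieve(key):
    record = primary[hash_address(key)]
    if record is not None and record[0] == key:
        return record[1]              # found directly at the computed address
    for k, data in overflow:          # slower path: search the overflow area
        if k == key:
            return data
    return None


store(1001, "Alice")
store(1008, "Boris")  # 1008 % 7 == 1001 % 7 == 0, so this key collides
store(1003, "Jane")

print(retrieve(1008))  # Boris
print(overflow)        # [(1008, 'Boris')]
```

Retrieving key 1001 needs a single address calculation, whereas key 1008 also requires a search of the overflow area, which is the loss of efficiency described above.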

Advantages

• Data can be accessed directly and quickly.

• Files can also be processed sequentially.

• Data is easily kept up-to-date.

Disadvantages

• This is more expensive than sequential.

[Research - The levels of the ANSI/SPARC database architecture]


Entity-Relationship Diagrams

A data model is a pictorial abstraction of the contents of a database. The major function of the data model is to provide a simplified view of the database contents in a form that is easily understood by the client, the end-user, the application programmer and the database designer. The most popular diagramming technique that is used to create the data model is known as the Entity - Relationship diagram.

An Entity-Relationship Diagram (ERD), also known as the Entity Relationship Model, is a specialized graphic that shows the interrelationships between entities in a database. Entity-Relationship diagrams (ERDs) emerged in the 1970s from work by Dr. Peter Chen and others, who were looking for a means to simplify the representation of large and complex data storage concepts. The purpose of an ERD is to design a database structure. ERDs can also be used with clients to discuss business rules.

An entity is an object or event about which someone chooses to collect data. It may be a person, place, thing etc. Examples are: student, car, employee, song, customer, library book, product, patient. Entities can be thought of (roughly) as nouns. Entities are drawn as rectangles. [Research: weak entity (see footnote 2), cardinality, existence-dependent, supertype entity, subtype entity]

An entity has certain attributes. An attribute is a characteristic of an entity or it can be defined as the data collected about the entity. Examples are: name, address, sex, date of birth, eye color, title, product code, blood type etc. (The attributes equate to the fields/data items). A record would form a collection of these data items. The attribute that would uniquely identify a particular entity would be the primary key field. This field is the data item that uniquely identifies the record.

Types of relationships

A relationship is an association between entities. A relationship captures how two or more entities are related to one another. Relationships can be thought of (again, roughly) as verbs. Examples: an owns relation between a company and a computer, a supervises relation between an employee and a department, a performs relation between an artist and a song. There are three types of relationships that can exist between entities. These are:

• One-to-one (1:1) i.e. Each entity A has only one entity B. E.g. A product can have only one package.

• One-to-many (1:m) i.e. each entity A has many entity Bs. Each B has only one A. E.g. A teacher of this subject can have many students, but a student has only one teacher in the subject.

Footnote 2: A weak entity cannot be uniquely identified by its own attributes alone.


• Many-to-many (m:n) i.e. Each entity A has many entity Bs. Each B has many As. E.g. A doctor can have many patients, a patient can have many doctors.

The symbols used in an ERD

• Entity – represented by a rectangle

• Relationship – represented by a diamond or a line depending on convention

• Type of relationship – represented by a diamond with a number or lines depending on convention

• Attribute – represented by ovals outside the entity or listed inside the entity depending on convention

Sample ERDs

Convention 1 - Chen

In this convention, entities are represented by rectangles, relationships are represented by diamonds and attributes by ovals. The name of the relationship is written in the diamond. The name of the attribute is written in the oval. The oval is attached to the entity with a line. The type of relationship is represented by 1, m or n. For example, based on the 3 diagrams above there is a one-to-many relationship between Dept and Emp; there is a many-to-many relationship between Salesman and City; there is a one-to-one relationship between Office and Emp. Dept has an attribute called manager.


Convention 2 - Martin

In this convention, entities are represented by labelled rectangles. The label is the name of the entity. Entity names should be singular nouns. Relationships are represented by a solid line connecting two entities. The name of the relationship is written above the line. Relationship names should be verbs. Attributes, when included, are listed inside the entity rectangle (e.g. DeptID and ProjectID). Attributes which are identifiers are underlined. Attribute names should be singular nouns.

A “one” is represented by a single line attached to the entity and a “many” is indicated by a “crow’s foot” or three lines. The above diagram shows a one-to-many relationship (one department to many projects). Mandatory existence is represented by placing a perpendicular bar on the line next to the mandatory entity. Optional existence is represented by placing a circle on the line next to the optional entity. The diagram shows that Departments are mandatory but Projects are optional.


Example of Creating the ERD

Consider a hospital:

Patients are treated in a single ward by the doctors assigned to them. Usually each patient will be assigned a single doctor, but in rare cases they will have two.

Healthcare assistants also attend to the patients; a number of these are associated with each ward.

Initially the system will be concerned solely with drug treatment. Each patient is required to take a variety of drugs a certain number of times per day and for varying lengths of time.

The system must record details concerning patient treatment and staff payment. Some staff are paid part time and doctors and care assistants work varying amounts of overtime at varying rates (subject to grade).

The system will also need to track what treatments are required for which patients and when, and it should be capable of calculating the cost of treatment per week for each patient (though it is currently unclear to what use this information will be put).

How do we start the ERD?

1. Define Entities: these are usually nouns used in descriptions of the system, in the discussion of business rules, or in documentation; identified in the narrative (see highlighted items above).

2. Define Relationships: these are usually verbs used in descriptions of the system or in discussion of the business rules (entity ______ entity); identified in the narrative (see highlighted items above).

3. Add attributes to the relations; these are determined by the queries, and may also suggest new entities, e.g. grade; or they may suggest the need for keys or identifiers.

4. What questions can we ask?
   a. Which doctors work in which wards?
   b. How much will be spent in a ward in a given week?
   c. How much will a patient cost to treat?
   d. How much does a doctor cost per week?
   e. Which assistants can a patient expect to see?
   f. Which drugs are being used?

5. Describe the type of relationship between the entities.
   • Many-to-many must be resolved to two one-to-manys with an additional entity.
   • This usually happens automatically.
   • Sometimes it involves the introduction of a link entity (which will be all foreign key).
   • Example: Patient-Drug

6. This flexibility allows us to consider a variety of questions such as:
   a. Which beds are free?
   b. Which assistants work for Dr. X?
   c. What is the least expensive prescription?
   d. How many doctors are there in the hospital?
   e. Which patients are family related?

7. Represent that information with symbols.
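Step 5 (resolving the many-to-many Patient-Drug relationship through a link entity) can be sketched with Python's built-in sqlite3 module. The table names, column names and sample rows below are hypothetical; only the idea of a link entity whose key consists entirely of foreign keys comes from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE drug    (drug_id    INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Link entity: its key is made up entirely of foreign keys, so each side
    -- becomes a one-to-many relationship with this table.
    CREATE TABLE prescription (
        patient_id    INTEGER NOT NULL REFERENCES patient(patient_id),
        drug_id       INTEGER NOT NULL REFERENCES drug(drug_id),
        times_per_day INTEGER,
        PRIMARY KEY (patient_id, drug_id)
    );

    INSERT INTO patient VALUES (1, 'Patient A'), (2, 'Patient B');
    INSERT INTO drug VALUES (10, 'Drug X'), (20, 'Drug Y');
    INSERT INTO prescription VALUES (1, 10, 2), (1, 20, 1), (2, 10, 3);
""")

# Question 4f, "Which drugs are being used?", becomes a join through the link table.
drugs_in_use = [row[0] for row in conn.execute(
    "SELECT DISTINCT d.name FROM drug d "
    "JOIN prescription p ON p.drug_id = d.drug_id ORDER BY d.name")]
print(drugs_in_use)  # ['Drug X', 'Drug Y']
```

One patient can appear in many prescription rows and so can one drug, which is exactly the many-to-many relationship expressed as two one-to-many relationships.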


Entity and Referential Integrity

As a database designer you will discover that database integrity rules are essential if you are to create a good database design. Although some relational DBMSs automatically enforce these rules, you still need to be aware of them.

Entity integrity – this states that all records must have a primary key and the primary key value must never contain a null or undefined value. The purpose of this rule is to ensure that each record within a table has a unique identity.

Referential integrity – this states that a foreign key must either have a null value or it must have a matching primary key value in the table to which it is related. The purpose of this rule is to ensure that there are no illegal entries within the relationship tables. It also prevents us from deleting records whose primary key value has a corresponding match in a relationship table.
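Both rules can be observed in a real DBMS. The sketch below uses Python's built-in sqlite3 module with hypothetical dept and emp tables. Two SQLite-specific details are assumed: foreign keys are only enforced after `PRAGMA foreign_keys = ON`, and the primary key columns are declared `NOT NULL` explicitly so that null keys are rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.execute("CREATE TABLE dept (dept_id TEXT PRIMARY KEY NOT NULL, name TEXT)")
conn.execute("CREATE TABLE emp (emp_id TEXT PRIMARY KEY NOT NULL, "
             "dept_id TEXT REFERENCES dept(dept_id))")
conn.execute("INSERT INTO dept VALUES ('D1', 'Accounts')")
conn.execute("INSERT INTO emp VALUES ('E1', 'D1')")  # matching primary key: allowed
conn.execute("INSERT INTO emp VALUES ('E2', NULL)")  # null foreign key: allowed


def is_rejected(sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    # Entity integrity
    "null primary key": is_rejected("INSERT INTO dept VALUES (NULL, 'Ghost')"),
    "duplicate primary key": is_rejected("INSERT INTO dept VALUES ('D1', 'Copy')"),
    # Referential integrity
    "unmatched foreign key": is_rejected("INSERT INTO emp VALUES ('E3', 'D9')"),
    "delete referenced row": is_rejected("DELETE FROM dept WHERE dept_id = 'D1'"),
}
print(results)  # every entry is True: all four statements are refused
```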


ERD Exercises

Exercise 1

• Man is married to wife

• Manager manages employee

• Lecturer teaches student

• Student studies course

Exercise 2 An artist belongs to a band. The artist can make a CD if he wishes. A CD contains one or more tracks on it.

Exercise 3 Consider a construction firm: Employees belong to a particular department. Each employee is employed to perform a particular task. The system should capture the employee’s name, TRN, address and other contact information. The system should also capture information about the department such as department name, location, and supervisor. The system should capture information about the task performed by an employee such as description, date assigned, deadline date, and hourly rate.

Exercise 4 A Sales Rep serves none, one or more customers at a time. A customer can place as many orders as he would like to. An order lists one or many products. Products that are available are stored in the company warehouse.

Exercise 5 A company has several departments. Each department has a supervisor and at least one employee. Employees must be assigned to at least one, but possibly more departments. At least one employee is assigned to a project, but an employee may be on vacation and not assigned to any projects. The important data fields are the names of the departments, projects, supervisors and employees, as well as the supervisor and employee number and a unique project number.

Exercise 6 A Metropolitan Bus Company owns a number of buses. Each bus is allocated to a particular route, although some routes may have several buses. Each route passes through a number of towns. One or more drivers are allocated to each stage of a route, which corresponds to a journey through some or all of the towns on a route. Some of the towns have a garage where buses are kept and each bus is identified by the registration number and can carry different numbers of passengers, since the vehicles vary in size and can be single or double-decked. Each route is identified by a route number and information is available on the average number of passengers carried per day for each route. Drivers have an employee number, name, address, and sometimes a telephone number.

[Entities: Bus, Route, Town, Driver, Stage. Relationships: Bus-route - is serviced by / route-stage – comprises / driver-stage - is allocated / stage-town - passes-through/ route-town - passes-through / garage-town - is situated/ garage-bus - is garaged]

[Research Martin & Chen]


Functional Dependencies

Definition - Let R(A1, A2, …, An) be a relational schema (i.e. a relation/table with attributes A1 etc.), and let X and Y be subsets of {A1, A2, …, An} (we allow for the case where X and Y are composite). We say that X functionally determines Y (or Y is functionally dependent on X), written as X --> Y, if for each value of X there exists exactly one value of Y. A functional dependency allows us to use the value of one attribute and predict the value of another attribute.

Example

SUPPLIERS (name, address, item, price)

Here are 2 FDs:

• name --> address. Given a particular value of name there exists precisely one corresponding value of address.

• name + item --> price.

NB. If X is the primary key (or a candidate key) then all attributes Y of relation R must be functionally dependent on X.
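The definition can be tested against sample data. The Python sketch below checks whether a given set of rows violates an FD; the function name `fd_holds` and the SUPPLIERS rows are invented for illustration. Note that sample rows can only show that an FD fails: an FD is a rule about every possible state of the table, so passing the check on one sample does not prove it.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every value of `lhs` is paired with exactly one value of `rhs`."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:  # same X value seen earlier with a different Y
            return False
    return True


# Hypothetical SUPPLIERS rows, invented for illustration.
suppliers = [
    {"name": "Acme", "address": "Kingston", "item": "bolt", "price": 5},
    {"name": "Acme", "address": "Kingston", "item": "nut", "price": 2},
    {"name": "Bolt Co", "address": "Montego Bay", "item": "bolt", "price": 6},
]

print(fd_holds(suppliers, ["name"], ["address"]))        # True
print(fd_holds(suppliers, ["name", "item"], ["price"]))  # True
print(fd_holds(suppliers, ["item"], ["price"]))          # False: bolt has two prices
```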

Computation of Closures

Definition - Let F be the set of functional dependencies for relation R, and let X --> Y be a given functional dependency. Then F logically implies X --> Y (written F |= X --> Y) if every relation (instance r of R) that satisfies the dependencies in F also satisfies X --> Y.

Example

{A --> B, B --> C} |= A --> C.

Definition - The closure of F, F+, is the set of FDs that are logically implied by F, that is, F+ = {X --> Y : F |= X --> Y}. In other words, if we have a set of FDs, then its closure is the set of all FDs that it implies.

Closure can be used to find keys of a relation.

Definition - Consider the relational schema R(A1, A2, …, An) and the set of FDs F, and let X be a subset of {A1, A2, …, An}. Then X is a (candidate) key if:

(a) X --> {A1, A2, …, An} is in F+.

(b) X is a minimal key, that is, for no proper subset Y ⊂ X is Y --> {A1, A2, …, An} in F+.

Example


Let R(A, B, C) and F = {A --> B, B --> C}. What is the key? F |= A --> C, because every value of A gives one value of B, which in turn gives one value of C. Together with A --> B and the trivial A --> A, it follows that A --> {A, B, C}. Since A is a single attribute, it has no non-empty proper subsets. Hence A is a key.

Algorithm for finding the closure of a set of attributes

Given a set of attributes U, a set of FDs F, and a set X ⊆ U, we want to find X+, the closure of X.

Method

1. X(0) is X.

2. X(i+1) is X(i) plus (i.e. unioned with) the set of attributes A such that there is some dependency Y --> Z in F, such that A is in Z and Y ⊆ X(i). (That is, X(i) --> Z, so X(i+1) = X(i) U Z.)

Note: We will eventually reach an i such that X(i) = X(i+1); there is no need to compute beyond that point. The process also terminates if X(i) = U. If X+ = U then X determines every attribute, and X is a key provided that no proper subset of X does the same.

Example

Given relation R (city, st, zip) and nontrivial FDs city + st --> zip and zip --> city, show that city + st is a key.

Let X = {city, st}. Using the above algorithm, we have X(0) = X = {city, st}. We now look for dependencies of the form city --> Q1, st --> Q2, or city, st --> Q3 (city and st are subsets of X(0)). If all 3 exist, then X(1) = X(0) U Q1 U Q2 U Q3. There is one such dependency, namely city, st --> zip. Hence X(1) = X(0) U {zip} = {city, st} U {zip} = {city, st, zip}. But {city, st, zip} = U. Hence X+ = U. Neither city nor st alone is the left-hand side of any FD, so no proper subset of X determines U, and X = {city, st} is a key.
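The closure method translates almost line for line into Python. In this sketch each FD is written as a pair of sets (left-hand side, right-hand side); that representation and the function name `closure` are choices made for the example, not part of the notes.

```python
def closure(attrs, fds):
    """Return X+ for the attribute set `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)  # X(0) = X
    changed = True
    while changed:       # stop once X(i+1) == X(i)
        changed = False
        for lhs, rhs in fds:
            # Y --> Z applies when Y is contained in X(i); add any new attributes of Z.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


U = {"city", "st", "zip"}
F = [({"city", "st"}, {"zip"}), ({"zip"}, {"city"})]

print(closure({"city", "st"}, F) == U)  # True: {city, st} determines every attribute
print(closure({"city"}, F))             # {'city'}
print(closure({"st"}, F))               # {'st'}
```

The last two lines confirm minimality for the worked example: neither city nor st on its own has closure U.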

Closure Exercises

Exercise 1: By computing its closure, show that (i) st, zip is a key, (ii) city, zip is not a key.

Exercise 2: Given Supplier (name, address, item, price) and F = {name --> address, name + item --> price}. Show that name + item is a key. Determine whether address, price is a key.

Exercise 3: Given R(name, job, dept) with FDs job, dept --> name and name --> dept. Determine whether job, dept or dept, name or job, name is a key.


Armstrong’s Axioms

These are rules used to determine/generate dependencies from other dependencies. Note: Armstrong’s axioms are sound and complete. They are sound because they do not generate any incorrect dependencies. They are complete because all FDs implied by F can be derived from F using the axioms. Given a relational scheme R, a set of attributes U and a set of functional dependencies F, the axioms are as follows:

Reflexivity

If Y ⊆ X ⊆ U, then X --> Y is logically implied by F.

e.g. if X = name + item and Y = item (i.e. Y ⊆ X) then name + item --> item.

Augmentation

If X --> Y holds, and Z ⊆ U, then XZ --> YZ.

e.g. if item --> price then item + name --> price + name.

Transitivity

If X --> Y and Y --> Z, then X --> Z.

Examples

Given the relation R (city, st, zip) and nontrivial FDs

city + st --> zip
zip --> city

show that both city + st and st + zip are keys for R.

(a) zip --> city (given)
(b) zip st --> city st (augmentation using (a))
(c) city st --> zip (given)
(d) city st --> city st zip (augmentation using (c))
(e) st zip --> city st zip (transitivity using (b) and (d))

Hence from (d) and (e) both city st and st zip are keys for R. [Both determine all fields]
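The derivation (a) to (e) can be replayed mechanically. In the Python sketch below an FD is a pair of frozensets, and the helper names `fd`, `augment` and `transitive` are invented for the example; each helper applies one axiom exactly as stated above.

```python
def fd(lhs, rhs):
    """Build an FD from space-separated attribute names, e.g. fd("city st", "zip")."""
    return (frozenset(lhs.split()), frozenset(rhs.split()))


def augment(dep, attrs):
    """Augmentation: from X --> Y derive XZ --> YZ."""
    x, y = dep
    z = frozenset(attrs.split())
    return (x | z, y | z)


def transitive(dep1, dep2):
    """Transitivity: from X --> Y and Y --> Z derive X --> Z."""
    (x, y), (y2, z) = dep1, dep2
    if y != y2:
        raise ValueError("right side of the first FD must equal left side of the second")
    return (x, z)


a = fd("zip", "city")      # (a) given
b = augment(a, "st")       # (b) zip st --> city st
c = fd("city st", "zip")   # (c) given
d = augment(c, "city st")  # (d) city st --> city st zip
e = transitive(b, d)       # (e) st zip --> city st zip

print(e == fd("st zip", "city st zip"))  # True
```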

EXERCISE

Given R(TRN, Name, Age, Addr, Year) and FDs

TRN, Addr --> Age
Age --> Year

use Armstrong’s Axioms to come up with other FDs.


Covers and their role in determining redundant FDs

If F and G are sets of dependencies, then F is equivalent to G if F+ = G+. In that case we say that F covers G (and G covers F). To test whether F and G are equivalent, we must show that every dependency in F is in G+ and that every dependency in G is in F+. In designing databases we ensure that the set of functional dependencies for a given schema is minimal, that is, that there are no redundant dependencies. We say that a set of dependencies F is a minimal cover, Fm, if:

1. Every right hand side of a dependency in F is a single attribute. If any r.h.s. has more than 1 attribute, then split it.

2. For no X --> A in F is the set F - {X --> A} equivalent to F. That is, no dependency in F is redundant.

3. For no X --> A in F and proper subset Z of X is (F - {X --> A}) U {Z --> A} equivalent to F. That is, no attribute on the l.h.s. of any FD in F is redundant.

Note: Every set of dependencies F is equivalent to a set Fm that is minimal.

Example

Consider the set F = {A-->B, B-->A, B-->C, A-->C, C-->A}. A minimal cover, found by eliminating the dependencies B-->A and A-->C, is given by Fm = {A-->B, B-->C, C-->A}.

Algorithm to find redundant FDs.

1. Choose an FD, say X --> Y, and remove it from the set of FDs.

2. result = X;
   while (result changes and Y is not contained in result) do
       for each FD, A --> B, remaining in the reduced set of FDs
           if A is a subset of result then result = result U B
   end

3. If Y is a subset of result then the FD X --> Y is redundant.
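The same three steps are sketched in Python below, using the set F from the minimal cover example. Representing each FD as a pair of sets and naming the function `is_redundant` are assumptions made for the illustration.

```python
def is_redundant(target, fds):
    """Return True if `target`, an (lhs, rhs) pair, follows from the other FDs in `fds`."""
    lhs, rhs = target
    remaining = [dep for dep in fds if dep != target]  # step 1: remove X --> Y
    result = set(lhs)                                  # step 2: result = X
    changed = True
    while changed and not rhs <= result:
        changed = False
        for a, b in remaining:
            if a <= result and not b <= result:
                result |= b
                changed = True
    return rhs <= result                               # step 3: is Y contained in result?


F = [
    ({"A"}, {"B"}), ({"B"}, {"A"}), ({"B"}, {"C"}),
    ({"A"}, {"C"}), ({"C"}, {"A"}),
]
Fm = [({"A"}, {"B"}), ({"B"}, {"C"}), ({"C"}, {"A"})]

print(is_redundant(({"A"}, {"C"}), F))         # True: A --> B and B --> C already give A --> C
print(any(is_redundant(dep, Fm) for dep in Fm))  # False: nothing in Fm can be dropped
```

Which FDs end up being removed depends on the order in which they are tested, so a set of FDs can have more than one minimal cover.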

Exercises - Find the redundant FDs in the following sets:

a) Colour -> Density, Density -> Elasticity, Colour -> Elasticity

b) Lime --> Melon, Lime --> Naseberry, Naseberry Melon --> Orange, Lime Melon --> Naseberry

c) Name -> Addr, name, item -> price, name -> price

d) Name -> id, Id -> name, Name, id -> dept, Id -> dept

e) Flight# -> destination, Origin, destination -> flight#, destination, arrival time -> flight#, origin, flight# -> origin


1st, 2nd, 3rd Normal Forms

Definition - An attribute of relation R is a prime (sometimes called key) attribute if it participates in a key. Example: If A + B + C is a key for a relation R, then attributes A, B, and C are prime attributes.

Definition - Normalization is a process of obtaining stable groupings of attributes into relations. It is a process of decomposing a table into smaller, simpler tables. In addition to being simpler and more stable, normalized tables are more easily maintained. Normalization is the process of eliminating data redundancies and data anomalies from table structures by applying various rules called normal forms. Normalization organizes a database into one of several forms to remove ambiguous relationships between data and minimize data redundancy. In zero normal form (0NF), the database is completely non-normalized/unnormalized, and all of the data fields are included in one relation or table. The table has large rows due to the repeating groups and wastes disk space. There is also at least one value that is not atomic (that is, it can be decomposed further).

Note: When you break down a table into simpler tables always ensure that there is a common field that you will be able to use to join the tables back together for queries. The normalization process starts with unnormalized relations, where at least one value is not atomic (that is, it can be decomposed further).

Example

S#    PQ
      P#    QTY
S1    P1    300
      P2    200
      P3    400
      P4    200
S2    P1    300
      P2    400
S3    P2    200

The field PQ can be broken down into P# (part number) and QTY (quantity). Other examples include:

• NAME can be broken down into FIRSTNAME, MIDDLENAME, LASTNAME.

• ADDRESS can be broken down into STREET#, STREET, CITY, ZIPCODE etc.

Definition - A relation is in first normal form (1NF) if every attribute is a simple (atomic) attribute.

A table or relation is in first normal form (1NF) if every field is a simple (atomic) field. A simple, atomic field is one that cannot be broken down further. A table is also in first normal form (1NF) if it contains no repeating groups.

Note: Every normalized table is in 1NF. 1NF violations cause data redundancy, which may lead to data inconsistencies, poor data integrity, wastage of space, data anomalies etc. To convert to 1NF you should break down fields to their simplest form and move any repeating groups into another table.

Example: We can convert the above to 1NF as follows:

S#    P#    QTY
S1    P1    300
S1    P2    200
S1    P3    400
S1    P4    200
S2    P1    300
S2    P2    400
S3    P2    200
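The conversion from the unnormalized table to 1NF amounts to flattening the repeating group. A minimal Python sketch, assuming the unnormalized relation is held as a list of (S#, repeating group) pairs; the variable and function names are illustrative:

```python
# Unnormalized relation: each supplier carries a repeating group of (P#, QTY) pairs.
unnormalized = [
    ("S1", [("P1", 300), ("P2", 200), ("P3", 400), ("P4", 200)]),
    ("S2", [("P1", 300), ("P2", 400)]),
    ("S3", [("P2", 200)]),
]


def to_first_normal_form(relation):
    """Emit one atomic (S#, P#, QTY) row per entry in each repeating group."""
    return [(s, p, qty) for s, group in relation for p, qty in group]


first_nf = to_first_normal_form(unnormalized)
for row in first_nf:
    print(*row)  # prints the seven rows of the 1NF table, starting with: S1 P1 300
```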

Definition - A relation is in second normal form (2NF) if:

a) it is in 1NF and

b) it has no partial dependencies of nonprime (nonkey) attributes on keys. That is, every nonprime attribute is fully dependent on the primary key. [Primary key --> all attributes.]

[NB. An attribute of relation R is a prime (sometimes called key) attribute if it participates in a key. Example: If A + B + C is a key for a table R, then attributes A, B, and C are prime attributes.]

Example

Cars (model, cylinder#, origin, tax, fee). The key is model + cylinder#, and there is an FD model --> origin. Model and cylinder# are prime; origin, tax and fee are nonprime. Table Cars is not in 2NF because origin is nonprime and depends on model alone, so it is not fully dependent on the key model + cylinder#.

Definition - A relation is in 3rd normal form (3NF) if:

a) it is in 2NF and

b) it has no transitive dependencies of nonprime attributes on keys (i.e. A --> B, B --> C means A --> C).

Example

employee (emp#, dept, location). Suppose emp# is a key. Employee is not in 3NF because there is a transitive dependency of location on the key, emp#: emp# --> dept and dept --> location, therefore emp# --> location.

In order to convert to 3NF we need to remove or break up the transitive dependencies. Example: if X --> Y and Y --> Z then X and Y will remain on one table with X being the key, and Y and Z would be on the other table with Y being the key. All fields dependent on X would be on one table and all fields dependent on Y would be on another table. Y will be the common field that will be used to join the tables for the running of queries.

From the Employee table above, we would therefore place emp# and dept on one table (key emp#) and dept and location on another table (key dept).

[Research 4NF, BCNF]

Comprehensive example (1NF to 3NF)

1NF

S#    Status    City      P#    Qty
S1    20        LONDON    P1    300
S1    20        LONDON    P2    200
S1    20        LONDON    P3    400
S1    20        LONDON    P4    200
S1    20        LONDON    P5    100
S1    20        LONDON    P6    100
S2    10        PARIS     P1    300
S2    10        PARIS     P2    400
S3    10        PARIS     P2    200
S4    20        LONDON    P2    200
S4    20        LONDON    P4    300
S4    20        LONDON    P5    400

S# + P# is a key, S# --> status, S# --> city, city --> status

Problems

• We cannot insert the fact that a supplier is in a given city until he supplies at least one part

• Deletion of a row for a given supplier destroys additional info

• Redundancy can result in long searches and inconsistency (if change one row have to make same change in another). Example: Suppose supplier S1 changes status to 30, then all 6 rows would have to be modified

2NF

To change to 2NF we ensure that everything is fully dependent on the key. Fields that are not fully dependent on the key should be moved to a separate table. Only fields fully dependent on the key should remain in the original table. The fields status and city should therefore be placed in their own table and the key for that table is the field that they are functionally dependent on (which is S#).


S#    Status    City
S1    20        LONDON
S2    10        PARIS
S3    10        PARIS
S4    20        LONDON
S5    30        ATHENS

S#    P#    Qty
S1    P1    300
S1    P2    200
S1    P3    400
S1    P4    200
S1    P5    100
S1    P6    100
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S4    P4    300
S4    P5    400

Note - We can now enter the fact that supplier S5 is located in Athens.

Problems

• We cannot enter the fact that a given city has a given status until a supplier is located in that city.

• If we delete the only row for a city we destroy the fact that a city has a given status value.

• Status value occurs many times. Hence search and consistency problems.

3NF

Remove transitive dependencies: S# --> city and city --> status.

S#    P#    Qty
S1    P1    300
S1    P2    200
S1    P3    400
S1    P4    200
S1    P5    100
S1    P6    100
S2    P1    300
S2    P2    400
S3    P2    200
S4    P2    200
S4    P4    300
S4    P5    400

S#    City
S1    LONDON
S2    PARIS
S3    PARIS
S4    LONDON
S5    ATHENS

City      Status
ATHENS    30
LONDON    20
PARIS     10
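One way to check a decomposition like this is to project the 1NF rows onto the three 3NF schemas and then join the projections back together on their common fields. The Python sketch below does that with plain sets; the variable names are illustrative. Supplier S5 does not appear in it, because S5 could not be recorded in the 1NF table in the first place.

```python
first_nf = [  # (S#, Status, City, P#, Qty) rows from the 1NF table
    ("S1", 20, "LONDON", "P1", 300), ("S1", 20, "LONDON", "P2", 200),
    ("S1", 20, "LONDON", "P3", 400), ("S1", 20, "LONDON", "P4", 200),
    ("S1", 20, "LONDON", "P5", 100), ("S1", 20, "LONDON", "P6", 100),
    ("S2", 10, "PARIS", "P1", 300), ("S2", 10, "PARIS", "P2", 400),
    ("S3", 10, "PARIS", "P2", 200), ("S4", 20, "LONDON", "P2", 200),
    ("S4", 20, "LONDON", "P4", 300), ("S4", 20, "LONDON", "P5", 400),
]

# Project onto the three 3NF schemas; sets remove the duplicated rows.
shipment = {(s, p, qty) for s, _, _, p, qty in first_nf}              # key S# + P#
supplier_city = {(s, city) for s, _, city, _, _ in first_nf}          # key S#
city_status = {(city, status) for _, status, city, _, _ in first_nf}  # key City

# Rejoin on the common fields S# and City.
city_of = dict(supplier_city)
status_of = dict(city_status)
rejoined = {(s, status_of[city_of[s]], city_of[s], p, qty) for s, p, qty in shipment}

print(len(supplier_city), len(city_status))  # 4 2
print(rejoined == set(first_nf))             # True: no information was lost
```

Each status value is now stored once per city rather than once per shipment, so changing the status of a city means changing a single row.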


Another example of the process. Repeating groups are listed in parentheses (part a). The table has large rows due to the repeating groups and wastes disk space when an order has only one item.

How do you identify repeating groups? Consider the Order table. For every order there will only be one order number and date. Other items will be repeated. For example, for a particular order we will have more than one product number, product name, quantity ordered etc. This is because you are able to order more than one thing. Thus Product # to Vendor Name is shown twice to facilitate two products.

To normalize the data from 0NF to 1NF (first normal form), you remove the repeating groups (fields 3 through 7 and 8 through 12) and place them in a second table (part b). You then assign a primary key to the second table (Line Item), by combining the primary key of the nonrepeating group (Order #) with the primary key of the repeating group (Product #). Primary keys are underlined to distinguish them from other fields.

To further normalize the database from 1NF to 2NF (second normal form), you remove partial dependencies. A partial dependency exists when fields in the table depend on only part of the primary key. In the Line Item table (part b), Product Name is dependent on Product #, which is only part of the primary key. Second normal form requires you to place the product information in a separate Product table to remove the partial dependency (part c).

To move from 2NF to 3NF (third normal form), you remove transitive dependencies. A transitive dependency exists when a nonprimary key field depends on another nonprimary key field. As shown in part c, Vendor Name is dependent on Vendor #, both of which are nonprimary key fields. If Vendor Name is left in the Line Item table, the database will store redundant data each time a product is ordered from the same vendor. Third normal form requires Vendor Name to be placed in a separate Vendor table, with Vendor # as the primary key. The field that is the primary key in the new table (in this case, Vendor #) also remains in the original table as a foreign key and is identified by a dotted underline (part d).

In 3NF, the database is now well organized into four separate tables and is easier to maintain. For instance, to add, delete, or change a Vendor or Product Name, you make the change in just one table.

Order Table

Order #  Order Date  Product #  Product Name      Qty Ordered  Vendor #  Vendor Name  Product #  Product Name    Qty Ordered  Vendor #  Vendor Name
1001     6/8/2004    605        White Copy Paper  2            321       Hammermill   203        CD Jewel Cases  5            110       Fellowes
1002     6/10/2004   751        Ballpoint pens    6            166       Pilot
1003     6/10/2004   321        Ring Binder       12           450       Globe
1004     6/11/2004   605        White Copy Paper  2            321       Hammermill   102        File Folders    2            450       Globe


Order

Order #  Order Date
1001     6/8/2004
1002     6/10/2004
1003     6/10/2004
1004     6/11/2004

Line Item

Order #  Product #  Qty Ordered  Vendor #
1001     605        2            321
1001     203        5            110
1002     751        6            166
1003     321        12           450
1004     605        2            321
1004     102        2            450

Product

Product #  Product Name
102        File Folders
203        CD Jewel Cases
321        Ring Binder
605        White Copy Paper
751        Ballpoint pens

Vendor

Vendor #  Vendor Name
110       Fellowes
166       Pilot
321       Hammermill
450       Globe

a) Zero Normal Form (0NF)

(Order #, Order Date, (Product #, Product Name, Quantity Ordered, Vendor #, Vendor Name))

b) First Normal Form (1NF)

Order (Order #, Order Date)
Line Item (Order # + Product #, Product Name, Quantity Ordered, Vendor #, Vendor Name)

c) Second Normal Form (2NF)

Order (Order #, Order Date)
Line Item (Order # + Product #, Quantity Ordered, Vendor #, Vendor Name)
Product (Product #, Product Name)

d) Third Normal Form (3NF)

Order (Order #, Order Date)
Line Item (Order # + Product #, Quantity Ordered, Vendor #)
Product (Product #, Product Name)
Vendor (Vendor #, Vendor Name)
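The 3NF design in part d can be tried out with Python's built-in sqlite3 module. The SQL table and column names below are adapted from the schema for illustration (for example, the Order table is called orders because ORDER is a reserved word in SQL), and only order 1001 is loaded. The join shows how the common fields bring the four tables back together for a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders  (order_no INTEGER PRIMARY KEY, order_date TEXT);
    CREATE TABLE product (product_no INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE vendor  (vendor_no INTEGER PRIMARY KEY, vendor_name TEXT);
    CREATE TABLE line_item (
        order_no    INTEGER REFERENCES orders(order_no),
        product_no  INTEGER REFERENCES product(product_no),
        qty_ordered INTEGER,
        vendor_no   INTEGER REFERENCES vendor(vendor_no),  -- foreign key kept in Line Item
        PRIMARY KEY (order_no, product_no)
    );
    INSERT INTO orders VALUES (1001, '6/8/2004');
    INSERT INTO product VALUES (605, 'White Copy Paper'), (203, 'CD Jewel Cases');
    INSERT INTO vendor VALUES (321, 'Hammermill'), (110, 'Fellowes');
    INSERT INTO line_item VALUES (1001, 605, 2, 321), (1001, 203, 5, 110);
""")

# Rebuild the original order view by joining on the common fields.
rows = conn.execute("""
    SELECT o.order_no, o.order_date, p.product_name, li.qty_ordered, v.vendor_name
    FROM line_item li
    JOIN orders  o ON o.order_no   = li.order_no
    JOIN product p ON p.product_no = li.product_no
    JOIN vendor  v ON v.vendor_no  = li.vendor_no
    ORDER BY p.product_no
""").fetchall()

for row in rows:
    print(row)
```

Changing a vendor's name now means updating one row in the vendor table, and every order line that refers to that vendor picks up the change through the join.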


Normalization Exercises to 3NF.

Exercise 1 - PatientDrug Table Structure

(The columns Drug through Side Effect form a repeating group within each patient row; a patient's second drug is shown here on a continuation line.)

PatientID  Patient Name  Drug       Trade Name     Formulation  Size   Dose  Frequency        Side Effect
9876765    Brown, Karen  Triceptan  Tegretol       Tablets      100mg  30mg  Once a day       Stomach Cramps
                         Hatceptan  Smithcline     Capsules     200mg  30mg  Once a day
7654433    Green, Ann    Tavegyl    Antihistamine  Liquid       200ml  10ml  Twice a day      Headache
9876567    Dunn, Mary    Clidets    Cyomistin      Ointment     100ml  2ml   Every two hours  Kidney damage
8768888    Allen, Oscar  Ventolin   Inhalador      Gas          20oz   1oz   Once a day
                         Panadeine  PanadolET      Tablets      100mg  5mg   Twice a day      Indigestion
9877771    Jones, Bob    Panadeine  PanadolET      Tablets      100mg  5mg   Twice a day      Indigestion
6512334    Harris, Kay   Tavegyl    Antihistamine  Liquid       200ml  10ml  Twice a day      Headache

The key is PatientID & Drug.

The FDs are:

PatientID --> PatientName
PatientID, Drug --> Dose
PatientID, Drug --> Frequency
Drug --> TradeName
Drug --> Formulation
Drug --> Size
Drug --> SideEffect
Size --> SideEffect


Exercise 2 – Sales Table

Salesman# | Salesman Name | Sales Area | Customer# | Customer Name | Warehouse# | Warehouse Location | Sales Amount
3462 | Walters, Kevin | West | 18765 | Delta Services | 4 | Fargo | 13,540
3462 | Walters, Kevin | West | 18830 | Levy & Sons | 3 | Bismarck | 10,600
4578 | Allen, Ian | East | 32112 | Johnsons | 5 | Goshen | 14,800
1111 | Matthews, Joan | West | 98787 | Facey | 4 | Fargo | 45,890
1111 | Matthews, Joan | West | 98799 | Webster’s Inc | 7 | Portsmouth | 34,877
6765 | Brown, Johnathan | North | 87889 | Taino Limited | 2 | Ferry | 40,000

The key is Salesman# and Customer#.

The FDs are:
Salesman# --> Salesman Name
Salesman# --> Sales Area
Customer# --> Customer Name
Customer# --> Warehouse#
Customer# --> Warehouse Location
Salesman#, Customer# --> Sales Amount
Warehouse# --> Warehouse Location


Assessment of file layouts as they affect the functioning of a database.

It is important to evaluate the performance characteristics of the physical model before implementing the database. Once the database is installed it is difficult or impossible to redesign it. The performance parameters normally used are the space estimates and time estimates. Both of these parameters are predictable. The database designer should therefore try to optimize the physical model for space and time considerations.

Note the trade-off between space and time: I/O can be reduced if some redundant data is carried, whereas eliminating redundant data saves space but can cost more access time.

Physical and logical data organization.

Logical organization compared with physical organization:

Logical: Simplicity is important.
Physical: Complex organizations may be important. Software hides the complexity.

Logical: Data independence is of prime importance. (This gives the DBA the freedom to change both the physical and logical aspects of the database system without disturbing the applications built on the database.)
Physical: Data independence is of little concern if facilities are provided for restructuring the physical data.

Logical: Application program requests correspond to the logical data structure. The program does not care about the physical layout of data.
Physical: Application program requests are usually unrelated to data storage.

Logical: Efficient use of storage is of little concern, e.g. 1 file vs 2 files.
Physical: Efficient use of storage is of major concern.

Logical: A high level of redundancy often exists between logical files.
Physical: Elimination of redundancy is an objective of physical organization.

Logical: Logical organization must be stable so that programs do not have to be rewritten.
Physical: Physical layout may be changeable, designed for periodic reorganization.

Logical: The means of finding/addressing data does not have a major effect on logical structures.
Physical: Addressing techniques have a major effect on physical storage layout. Methods of locating data depend on how data is physically laid out.

Logical: E.g. Name, Id#, address, subject, grade
Physical: E.g. Name, Id#, address in one file; Id#, subject, grade in another file


UNIT III: INTRODUCTION TO RELATIONAL ALGEBRA AND SQL

The languages used in database systems

A 4GL (4th generation language) is a product that aids the development of new systems. 4GLs are so called because they work at a higher level than conventional high-level languages such as COBOL or Pascal. Most 4GLs make use of relational databases, which themselves have query languages that perform operations at a very high level. Some 4GLs are in fact a combination of a database query language and other facilities.

Features of a 4GL

• Define data

• Define what processing must be performed on the data

• Define report or screen format

• Define input data and validation checks

• Handle user queries

The role of Relational DMLs and DDLs.

Some databases have their own computer languages associated with them, which allow the user to access and retrieve data. Other databases are only accessed via third generation languages.

Data descriptions must be standardized. For this reason a Data Description Language (DDL) is provided, which must be used to specify the data in the database. Similarly, a Data Manipulation Language (DML) is provided, which must be used to access the data. The combination of the DDL and DML is often called a Data Sub-Language (DSL) or a query language.

Data Definition Language - The DDL is that portion of the DBMS which allows us to create and modify the structure of the database and the database tables. The functions of a DDL may therefore include:

• Creating database structures
• Creating table structures
• Associating fields with table structures
• Associating data types with field structures, etc.

Data Manipulation Language - The DML is that portion of the DBMS which allows us to store, modify, and retrieve data from the database. There are two types of DML: procedural and nonprocedural.

• Procedural DMLs require that the user specify the data that is needed from the database and how to obtain it.

Procedural DMLs are more difficult to use since they require that the user be proficient in using the language commands to manipulate the structure and the contents of the data file. On the other hand they are more flexible since they allow the user to determine the method that is used for accessing and manipulating the structure and contents of a file.

• Nonprocedural DMLs require that the user specify the data that is needed from the database, but do not allow the user to say how to obtain it.

Nonprocedural DMLs are easier to use since they do not require a detailed knowledge of the language commands needed to manipulate the structure and the contents of a data file. On the other hand they lack flexibility since the programmer has no way of determining the method for accessing and manipulating the contents of the data file. Please note that it is the nonprocedural DML of a 4th Generation Language that allows it to exhibit structural and data independence.

Query Language

The implementation of a query language is vital for a DBMS. The query language allows the end user to generate ad hoc queries, which are answered immediately. In most DBMSs the DML and the query language are one and the same. Today, many DBMSs also provide support for a standardized query language that may be different from the product's own DML. This is known as the Structured Query Language (SQL).

The difference between relational algebra and relational calculus.

Query languages can roughly be divided into two types:

• Relational algebra - allows the user to explicitly describe how to find the answer to the query. It uses specific operators that are applied to tables. The operators include join, projection, selection, union and set difference.

• Relational calculus - queries describe a desired set of tuples by specifying a predicate the tuples must satisfy. The user describes the answer but does not give the algorithm for finding it. It provides a notation for formulating the definition of the desired relation.


Relational algebra

Relational Algebra is:
• the formal description of how a relational database operates
• an interface to the data stored in the database itself
• the mathematics which underpins SQL operations

This section uses the sample tables below along with others to demonstrate how to solve relational algebra problems.

A          B          R        S
a b c      b e d      a x      x
d a f      d a f      q y      z
c b d                 a z

Simple projection

Πx,y (A) = Produces output showing only certain attributes (x, y) of table A.

Selection

σ x = 7 (A) Produces a subset of rows that satisfy a criterion (field x = 7). Please note that projection and selection can be combined.

Πx,y ( σ x = 7 (A) ) OR σ x = 7 ( Πx,y (A) )

Difference (or Set Difference)

A - B = rows in A but not in B.
Result: abc, cbd

Renaming

A rename is a unary operation written as ρa / b(R) where the result is identical to R except that the b field in all tuples is renamed to an a field. This is simply used to rename the attribute of a relation or the relation itself.

Union

For relations with the same arity (number of attributes), A ∪ B = all rows appearing in A or in B (or in both), without repeating duplicates.
Result: abc, daf, cbd, bed

Intersection

A ∩ B = Builds a relation consisting of all tuples appearing in both relations.
Result: daf

Division

Takes two relations, one binary and one unary, and builds a relation consisting of all values of one attribute of the binary relation that match (in the other attribute) all values in the unary relation. R divided by S is found by matching x to x and z to z. Answer = a (from the other field), because a is the only value paired with both x and z.
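The operators described so far can be tried out on the sample tables A, B, R and S using Python sets of tuples. This is a minimal sketch for checking answers by hand: the helper names project, select and divide are illustrative assumptions, and attributes are addressed by position rather than by name.

```python
# Sample relations from the notes, written as sets of tuples.
A = {("a", "b", "c"), ("d", "a", "f"), ("c", "b", "d")}
B = {("b", "e", "d"), ("d", "a", "f")}
R = {("a", "x"), ("q", "y"), ("a", "z")}   # binary relation
S = {("x",), ("z",)}                       # unary relation


def project(rel, *positions):
    """Projection: keep only the attributes at the given positions."""
    return {tuple(t[i] for i in positions) for t in rel}


def select(rel, predicate):
    """Selection: keep only the tuples that satisfy the predicate."""
    return {t for t in rel if predicate(t)}


def divide(binary, unary):
    """Division: first-attribute values paired with every value in the unary relation."""
    needed = {t[0] for t in unary}
    candidates = {first for first, _ in binary}
    return {
        (c,)
        for c in candidates
        if needed <= {second for first, second in binary if first == c}
    }


print(A - B)          # difference
print(A | B)          # union
print(A & B)          # intersection
print(divide(R, S))   # division
```

Because Python sets never hold duplicates, union automatically drops the repeated row daf, which matches the relational definition.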


Another Example of Division

Join (natural, equi, inner, outer)

A ⋈ B = Builds a relation consisting of all possible concatenated pairs of tuples, one from each of the two relations, such that each pair satisfies the join condition.

• Natural join - matches tuples on equal values of the common field(s) and does not repeat the common field in the result.

• Equi-join - matches on equality like the natural join, but keeps both copies of the common field.

• θ-join - matches tuples using a general condition (for example <, ≤, ≥ or ≠) rather than equality only.

• Outer join - also includes rows that have no match in the other table, with nulls in the missing attributes. There are three forms of the outer join, depending on which data is to be kept.

o LEFT OUTER JOIN - keep all data from the left-hand table
o RIGHT OUTER JOIN - keep all data from the right-hand table
o FULL OUTER JOIN - keep all data from both tables

• Inner (regular) join - the opposite of the outer join: only rows that have a match are kept.

• Semi-join - similar to the natural join and written as R ⋉ S where R and S are relations. The result of the semi-join is only the set of all tuples in R for which there is a tuple in S that is equal on their common attribute names.

• The antijoin, written as R ▷ S where R and S are relations, is similar to the natural join, but the result of an antijoin is only those tuples in R for which there is NOT a tuple in S that is equal on their common attribute names.
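The differences between these joins are easier to see on a small example. The sketch below is illustrative only: the employees and departments relations and all function names are made up for the purpose, each relation is a non-empty list of dictionaries, and the common attribute names are worked out from the first row of each relation.

```python
# Hypothetical sample relations (not from the notes).
employees = [
    {"name": "Harry", "dept": "Finance"},
    {"name": "Sally", "dept": "Sales"},
    {"name": "Tim", "dept": "Executive"},
]
departments = [
    {"dept": "Finance", "manager": "George"},
    {"dept": "Sales", "manager": "Harriet"},
    {"dept": "Production", "manager": "Charles"},
]


def common_attrs(r, s):
    """Attribute names shared by both relations (assumes non-empty relations)."""
    return sorted(set(r[0]) & set(s[0]))


def matches(t, u, attrs):
    """True when tuples t and u agree on every common attribute."""
    return all(t[a] == u[a] for a in attrs)


def natural_join(r, s):
    attrs = common_attrs(r, s)
    return [{**t, **u} for t in r for u in s if matches(t, u, attrs)]


def semijoin(r, s):
    attrs = common_attrs(r, s)
    return [t for t in r if any(matches(t, u, attrs) for u in s)]


def antijoin(r, s):
    attrs = common_attrs(r, s)
    return [t for t in r if not any(matches(t, u, attrs) for u in s)]


def left_outer_join(r, s):
    attrs = common_attrs(r, s)
    extra = [a for a in s[0] if a not in attrs]
    result = []
    for t in r:
        partners = [u for u in s if matches(t, u, attrs)]
        if partners:
            result.extend({**t, **u} for u in partners)
        else:
            # No match: keep the left tuple and pad the right-hand attributes with None (null).
            result.append({**t, **{a: None for a in extra}})
    return result


for row in left_outer_join(employees, departments):
    print(row)
```

Here the semi-join and antijoin both return rows of employees only, whereas the natural and outer joins return combined rows that also carry the manager attribute.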


Example of Natural Join

Example of θ-join

Consider tables Car and Boat which list models of cars and boats and their respective prices. Suppose a customer wants to buy a car and a boat, but she doesn't want to spend more money for the boat than for the car. The θ-join on the relation CarPrice ≥ BoatPrice produces a table with all the possible options.

Example of a semijoin

Example of Left Outer Join


Example of Right Outer Join

Example of Full Outer Join

Example of an antijoin


Cartesian product.

The Cartesian Product is also an operator which works on two sets. It is sometimes called the CROSS PRODUCT or CROSS JOIN. It combines the tuples of one relation with all the tuples of the other relation.
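As a quick check of the definition, the Cartesian product of the sample tables A and B from earlier in this unit can be computed with Python's standard itertools.product. This is only a sketch; the variable name cross is an assumption.

```python
from itertools import product

A = [("a", "b", "c"), ("d", "a", "f"), ("c", "b", "d")]
B = [("b", "e", "d"), ("d", "a", "f")]

# Every tuple of A is concatenated with every tuple of B.
cross = [a + b for a, b in product(A, B)]

for row in cross:
    print(row)
```

The result always has (rows in A) × (rows in B) rows and (attributes in A) + (attributes in B) columns, here 3 × 2 = 6 rows of 6 attributes each.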

Cartesian Product Example


Relational Algebra Exercises

Exercise 1

Key Club Table

IdNumber | Firstname | Lastname | Age | Sex
452145 | John | Jones | 18 | M
785475 | Heather | Coombs | 22 | F
745874 | Michelle | Gentles | 20 | F
745888 | Keith | Smith | 25 | M
888999 | Ingrid | Harris | 30 | F

Student Council Table

IdNumber | Firstname | Lastname | Age | Sex
785475 | Heather | Coombs | 22 | F
745874 | Michelle | Gentles | 20 | F
362121 | Philip | Cameron | 19 | M

Math Grades Table

IdNumber | Grade
452145 | 56
785475 | 99
745874 | 82
745888 | 65
888999 | 70

Scholarship Grades Table

Grade
99
82

i) Key Club ∪ Student Council

ii) Key Club ∩ Student Council

iii) Key Club - Student Council

iv) ∏ Firstname, Age (Key Club)

v) ∏ Firstname, Lastname (σ Age < 21 (Student Council))

vi) Math Grades ÷ Scholarship Grades


Exercise 2 - Dec 2001 Past Paper Question 5

Given the files below, give the results for the relational algebra. [20 marks]

ICEPStudents

Idnumber | Name | ClassCode
5 | Karen Henry | 2S
9 | Crystal Adobe | CSS
16 | Donna Building | 1D

ComputerStudents

Idnumber | Name | ClassCode
4 | Ellen Albright | MIS
9 | Crystal Adobe | CSS
22 | Peter Rock | CSO

Classes

ClassCode | ClassName
CSS | Cert in Computing
1D | Year 1 Comp Major
3D | Year 3 Comp Major

FinalYearClasses

ClassCode
3D

a) ICEPStudents ∪ ComputerStudents

b) ICEPStudents ∩ ComputerStudents

c) ICEPStudents – ComputerStudents

d) Classes ÷ FinalYearClasses

e) Π Name, ClassCode (ComputerStudents)

f) σ Idnumber > 6 (ICEPStudents)

g) Π Name, ClassCode (σ Idnumber > 6 (ComputerStudents))

h) σ Idnumber > 6 (Π Name (ICEPStudents))

i) ICEPStudents ⋈ Classes (Equi, Regular)

j) ICEPStudents ⋈ Classes (Outer, Natural)

Exercise 3

Employees

EMPNO | NAME | JOBNO
111 | Adams | 34
234 | Henry | 23
456 | Gregg | 23
121 | Brown | 78

Retired Employees

EMPNO | NAME | JOBNO
456 | Gregg | 23
789 | Jones | 12
369 | Wilson | 56

Jobs

JOBNO | JOBTITLE
12 | Mason
23 | Carpenter
34 | Plumber

Insured Jobs

JOBNO
23

a) Π NAME, JOBNO (Retired Employees) [3 marks]

b) Π JOBNO, EMPNO (σ JOBNO > 30 (Employees) ) [3 marks]

c) Employees ∪ Retired Employees [3 marks]

d) Retired Employees – Employees [3 marks]

e) Employees ∩ Retired Employees [3 marks]

f) σ EMPNO > 200 (Employees) [3 marks]

g) Jobs ÷ Insured Jobs [3 marks]

h) Employees ⋈ Jobs (Outer, Natural) [4 marks]


Exercise 4

a) Which relational algebra operation is unary?

b) If a Cartesian product is done from one table to itself, how would you prevent duplicate field names?


SQL Commands – LAB PORTION

What is SQL?

SQL is an abbreviation of structured query language, and is pronounced either see-kwell or as separate letters. SQL is a standardized query language for requesting information from a database. The original version, called SEQUEL (structured English query language), was designed at an IBM research center in 1974 and 1975. SQL was first introduced in a commercial database system in 1979 by Oracle Corporation.

Historically, SQL has been the favorite query language for database management systems running on minicomputers and mainframes. Increasingly, however, SQL is being supported by PC database systems because it supports distributed databases (databases that are spread out over several computer systems). This enables several users on a local-area network to access the same database simultaneously.

Although there are different dialects of SQL, it is nevertheless the closest thing to a standard query language that currently exists. In 1986, ANSI approved a rudimentary version of SQL as the official standard, but most versions of SQL since then have included many extensions to the ANSI standard. In 1992, ANSI updated the standard; the revised standard is commonly known as SQL-92 (or SQL2).

Please note that SQL command syntax varies slightly from one DBMS to another. Please note also that even though SQL is done in the lab, you are required to know the syntax by heart for the written final exam.

Oracle command

• At command line type CONNECT
• User Name SYSTEM
• Password ADMIN

MySQL command

• Start, Run, cmd <enter>
• mysql -u gcampbell -p -h exedvhost1 [can use 10.10.5.141 instead of host name]
• Pwd gcampbell

Brief Summary of Commands

1. Data Manipulation

Projection and Selection

    SELECT [field(s)]
    FROM [file(s)]
    WHERE [condition]
    GROUP BY [field]
    HAVING [condition]
    ORDER BY [field(s)]

Note that ORDER BY, when used, comes last.

[fields] may be:

    *                           all fields
    field1, field2, … fieldn
    substr(field, 1, 4)
    count(*)
    count(distinct dept)
    distinct(dept)
    sum(salary)                 also avg, min, max
    amount * 10

[files]

Join

    SELECT a.field, b.field     OR     SELECT file1.field, file2.field
    FROM file1 a, file2 b
    WHERE a.field = b.field

Union

    SELECT stmt 1 UNION ALL SELECT stmt 2

Modification

    UPDATE file SET field1 = value, field2 = field2 + 20 WHERE [condition]
    DELETE FROM file WHERE [condition]
    INSERT INTO file VALUES (x, y, z)
    INSERT INTO file SELECT stmt …

WHERE Clause

    Field IN ('A', 'B', 'C')
    Dept LIKE 'A%'
    Dept [NOT] LIKE 'E_'
    Dept BETWEEN 'A' AND 'C'
    Salary < 200 OR/AND sex = 'F'       (comparison operators: >, <, <>, =, >=, <=)

ORDER BY Clause

    ORDER BY name DESC, age ASC     OR     ORDER BY 2 (i.e. second field)

HAVING Clause

Used with a GROUP BY. Sets conditions for summary (grouped) data. E.g. HAVING count(*) > 3

2. Data Definition

    CREATE TABLE file (field1 CHAR(5) NOT NULL, field2 INT, field3 DEC(5,2))
    CREATE [UNIQUE] INDEX indexname ON file (field1 ASC, field2 DESC)
    CREATE VIEW viewname (field1, field2, field3) AS SELECT stmt …
    ALTER TABLE file ADD field CHAR(5)
    DROP TABLE file
    DROP INDEX indexname ON tablename
    DROP VIEW viewname

Control

    GRANT SELECT ON file TO PUBLIC
    REVOKE SELECT ON file FROM PUBLIC
    COMMIT
    ROLLBACK

MySQL data types

Auto_increment, Char, Boolean, Date, Dec/Decimal, Double, Double precision, Float, Int/Integer
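The order of the SELECT clauses can be tried out without a database server by using Python's built-in sqlite3 module. This is a sketch under stated assumptions: the staff table and its rows are invented for illustration, and SQLite's dialect differs in places from Oracle and MySQL, although this particular query uses only common syntax.

```python
import sqlite3

# In-memory database; the table and its rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("Ann", "A", 100), ("Bob", "A", 300), ("Cy", "B", 150),
        ("Di", "B", 250), ("Ed", "B", 200), ("Flo", "C", 500),
    ],
)

# WHERE filters rows, GROUP BY forms groups, HAVING filters groups, ORDER BY sorts last.
rows = conn.execute(
    """
    SELECT dept, COUNT(*), SUM(salary)
    FROM staff
    WHERE salary >= 150
    GROUP BY dept
    HAVING SUM(salary) > 400
    ORDER BY dept DESC
    """
).fetchall()

for row in rows:
    print(row)
```

Department A drops out at the HAVING stage: only Bob passes the WHERE condition, and his 300 alone does not exceed 400.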


CREATE TABLE (using constraints – primary key, foreign key)

The SQL command for creating an empty table has the following form:

    create table <table> (
        <column 1> <data type> [not null] [unique] [<column constraint>],
        . . . . . . . . .
        <column n> <data type> [not null] [unique] [<column constraint>],
        [<table constraint(s)>]
    );

For each column, a name and a data type must be specified, and the column name must be unique within the table definition. Column definitions are separated by commas. There is no difference between names in lower case letters and names in upper case letters. In fact, the only place where upper and lower case letters matter is in string comparisons.

A not null constraint is specified directly after the data type of the column, and the constraint requires defined attribute values for that column, different from null.

The keyword unique specifies that no two records can have the same attribute value for this column. Unless the condition not null is also specified for this column, the attribute value null is allowed, and two tuples having the attribute value null for this column do not violate the constraint.

Example: The create table statement for the EMP table has the form

    create table EMP (
        EMPNO number(4) not null,
        ENAME varchar2(30) not null,
        JOB varchar2(10),
        MGR number(4),
        HIREDATE date,
        SAL number(7,2),
        DEPTNO number(2)
    );

NB: Except for the columns EMPNO and ENAME, null values are allowed.

Oracle offers the following basic data types:

• char(n): Fixed-length character data (string), n characters long. The maximum size for n is 255 bytes (2000 in Oracle8). Note that a string of type char is always padded on the right with blanks to the full length of n (this can be memory consuming). Example: char(40)

• varchar2(n): Variable-length character string. The maximum size for n is 2000 (4000 in Oracle8). Only the bytes used for a string require storage. Example: varchar2(80)

• number(o, d): Numeric data type for integers and reals. o = overall number of digits, d = number of digits to the right of the decimal point. Maximum values: o = 38, d = −84 to +127. Examples: number(8), number(5,2). Note that, e.g., number(5,2) cannot contain anything larger than 999.99 without resulting in an error. Data types derived from number are int[eger], dec[imal], smallint and real.

• date: Date data type for storing date and time. The default format for a date is DD-MON-YY. Examples: '13-OCT-94', '07-JAN-98'

• long: Character data up to a length of 2GB. Only one long column is allowed per table.

It should be noted that data types vary from one database to another.

The definition of a table may include the specification of integrity constraints. Basically two types of constraints are provided: column constraints are associated with a single column, whereas table constraints are typically associated with more than one column. However, any column constraint can also be formulated as a table constraint. The specification of a (simple) constraint has the following form:

    [constraint <name>] primary key | unique | not null

A constraint can be named. It is advisable to name a constraint in order to get more meaningful information when this constraint is violated due to, e.g., an insertion of a record that violates the constraint. If no name is specified for the constraint, Oracle automatically generates a name of the pattern SYS_C<number>.

The two simplest types of constraints have already been discussed: not null and unique. Probably the most important type of integrity constraint in a database is the primary key constraint. A primary key constraint enables a unique identification of each record in a table. Based on a primary key, the database system ensures that no duplicates appear in a table.

Example:

    create table EMP (
        EMPNO number(4) constraint pk_emp primary key,
        . . . );

For our EMP table in the example above, the specification defines the attribute EMPNO as the primary key for the table. Each value for the attribute EMPNO thus must appear only once in the table EMP. A table, of course, may only have one primary key. Note that in contrast to a unique constraint, null values are not allowed.

Example: We want to create a table called PROJECT to store information about projects. For each project, we want to store the number and the name of the project, the employee number of the project’s manager, the budget and the number of persons working on the project, and the start date and end date of the project. Furthermore, we have the following conditions:

- a project is identified by its project number,
- the name of a project must be unique,
- the manager and the budget must be defined.

Table definition:

    create table PROJECT (
        PNO number(3) constraint prj_pk primary key,
        PNAME varchar2(60) unique,
        PMGR number(4) not null,
        PERSONS number(5),
        BUDGET number(8,2) not null,
        PSTART date,
        PEND date);

A unique constraint can include more than one attribute. In this case the pattern unique(<column i>, . . . , <column j>) is used. If it is required, for example, that no two projects have the same start and end date, we have to add the table constraint

    constraint no_same_dates unique(PEND, PSTART)

This constraint has to be defined in the create table command after both columns PEND and PSTART have been defined. A primary key constraint that includes more than one column can be specified in an analogous way.

Instead of a not null constraint it is sometimes useful to specify a default value for an attribute if no value is given, e.g., when a tuple is inserted. For this, we use the default clause.

Example: If no start date is given when inserting a tuple into the table PROJECT, the project start date should be set to January 1st, 1995:

    PSTART date default('01-JAN-95')

Examples:

    Create table Employee (
        empno int,
        empname char(40),
        deptcode char(3),
        salary number(6,2),
        dateofbirth date,
        primary key (empno),
        constraint EmpC foreign key (deptcode) references DeptTable);

    CREATE TABLE SUPPLIERS (
        SNO CHAR(5),
        SNAME CHAR(20) NOT NULL,
        STATUS DEC(3),
        CITY CHAR(15),
        PRIMARY KEY ( SNO ) )

    CREATE TABLE PARTS (
        PNO CHAR(6),
        PNAME CHAR(20),
        COLOR CHAR(6),
        WEIGHT DEC(3),
        CITY CHAR(15),
        PRIMARY KEY ( PNO ) )

    CREATE TABLE INVENTORY (
        SNO CHAR(5),
        PNO CHAR(6),
        QTY DEC(5),
        PRIMARY KEY ( SNO, PNO ),
        FOREIGN KEY ( SNO ) REFERENCES SUPPLIERS,
        CONSTRAINT FKC FOREIGN KEY ( PNO ) REFERENCES PARTS )

NB. FKC is the name of the constraint.
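To see constraints being enforced, the sketch below recreates cut-down versions of the SUPPLIERS, PARTS and INVENTORY tables in an in-memory SQLite database through Python's sqlite3 module. It is an illustration only and rests on several assumptions: SQLite stands in for Oracle, the column lists are shortened, a unique constraint on PNAME is added purely for demonstration, the inserted rows are invented, and the helper function violates is hypothetical. SQLite checks foreign keys only after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("""CREATE TABLE SUPPLIERS (
    SNO CHAR(5) PRIMARY KEY,
    SNAME CHAR(20) NOT NULL,
    CITY CHAR(15))""")
conn.execute("""CREATE TABLE PARTS (
    PNO CHAR(6) PRIMARY KEY,
    PNAME CHAR(20) UNIQUE)""")
conn.execute("""CREATE TABLE INVENTORY (
    SNO CHAR(5),
    PNO CHAR(6),
    QTY DEC(5),
    PRIMARY KEY (SNO, PNO),
    FOREIGN KEY (SNO) REFERENCES SUPPLIERS (SNO),
    CONSTRAINT FKC FOREIGN KEY (PNO) REFERENCES PARTS (PNO))""")


def violates(sql):
    """Run one statement; return True if the DBMS rejects it for breaking a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# Valid rows are accepted.
conn.execute("INSERT INTO SUPPLIERS VALUES ('S1', 'Smith', 'London')")
conn.execute("INSERT INTO PARTS VALUES ('P1', 'Nut')")
conn.execute("INSERT INTO INVENTORY VALUES ('S1', 'P1', 300)")

duplicate_key = violates("INSERT INTO SUPPLIERS VALUES ('S1', 'Jones', 'Paris')")   # primary key
missing_name = violates("INSERT INTO SUPPLIERS VALUES ('S2', NULL, 'Paris')")       # not null
duplicate_name = violates("INSERT INTO PARTS VALUES ('P2', 'Nut')")                 # unique
orphan_row = violates("INSERT INTO INVENTORY VALUES ('S9', 'P1', 10)")              # foreign key

print(duplicate_key, missing_name, duplicate_name, orphan_row)
```

Each rejected statement leaves the table unchanged, which is the point of declaring constraints in the table definition rather than checking them in every application program.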

ALTER TABLE

It is possible to modify the structure of a table (the relation schema) even if records have already been inserted into this table. A column can be added using the alter table command

    alter table <table> add(<column> <data type> [default <value>] [<column constraint>]);

Example:

    Alter table Employees add column nisno char(6);


If more than one column should be added at one time, the respective add clauses need to be separated by commas. A table constraint can be added to a table using

    alter table <table> add (<table constraint>);

Note that a column constraint is a table constraint, too. not null and primary key constraints can only be added to a table if none of the specified columns contains a null value.

Column definitions can be modified in an analogous way. This is useful, e.g., when the size of strings that can be stored needs to be increased. The syntax of the command for modifying a column is

    alter table <table> modify(<column> [<data type>] [default <value>] [<column constraint>]);

Example:

    Alter table Employees modify lastname char(35);

[NB. Use alter instead of modify for some DBMSs]

A column can be removed using the following:

    Alter table <table> drop column <column>;

Examples:

    Alter table Employees drop column Address3;
    ALTER TABLE SUPPLIERS ADD COLUMN STATE CHAR(15)
    ALTER TABLE SUPPLIERS DROP COLUMN CITY
    ALTER TABLE SUPPLIERS ADD TRN INT
    ALTER TABLE PARTS ADD DISCOUNT SMALLINT
    ALTER TABLE PARTS ALTER COLUMN COLOR CHAR(10)     [in SQL Server]
    ALTER TABLE PARTS MODIFY COLOR CHAR(10)           [in Oracle and MySQL]
    ALTER TABLE INVENTORY DROP CONSTRAINT FKC
    ALTER TABLE STUDENTS ADD CONSTRAINT FKC FOREIGN KEY (DEPTID) REFERENCES DEPARTMENTS

INSERT

The simplest way to insert a record into a table is to use the insert statement

    insert into <table> [(<column i, . . . , column j>)] values (<value i, . . . , value j>);

For each of the listed columns, a corresponding (matching) value must be specified. Therefore an insertion does not necessarily have to follow the order of the attributes as specified in the create table statement. If a column is omitted, the value null is inserted instead. If no column list is given, however, a value must be given for each column as defined in the create table statement.

Examples:

    insert into PROJECT(PNO, PNAME, PMGR, PERSONS, BUDGET, PSTART)
    values(313, 'DBS', 7411, 4, 150000.42, '10-OCT-94');

or

    insert into PROJECT
    values(313, 'DBS', 7411, null, 150000.42, '10-OCT-94', null);

(PMGR appears in the column list of the first form because it was declared not null and so cannot be omitted.)

If there are already some data in other tables, these data can be used for insertions into a new table. For this, we write a query whose result is a set of records to be inserted. Such an insert statement has the form

    insert into <table> [(<column i, . . . , column j>)] <query>

Example: Suppose we have defined the following table:

    create table OLDEMP (
        ENO number(4) not null,
        HDATE date);

We now can use the table EMP to insert records into this new relation:

    insert into OLDEMP (ENO, HDATE)
    select EMPNO, HIREDATE from EMP
    where HIREDATE < '31-DEC-60';
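Both forms of insert can be tried with Python's sqlite3 module. The sketch below makes some assumptions that differ from the Oracle examples: SQLite has no date type, so dates are stored as ISO text (YYYY-MM-DD) that compares correctly as strings; the EMP table is cut down to four columns; the three rows are sample values; and the cut-off date is chosen so that one row qualifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMP (EMPNO INTEGER NOT NULL, ENAME TEXT NOT NULL, JOB TEXT, HIREDATE TEXT)"
)

# Column list given: the omitted column JOB receives null.
conn.execute("INSERT INTO EMP (EMPNO, ENAME, HIREDATE) VALUES (7369, 'SMITH', '1980-12-17')")
# No column list: a value must be supplied for every column, in table order.
conn.execute("INSERT INTO EMP VALUES (7499, 'ALLEN', 'SALESMAN', '1981-02-20')")
conn.execute("INSERT INTO EMP VALUES (7902, 'FORD', 'ANALYST', '1981-12-03')")

# Insert the result of a query into another table.
conn.execute("CREATE TABLE OLDEMP (ENO INTEGER NOT NULL, HDATE TEXT)")
conn.execute(
    "INSERT INTO OLDEMP (ENO, HDATE) "
    "SELECT EMPNO, HIREDATE FROM EMP WHERE HIREDATE < '1981-01-01'"
)

smith_job = conn.execute("SELECT JOB FROM EMP WHERE EMPNO = 7369").fetchone()[0]
old_rows = conn.execute("SELECT ENO, HDATE FROM OLDEMP").fetchall()
print(smith_job, old_rows)
```

Python shows the null in the JOB column as None.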

SELECT (using WHERE, GROUP BY, ORDER BY, HAVING, aggregate functions, logical operators, comparison operators)

In order to retrieve the information stored in the database, the SQL query language is used. In SQL a query has the following (simplified) form (components in brackets [ ] are optional):

    select [distinct] <column(s)>
    from <table>
    [ where <condition> ]
    [ order by <column(s) [asc|desc]> ]

Selecting Columns

The columns to be selected from a table are specified after the keyword select. This operation is also called projection. For example, the query

    select LOC, DEPTNO from DEPT;

lists only the number and the location for each tuple from the relation DEPT. If all columns should be selected, the asterisk symbol “*” can be used to denote all attributes. The query

    select * from EMP;

retrieves all records with all columns from the table EMP. Instead of an attribute name, the select clause may also contain arithmetic expressions involving arithmetic operators etc.

    select ENAME, DEPTNO, SAL * 1.55 from EMP;

For the different data types supported in Oracle, several operators and functions are provided:

• for numbers: abs, cos, sin, exp, log, power, mod, sqrt, +, −, *, /, . . .
• for strings: chr, concat(string1, string2), lower, upper, replace(string, search string, replacement string), translate, substr(string, m, n), length, to_date, . . .
• for the date data type: add_months, months_between, next_day, to_char, . . .

Consider the query

    select DEPTNO from EMP;

which retrieves the department number for each record. Typically, some numbers will appear more than once in the query result, that is, duplicate result records are not automatically eliminated. Inserting the keyword distinct after the keyword select, however, forces the elimination of duplicates from the query result.

It is also possible to specify a sorting order in which the result records of a query are displayed. For this the order by clause is used, which takes one or more attributes listed in the select clause as parameters. desc specifies a descending order and asc specifies an ascending order (this is also the default order). For example, the query

    select ENAME, DEPTNO, HIREDATE
    from EMP
    order by DEPTNO [asc], HIREDATE desc;

displays the result in ascending order by the attribute DEPTNO. If two records have the same attribute value for DEPTNO, the sorting criterion is a descending order by the attribute values of HIREDATE. For the above query, we would get the following output:

ENAME | DEPTNO | HIREDATE
FORD | 10 | 03-DEC-81
SMITH | 20 | 17-DEC-80
BLAKE | 30 | 01-MAY-81
WARD | 30 | 22-FEB-81
ALLEN | 30 | 20-FEB-81

Selection of Records

Up to now we have only focused on selecting (some) attributes of all records from a table. If one is interested in records that satisfy certain conditions, the where clause is used. In a where clause simple conditions based on comparison operators can be combined using the logical connectives and, or, and not to form complex conditions. Conditions may also include pattern matching operations and even subqueries.


Example: List the job title and the salary of those employees whose manager has the number 7698 or 7566 and who earn more than 1500:

    select JOB, SAL from EMP
    where (MGR = 7698 or MGR = 7566) and SAL > 1500;

For all data types, the comparison operators =, != or <>, <, >, <=, >= are allowed in the conditions of a where clause. Further comparison operators are:

• Set conditions: <column> [not] in (<list of values>)
Example: select * from DEPT where DEPTNO in (20,30);

• Null value: <column> is [not] null, i.e., for a tuple to be selected there must (not) exist a defined value for this column.
Example: select * from EMP where MGR is not null;
Note: the operations = null and != null are not defined!

• Domain conditions: <column> [not] between <lower bound> and <upper bound>
Examples:
select EMPNO, ENAME, SAL from EMP where SAL between 1500 and 2500;
select ENAME from EMP where HIREDATE between '02-APR-81' and '08-SEP-81';
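The conditions listed above can be tried with Python's sqlite3 module. This is a sketch only: the seven EMP rows are sample values entered for the example, SQLite is used in place of Oracle, and the helper function names is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMP (EMPNO INTEGER, ENAME TEXT, JOB TEXT, MGR INTEGER, SAL INTEGER, DEPTNO INTEGER)"
)
conn.executemany(
    "INSERT INTO EMP VALUES (?, ?, ?, ?, ?, ?)",
    [
        (7839, "KING", "PRESIDENT", None, 5000, 10),
        (7698, "BLAKE", "MANAGER", 7839, 2850, 30),
        (7566, "JONES", "MANAGER", 7839, 2975, 20),
        (7499, "ALLEN", "SALESMAN", 7698, 1600, 30),
        (7521, "WARD", "SALESMAN", 7698, 1250, 30),
        (7902, "FORD", "ANALYST", 7566, 3000, 20),
        (7369, "SMITH", "CLERK", 7902, 800, 20),
    ],
)


def names(where):
    """Employee names satisfying the given where condition, in alphabetical order."""
    sql = "SELECT ENAME FROM EMP WHERE " + where + " ORDER BY ENAME"
    return [row[0] for row in conn.execute(sql)]


combined = names("(MGR = 7698 OR MGR = 7566) AND SAL > 1500")   # logical connectives
in_set = names("DEPTNO IN (20, 30)")                             # set condition
no_manager = names("MGR IS NULL")                                # null test
wrong_null_test = names("MGR = NULL")                            # never true: returns no rows
in_range = names("SAL BETWEEN 1500 AND 2500")                    # domain condition

print(combined, no_manager, wrong_null_test, in_range)
```

The comparison MGR = NULL returns no rows at all, even though KING has no manager, which is why is null must be used instead.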

String Operations

In order to compare an attribute with a string, it is required to surround the string by apostrophes, e.g., where LOCATION = 'DALLAS'. A powerful operator for pattern matching is the like operator. Together with this operator, two special characters are used: the percent sign % (also called wild card), and the underline _ (also called position marker). For example, if one is interested in all records of the table DEPT that contain two Cs in the name of the department, the condition would be where DNAME like '%C%C%'. The percent sign means that any (sub)string is allowed there, even the empty string. In contrast, the underline stands for exactly one character. Thus the condition where DNAME like '%C_C%' would require that exactly one character appears between the two Cs. To test for inequality, the not clause is used.

Further string operations are:

• upper(<string>) takes a string and converts any letters in it to uppercase, e.g., DNAME = upper(DNAME) (The name of a department must consist only of upper case letters.)
• lower(<string>) converts any letter to lowercase.
• initcap(<string>) converts the initial letter of every word in <string> to uppercase.
• length(<string>) returns the length of the string.
• substr(<string>, n [, m]) clips out an m-character piece of <string>, starting at position n. If m is not specified, the end of the string is assumed. E.g. substr('DATABASE SYSTEMS', 10, 7) returns the string 'SYSTEMS'.

Aggregate Functions


Aggregate functions are statistical functions such as count, min, max etc. They are used to compute a single value from a set of attribute values of a column:

• count Counting rows
Example: How many records are stored in the relation EMP?
    select count(*) from EMP;
Example: How many different job titles are stored in the relation EMP?
    select count(distinct JOB) from EMP;

• max Maximum value for a column
• min Minimum value for a column
Example: List the minimum and maximum salary.
    select min(SAL), max(SAL) from EMP;
Example: Compute the difference between the minimum and maximum salary.
    select max(SAL) - min(SAL) from EMP;

• sum Computes the sum of values (only applicable to the data type number)
Example: Sum of all salaries of employees working in the department 30.
    select sum(SAL) from EMP where DEPTNO = 30;

• avg Computes the average value for a column (only applicable to the data type number)
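The aggregate functions listed above can be checked against a table small enough to verify by hand. This is a minimal sketch, assuming a hypothetical three-row EMP table in which the commission column COMM is null for one employee (table layout and values are illustrative assumptions); run it in an empty schema.

```sql
-- Hypothetical miniature of EMP; COMM (commission) is null for one row.
create table EMP (
  EMPNO integer primary key,
  SAL   numeric(7,2),
  COMM  numeric(7,2)
);
insert into EMP values (1, 1000, null);
insert into EMP values (2, 2000, 300);
insert into EMP values (3, 3000, 500);

-- count(*) counts rows: 3.  count(COMM) counts non-null values only: 2.
-- avg(COMM) averages only the non-null values: (300 + 500) / 2 = 400.
-- sum(SAL) = 6000, min(SAL) = 1000, max(SAL) = 3000.
select count(*), count(COMM), avg(COMM), sum(SAL), min(SAL), max(SAL)
from EMP;
```

Note that avg(COMM) is 400 and not 266.67: the row with the null commission is left out of both the sum and the count.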

Note: avg, min, max and sum ignore tuples that have a null value for the specified attribute. count(*) counts every row, including rows that contain null values, whereas count(<column>) counts only the rows in which <column> is not null.

Joining Tables

Thus far we have only focused on queries that refer to exactly one table. Furthermore, conditions in a where clause were restricted to simple comparisons. A major feature of relational databases, however, is the ability to combine (join) records stored in different tables in order to display more meaningful and complete information. In SQL the select statement is used for this kind of query joining relations:

    select [distinct] [<alias ak>.]<column i>, ..., [<alias al>.]<column j>
    from <table 1> [<alias a1>], ..., <table n> [<alias an>]
    [where <condition>]

The specification of table aliases in the from clause is necessary to refer to columns that have the same name in different tables. For example, the column DEPTNO occurs in both EMP and DEPT. If we want to refer to either of these columns in the where or select clause, a table alias has to be specified and put in front of the column name. Instead of a table alias, the complete relation name can also be put in front of the column, such as DEPT.DEPTNO, but this can sometimes lead to rather lengthy query formulations.


Comparisons in the where clause are used to combine rows from the tables listed in the from clause.

Example: In the table EMP only the numbers of the departments are stored, not their names. For each salesman, we now want to retrieve his name as well as the number and the name of the department where he is working:
    select ENAME, E.DEPTNO, DNAME
    from EMP E, DEPT D
    where E.DEPTNO = D.DEPTNO and JOB = 'SALESMAN';

Any number of tables can be combined in a select statement.

Example: For each project, retrieve its name, the name of its manager, and the name of the department where the manager is working:
    select ENAME, DNAME, PNAME
    from EMP E, DEPT D, PROJECT P
    where E.EMPNO = P.MGR and D.DEPTNO = E.DEPTNO;

It is even possible to join a table with itself.

Example: List the names of all employees together with the name of their manager:
    select E1.ENAME, E2.ENAME
    from EMP E1, EMP E2
    where E1.MGR = E2.EMPNO;
Explanation: The join columns are MGR for the table E1 and EMPNO for the table E2. The equijoin comparison is E1.MGR = E2.EMPNO.
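The self-join is the least intuitive of the three joins above, so it helps to run it on data that can be checked by eye. The sketch below assumes a hypothetical three-row EMP table; the column aliases EMPLOYEE and MANAGER are added here only to tell the two ENAME columns apart in the output. Run it in an empty schema.

```sql
-- Hypothetical miniature of EMP: KING is at the top and has no manager.
create table EMP (
  EMPNO integer primary key,
  ENAME varchar(10),
  MGR   integer
);
insert into EMP values (7839, 'KING',  null);
insert into EMP values (7698, 'BLAKE', 7839);
insert into EMP values (7499, 'ALLEN', 7698);

-- E1 plays the role of "employee", E2 the role of "manager".
-- Result: (BLAKE, KING) and (ALLEN, BLAKE).
select E1.ENAME EMPLOYEE, E2.ENAME MANAGER
from EMP E1, EMP E2
where E1.MGR = E2.EMPNO;
```

KING does not appear in the EMPLOYEE column: his MGR is null, and a null value never satisfies the comparison E1.MGR = E2.EMPNO.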

SELECT sub queries

Up to now we have only concentrated on simple comparison conditions in a where clause, i.e., we have compared a column with a constant or we have compared two columns. As we have already seen for the insert statement, queries can be used for assignments to columns. A query result can also be used in a condition of a where clause. In such a case the query is called a subquery and the complete select statement is called a nested query. A respective condition in the where clause then can have one of the following forms:

1. Set-valued subqueries
    <expression> [not] in (<subquery>)
    <expression> <comparison operator> [any|all] (<subquery>)
An <expression> can either be a column or a computed value.
2. Test for (non)existence
    [not] exists (<subquery>)


In a where clause, conditions using subqueries can be combined arbitrarily by using the logical connectives and and or.

Example: List the name and salary of employees of the department 20 who are leading a project that started before December 31, 1990:
    select ENAME, SAL from EMP
    where EMPNO in (select PMGR from PROJECT where PSTART < '31-DEC-90')
    and DEPTNO = 20;
Explanation: The subquery retrieves the set of those employees who manage a project that started before December 31, 1990. If an employee working in department 20 is contained in this set (in operator), his record belongs to the query result set.

Example: List all employees who are working in a department located in BOSTON:
    select * from EMP
    where DEPTNO in (select DEPTNO from DEPT where LOC = 'BOSTON');
The subquery retrieves only one value (the number of the department located in Boston). Thus it is possible to use "=" instead of in. As long as the result of a subquery is not known in advance, i.e., whether it is a single value or a set, it is advisable to use the in operator.

A subquery may again use a subquery in its where clause. Thus conditions can be nested arbitrarily. An important class of subqueries are those that refer to the surrounding (sub)query and the tables listed in its from clause. Such queries are called correlated subqueries.

Example: List all those employees who are working in the same department as their manager (note that components in [ ] are optional):
    select * from EMP E1
    where DEPTNO in (select DEPTNO from EMP [E] where [E.]EMPNO = E1.MGR);
Explanation: The subquery in this example is related to its surrounding query since it refers to the column E1.MGR. A record is selected from the table EMP (E1) for the query result if the value for the column DEPTNO occurs in the set of values selected in the subquery. One can think of the evaluation of this query as follows: for each tuple in the table E1, the subquery is evaluated individually. If the condition where DEPTNO in ... evaluates to true, this tuple is selected. Note that an alias for the table EMP in the subquery is not necessary since columns without a preceding alias listed there always refer to the innermost query and tables.


Conditions of the form <expression> <comparison operator> [any|all] (<subquery>) are used to compare a given <expression> with each value selected by <subquery>.
• For the clause any, the condition evaluates to true if there exists at least one row selected by the subquery for which the comparison holds. If the subquery yields an empty result set, the condition is not satisfied.
• For the clause all, in contrast, the condition evaluates to true if the comparison holds for all rows selected by the subquery. In this case the condition also evaluates to true if the subquery does not yield any row or value.

Example: Retrieve all employees who are working in department 10 and who earn at least as much as any (i.e., at least one) employee working in department 30:
    select * from EMP
    where SAL >= any (select SAL from EMP where DEPTNO = 30)
    and DEPTNO = 10;
Note: Also in this subquery no aliases are necessary since the columns refer to the innermost from clause.

Example: List all employees who are not working in department 30 and who earn more than all employees working in department 30:
    select * from EMP
    where SAL > all (select SAL from EMP where DEPTNO = 30)
    and DEPTNO <> 30;

For all and any, the following equivalences hold:
    in is equivalent to = any
    not in is equivalent to <> all or != all

Often a query result depends on whether certain rows do (not) exist in (other) tables. Such queries are formulated using the exists operator.

Example: List all departments that have no employees:
    select * from DEPT
    where not exists (select * from EMP where DEPTNO = DEPT.DEPTNO);
Explanation: For each tuple from the table DEPT, the condition is checked whether there exists a record in the table EMP that has the same department number (DEPT.DEPTNO). In case no such record exists, the condition is satisfied for the tuple under consideration and it is selected. If there exists a corresponding record in the table EMP, the record is not selected.

Example: List workers who receive a higher rate than the average hourly rate.
    select empname from employee
    where hrly_rate > (select avg(hrly_rate) from employee);

Example: List workers who get an hourly rate higher than the average of those workers reporting to the worker's supervisor.
    select a.name from worker a
    where a.hrly_rate > (select avg(b.hrly_rate) from worker b where b.supv_id = a.supv_id);
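The behaviour of any and all on an empty subquery result is easy to get wrong, so the sketch below checks it directly. It assumes a hypothetical three-row EMP table in which nobody works in department 40 (table layout, values, and the department number 40 are illustrative assumptions); run it in an empty schema.

```sql
-- Hypothetical miniature of EMP; there is deliberately no department 40.
create table EMP (
  EMPNO  integer primary key,
  SAL    numeric(7,2),
  DEPTNO integer
);
insert into EMP values (1, 1000, 10);
insert into EMP values (2, 3000, 10);
insert into EMP values (3, 2000, 30);

-- Empty subquery with all: the condition is true for every row,
-- so both employees of department 10 (EMPNO 1 and 2) are returned.
select EMPNO from EMP
where DEPTNO = 10
and SAL > all (select SAL from EMP where DEPTNO = 40);

-- Empty subquery with any: the condition is never true, so no rows are returned.
select EMPNO from EMP
where DEPTNO = 10
and SAL > any (select SAL from EMP where DEPTNO = 40);

-- Non-empty subquery (department 30 pays 2000): only EMPNO 2 (3000) qualifies.
select EMPNO from EMP
where DEPTNO = 10
and SAL > all (select SAL from EMP where DEPTNO = 30);
```

The first query is the one to watch for in practice: a mistyped department number makes "> all" select every candidate row instead of none.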

Operations on Result Sets

Sometimes it is useful to combine query results from two or more queries into a single result. SQL supports three set operators, which have the pattern:
    <query 1> <set operator> <query 2>
The set operators are:
• union [all] returns a table consisting of all rows appearing either in the result of <query 1> or in the result of <query 2>. Duplicates are automatically eliminated unless the clause all is used.
• intersect returns all rows that appear in both results <query 1> and <query 2>.
• minus returns those rows that appear in the result of <query 1> but not in the result of <query 2>. [NB. In other DBMSs use EXCEPT instead of MINUS]

Example: Assume that we have a table EMP2 that has the same structure and columns as the table EMP:
• All employee numbers and names from both tables:
    select EMPNO, ENAME from EMP union select EMPNO, ENAME from EMP2;
• Employees who are listed in both EMP and EMP2:
    select * from EMP intersect select * from EMP2;
• Employees who are only listed in EMP:
    select * from EMP minus select * from EMP2;

Each operator requires that both queries return the same number of columns and that corresponding columns have the same data types.
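The difference between the set operators shows up clearly when the two tables overlap in exactly one row. The following is a minimal sketch, assuming hypothetical two-column versions of EMP and EMP2 with made-up rows; minus is the Oracle spelling, and most other DBMSs expect except in its place. Run it in an empty schema.

```sql
-- Hypothetical two-column versions of EMP and EMP2; JONES is in both.
create table EMP  (EMPNO integer, ENAME varchar(10));
create table EMP2 (EMPNO integer, ENAME varchar(10));
insert into EMP  values (1, 'SMITH');
insert into EMP  values (2, 'JONES');
insert into EMP2 values (2, 'JONES');
insert into EMP2 values (3, 'ADAMS');

-- union removes the duplicate JONES row: 3 rows.
select EMPNO, ENAME from EMP union select EMPNO, ENAME from EMP2;

-- union all keeps duplicates: 4 rows, JONES appears twice.
select EMPNO, ENAME from EMP union all select EMPNO, ENAME from EMP2;

-- intersect: only the row present in both tables, (2, JONES).
select EMPNO, ENAME from EMP intersect select EMPNO, ENAME from EMP2;

-- minus: rows of EMP that are not in EMP2, (1, SMITH).
select EMPNO, ENAME from EMP minus select EMPNO, ENAME from EMP2;
```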

Grouping

In previous sections we have seen how aggregate functions can be used to compute a single value for a column. Often applications require grouping rows that have certain properties and then applying an aggregate function on one column for each group separately. For this, SQL provides the clause group by <group column(s)>. This clause appears after the where clause and must refer to columns of tables listed in the from clause.

    select <column(s)>
    from <table(s)>
    where <condition>
    group by <group column(s)>
    [having <group condition(s)>];

Those rows retrieved by the select statement that have the same value(s) for <group column(s)> are grouped. Aggregations specified in the select clause are then applied to each group separately. It is important that only those columns that appear in the <group column(s)> clause can be listed without an aggregate function in the select clause!

Example: For each department, we want to retrieve the minimum and maximum salary.
    select DEPTNO, min(SAL), max(SAL) from EMP group by DEPTNO;
Rows from the table EMP are grouped such that all rows in a group have the same department number. The aggregate functions are then applied to each such group. We thus get the following query result:

    DEPTNO  MIN(SAL)  MAX(SAL)
        10      1300      5000
        20       800      3000
        30       950      2850

Rows to form a group can be restricted in the where clause. For example, if we add the condition where JOB = 'CLERK', only the rows for clerks form the groups. The query then would retrieve the minimum and maximum salary of all clerks for each department. Note that it is not allowed to specify any column other than DEPTNO without an aggregate function in the select clause, since this is the only column listed in the group by clause (it is also easy to see that other columns would not make any sense).

Once groups have been formed, certain groups can be eliminated based on their properties, e.g., if a group contains fewer than three rows. This type of condition is specified using the having clause. As for the select clause, in a having clause only <group column(s)> and aggregations can be used.

Example: Retrieve the minimum and maximum salary of clerks for each department having more than three clerks.
    select DEPTNO, min(SAL), max(SAL)
    from EMP
    where JOB = 'CLERK'
    group by DEPTNO
    having count(*) > 3;


Note that it is even possible to specify a subquery in a having clause. In the above query, for example, instead of the constant 3, a subquery can be specified.

A query containing a group by clause is processed in the following way:
1. Select all rows that satisfy the condition specified in the where clause.
2. From these rows form groups according to the group by clause.
3. Discard all groups that do not satisfy the condition in the having clause.
4. Apply aggregate functions to each group.
5. Retrieve values for the columns and aggregations listed in the select clause.
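The processing steps above can be traced by hand on a small table. This is a minimal sketch, assuming a hypothetical five-row EMP table and a having threshold of 1 instead of 3 so that the sample can stay short (rows and threshold are illustrative assumptions); run it in an empty schema.

```sql
-- Hypothetical miniature of EMP with four clerks and one manager.
create table EMP (
  EMPNO  integer primary key,
  JOB    varchar(9),
  SAL    numeric(7,2),
  DEPTNO integer
);
insert into EMP values (1, 'CLERK',   1300, 10);
insert into EMP values (2, 'CLERK',   1100, 20);
insert into EMP values (3, 'CLERK',    800, 20);
insert into EMP values (4, 'MANAGER', 2975, 20);
insert into EMP values (5, 'CLERK',    950, 30);

-- Step 1 (where):    the manager row is dropped, four clerk rows remain.
-- Step 2 (group by): groups are 10 -> {1300}, 20 -> {1100, 800}, 30 -> {950}.
-- Step 3 (having):   only department 20 has more than one clerk.
-- Result: one row, (20, 2, 800, 1100).
select DEPTNO, count(*), min(SAL), max(SAL)
from EMP
where JOB = 'CLERK'
group by DEPTNO
having count(*) > 1;
```

The manager's salary of 2975 never reaches max(SAL) for department 20 because the where clause removes that row before the groups are formed.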

UPDATE

For modifying attribute values of (some) records in a table, we use the update statement:
    update <table> set <column i> = <expression i>, ..., <column j> = <expression j>
    [where <condition>];
An expression consists of either a constant (new value), an arithmetic or string operation, or an SQL query. Note that the new value assigned to <column i> must have a matching data type. An update statement without a where clause changes the respective attributes of all tuples in the specified table. Typically, however, only a (small) portion of the table requires an update.

Examples:
• The employee JONES is transferred to the department 20 as a manager and his salary is increased by 1000:
    update EMP set JOB = 'MANAGER', DEPTNO = 20, SAL = SAL + 1000
    where ENAME = 'JONES';
• All employees working in the departments 10 and 30 get a 15% salary increase.
    update EMP set SAL = SAL * 1.15 where DEPTNO in (10,30);

Analogous to the insert statement, other tables can be used to retrieve data that are used as new values. In such a case we have a <query> instead of an <expression>.

Example: All salesmen working in the department 20 get the same salary as the manager who has the lowest salary among all managers.
    update EMP set SAL = (select min(SAL) from EMP where JOB = 'MANAGER')
    where JOB = 'SALESMAN' and DEPTNO = 20;
Explanation: The query retrieves the minimum salary of all managers. This value is then assigned to all salesmen working in department 20.

It is also possible to specify a query that retrieves more than one value (but still only one record!). In this case the set clause has the form set (<column i>, ..., <column j>) = (<query>). It is important that the order of data types and values of the selected row exactly corresponds to the list of columns in the set clause.

DELETE

All or selected records can be deleted from a table using the delete statement:
    delete from <table> [where <condition>];
If the where clause is omitted, all records are deleted from the table. An alternative command for deleting all records from a table is the truncate table <table> command. However, in this case, the deletions cannot be undone.

Example: Delete all projects (tuples) that were finished before the current date (system date):
    delete from PROJECT where PEND < sysdate;
sysdate is a function in Oracle SQL that returns the system date. Another important function is user, which returns the name of the user logged into the current Oracle session.

CREATE VIEW

NB. Not all DBMSs (e.g. MS-Access) have this command.

In Oracle the SQL command to create a view (virtual table) has the form
    create [or replace] view <view-name> [(<column(s)>)] as
    <select-statement> [with check option [constraint <name>]];
The optional clause or replace re-creates the view if it already exists. <column(s)> names the columns of the view. If <column(s)> is not specified in the view definition, the columns of the view get the same names as the attributes listed in the select statement (if possible).

Example: The following view contains the name, job title and the annual salary of employees working in the department 20:
    create view DEPT20 as
    select ENAME, JOB, SAL*12 ANNUAL_SALARY from EMP where DEPTNO = 20;
In the select statement the column alias ANNUAL_SALARY is specified for the expression SAL*12 and this alias is taken by the view. An alternative formulation of the above view definition is
    create view DEPT20 (ENAME, JOB, ANNUAL_SALARY) as
    select ENAME, JOB, SAL*12 from EMP where DEPTNO = 20;

A view can be used in the same way as a table, that is, records can be retrieved from a view (although the respective records are not physically stored, but derived on the basis of the select statement in the view definition), or records can even be modified. A view is evaluated again each time it is accessed. In Oracle SQL no insert, update, or delete modifications on views are allowed that use one of the following constructs in the view definition:
• Joins
• Aggregate functions such as sum, min, max etc.
• Set-valued subqueries (in, any, all) or test for existence (exists)
• group by clause or distinct clause

In combination with the clause with check option, any update or insertion of a row into the view is rejected if the new/modified row does not meet the view definition, i.e., if the row would not be selected by the select statement. A with check option can be named using the constraint clause.
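That a view stores no rows of its own can be seen by changing the base table and querying the view again. The sketch below assumes a hypothetical two-row EMP table (layout and values are illustrative assumptions) and reuses the DEPT20 view definition from the text; run it in an empty schema.

```sql
-- Hypothetical miniature of EMP with one employee in department 20.
create table EMP (
  EMPNO  integer primary key,
  ENAME  varchar(10),
  JOB    varchar(9),
  SAL    numeric(7,2),
  DEPTNO integer
);
insert into EMP values (7369, 'SMITH', 'CLERK',     800, 20);
insert into EMP values (7499, 'ALLEN', 'SALESMAN', 1600, 30);

create view DEPT20 (ENAME, JOB, ANNUAL_SALARY) as
select ENAME, JOB, SAL*12 from EMP where DEPTNO = 20;

-- One row: (SMITH, CLERK, 9600).
select * from DEPT20;

-- Change the base table, not the view.
update EMP set SAL = 900 where ENAME = 'SMITH';

-- The view is evaluated again: (SMITH, CLERK, 10800).
select * from DEPT20;
```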

CREATE INDEX

    create [unique] index <indexname> on <table> (<field> [asc|desc] [, <field> [asc|desc], ...])
    [with {primary | disallow null | ignore null}];
Example:
    create unique index Custid on Customers (CustomerID) with disallow null;
NB. The optional with clause is MS-Access syntax; other DBMSs such as Oracle do not accept it.

DROP TABLE

A table and its records can be deleted by issuing the command drop table <table> [cascade constraints];

DROP VIEW

A view can be deleted using the command drop view <view-name>;

DROP INDEX

    drop index Custid on Customers;
NB. In Oracle the table name is omitted: drop index <indexname>;

GRANT and REVOKE

    grant <privilege 1>, ..., <privilege n> on <table> to <username>;
    revoke <privilege> on <table> from <username>;
Examples:
    grant select on file to public;
    revoke select on file from public;
Examples of privileges to be granted: SELECT, DELETE, INSERT, UPDATE, DROP, CREATE


COMMIT and ROLLBACK

A sequence of database modifications, i.e., a sequence of insert, update, and delete statements, is called a transaction. Modifications of records are temporarily stored in the database system. They become permanent only after the commit command has been issued. As long as the user has not issued the commit statement, it is possible to undo all modifications since the last commit. To undo modifications, one has to issue the rollback command. It is advisable to complete each modification of the database with a commit (as long as the modification has the expected effect). Note that any data definition command such as create table results in an internal commit. A commit is also implicitly executed when the user terminates an Oracle session.
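The effect of commit and rollback can be followed in a short session. This is a minimal sketch, assuming an Oracle session in which automatic commit is switched off and a hypothetical two-column DEPT table (layout and rows are illustrative assumptions); run it in an empty schema.

```sql
-- create table is a data definition command, so it commits implicitly.
create table DEPT (
  DEPTNO integer primary key,
  DNAME  varchar(14)
);

insert into DEPT values (10, 'ACCOUNTING');
commit;      -- the ACCOUNTING row is now permanent

insert into DEPT values (20, 'RESEARCH');
update DEPT set DNAME = 'FINANCE' where DEPTNO = 10;
rollback;    -- undoes both the insert and the update since the last commit

-- One row remains: (10, ACCOUNTING).
select * from DEPT;
```

If a data definition command such as create table had been issued between the second insert and the rollback, its implicit commit would have made the insert and the update permanent.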


SQL EXERCISES

EXERCISE 1 – CREATE TABLE AND ALTER TABLE STATEMENTS

1. Create a table called DEPARTMENTS with the following fields:- department 4 characters, deptname 50 characters, depthead 50 characters. The primary key of the table is department. Please note that the deptname field is a compulsory field.
2. Create a table called MorantBayDepts. It has the same structure as Departments.
3. Create a table called STUDENTS with the following fields:- idnum numeric, firstname, lastname each 20 characters, address with 50 characters, telephone long integer, sex 1 character, maritalstatus 1 character, department 4 characters, DOB date, schoolfee currency. The primary key is idnum, the field department should be used to link this table to the departments table. Please also note that the firstname field is a compulsory field. Please name the link so that you can delete it later.
4. You forgot the status field, please add it to the table, it is 10 characters long.
5. You no longer need the field maritalstatus, remove it from the table.
6. You have realized that 20 characters is not enough for the lastname, increase it to 25.
7. Remove the link between the two tables.
8. Add back the link between the two tables.
9. Add back the field maritalstatus.

EXERCISE 2 – INSERT, UPDATE, DELETE, SELECT USING UNION

1. Use the insert command to add data to the departments table
2. Use the insert command to add data to the students table
3. Student with idnum 4 changed address to 9 Brentwood Rd
4. The School board made a ruling that the minimum school fee for all programs is $10,000. Change the schoolfee to $10,000 for all students whose school fee is less than $10,000.
5. Add $4,000 to the schoolfee of all TVED students.
6. Student with idnum 12 got married, change her last name to Gordon and her maritalstatus to M.
7. Add a new record to the Departments table. department is IT, deptname is Information Technology, depthead is Mr. Davis.
8. Delete all students whose status says GRADUATED
9. Add 3 records to the MorantBayDepts table.
10. Display all records from both Departments and MorantBayDepts
11. Display all records in the departments table that start with the letter C as well as all records in the MorantBayDepts table.
12. Copy all of the records in the MorantBayDepts to the Departments table.
13. Delete all records from the MorantBayDepts table.

EXERCISE 3 - SELECT STATEMENT

1. All records and all fields in the Students table.
2. All records and all fields in the Departments table.
3. All fields in the students table for those who are in CS department
4. The idnum, firstname and lastname of all students.
5. The idnum, firstname and lastname of all students sorted by lastname
6. The idnum, firstname and lastname of all students sorted by lastname in descending order.
7. The idnum, firstname, lastname and sex of all students
8. The idnum, firstname, lastname and sex of all female students
9. The idnum, firstname, lastname, sex and maritalstatus of all female married students
10. The firstname, lastname, maritalstatus of single and divorced students
11. The firstname, lastname, maritalstatus of those who are not single or divorced students
12. The idnum, firstname, lastname, schoolfee of all female students sorted by schoolfee
13. The lastname, firstname, schoolfee of students with schoolfee greater than $30,000 sorted by lastname and firstname
14. The lastname, firstname, maritalstatus of students with lastname starting with the letter C
15. The lastname, firstname, maritalstatus of students with lastname not starting with the letter C
16. The total schoolfee
17. The total schoolfee for each department
18. The total schoolfee for each department where totals exceed 30000
19. The total number of students
20. The average schoolfee

EXERCISE 4 - SELECT STATEMENT USING MORE THAN ONE TABLE

1. All fields and records in both tables
2. Firstname, lastname, department, deptname, depthead for all Students.
3. Firstname, lastname, department, deptname, depthead for all students in the CS, BA and HET departments.
4. Firstname, lastname, depthead, maritalstatus of all married students.
5. Firstname, Lastname, deptname of all students whose lastname ends with the letter E.
6. Firstname, lastname, deptname, schoolfee of all students with schoolfee between $50,000 and $80,000
7. Average schoolfee per deptname
8. Average schoolfee per deptname where the average is between $25,000 and $50,000.
9. Total number of students in each deptname
10. Total number of students in each deptname where the department has more than 2 students

EXERCISE 5 – DISTINCT, WILDCARD cont'd, SUB QUERY, CREATE INDEX, DROP TABLE, DROP INDEX

1. Display the departments in the students table. Display each one only once.
2. Display the lastnames of those with "a" as the second letter.
3. Display the names of all students whose schoolfee is more than the average schoolfee.
4. Display the names of the students whose schoolfee is more than the average schoolfee of those in the same department.
5. Display the names of the students who are below the average age.
6. Create an index called NAMEIDX on the students table. The index should be on lastname and firstname. Why would you need to do this?
7. Create a unique index called SEXIDX on the students table. The index should be on sex. Why do you get an error message?
8. Remove the index
9. Delete the table MorantBayDepts.

EXERCISE 6 – REVIEW OF ALL COMMANDS

WRITE DOWN THE SQL COMMANDS FOR THE FOLLOWING THEN EXECUTE THEM IN ORACLE/MYSQL. Writing the commands before executing them is good practice as you will not have the computer before you in the final examination.

(NB. Please prefix all tablenames, viewnames and indexnames with your initials. E.g. GCMOVIETYPES)

DATABASE FOR A VIDEO CLUB

1. Create a table called MOVIETYPES with the following fields:- typecode 3 characters, typename 25 characters. The primary key of the table is typecode.
2. Create a table called OTHERMTYPES with the same structure as MOVIETYPES.
3. Create a table called MOVIES with the following fields:- movienum integer, movietitle 20 characters, typecode 3 characters, producer 20 characters, rating 2 characters, cost 6 numbers with 2 decimal places, datepurchased date. The primary key is movienum, the field typecode should be the foreign key to the table called MOVIETYPES.
4. You forgot the director field, please add it to the MOVIES table, it is 25 characters long.
5. You no longer need the field producer, remove it from the MOVIES table.
6. You have realized that 20 characters is not enough for the movietitle, increase it to 30.
7. Add the following data to the MOVIETYPES table: [COM, Comedy], [HOR, Horror], [DRA, Drama], [TRA, Tragedy], [CAR, Cartoon].
8. Add the following data to the OTHERMTYPES table: [MUS, Musical], [COM, Comedy], [DOC, Documentary].
9. Add the following data to the MOVIES table. [123, Finding Nemo, CAR, G, 1500, 01-JAN-2005, DisneyPixar], [456, Incredibles, CAR, G, 1300, 03-MAR-2006, Pixar], [789, Pursuit of Happyness, DRA, M, 1000, 02-JAN-2007, Steven Speilberg], [111, Free Willy, DRA, G, 900, 01-JAN-1980, John Holt], [222, Dancing with wolves, DRA, R, 1300, 04-OCT-1990, Perry Mason].
10. Add 6 more of your own records to the MOVIES table.
11. Display all records and all fields in the MOVIES table.
12. Display all records and all fields in the MOVIETYPES table.
13. Display all fields in the MOVIES table for those records that are rated G.
14. Display the movienum, movietitle of all movies.
15. Display the first 5 letters of the movietitle of all movies.
16. Display the movietitle, cost, and cost * 10 of all movies.
17. Display the movienum, movietitle of all movies sorted by rating.
18. Display the movienum, movietitle of all movies sorted by rating in descending order.
19. Display the movietitles that end with the letter S.
20. Display the movienum, movietitle of all movietitles that start with the letter F.
21. Display the movienum, movietitle of all movietitles that start with the letter F and cost less than $2000.
22. Display the movienum, movietitle of all movietitles that start with the letter F or cost less than $2000.
23. Display the movietitle, cost of all movies that cost between $1200 and $1400.
24. Display the total cost of the movies.
25. Display the average cost of the movies.
26. Display the highest and lowest cost of the movies.
27. Display the total cost for each movie rating.
28. Display the total cost for each movie rating where totals exceed $4000
29. Display the total number of movies.
30. Display the typecodes in the MOVIES table. Display each typecode only once.
31. Display the movietitles of the movies whose cost is more than the average cost.
32. Display all fields and records in both tables.
33. Display the movietitle, typecode and typename of all movies.
34. Display the movietitle, typecode and typename of all movies with typecodes CAR, COM and HOR.
35. Display the movietitle, typecode and typename of all movies with typecodes CAR, COM and HOR. Include the typecodes from the MOVIETYPES table that did not have a match as well.
36. Change the director of movienum 111 to Robin Givens.
37. Change the price of the movienum 123 to $2500.
38. Increase the price of all movies to $1200 if the price is less than $1200.
39. Delete all movies that are rated R.
40. Display all records from both MOVIETYPES and OTHERMTYPES.
41. Display all records that are common to both MOVIETYPES and OTHERMTYPES.
42. Display the result of MOVIETYPES minus OTHERMTYPES.
43. Create an index called MTITLES on the MOVIES table. The index should be on movietitle.
44. Remove the index called MTITLES.
45. Remove the table called OTHERMTYPES
46. Create a view called MOVIEV on the MOVIES table. It should only contain movietitle and rating.
47. Display all of the data in MOVIEV.
48. Remove the view called MOVIEV.
49. Create another user. Give this user SELECT access to your tables.
50. Login as this user and display all fields and records in the tables.


UNIT IV: DISTRIBUTED DATABASES

Characteristics of a distributed database

A centralized system is one in which all of the data is located in a single database at a single site. Users can log in from any location to access the database. A distributed database is a database that is spread across a network of computers that are geographically dispersed and connected via communication lines. The database must have a single logical data model. A distributed database is under the control of a central database management system or distributed database management system (DDBMS) in which storage devices are not all attached to a common CPU. It can also be stored in multiple computers located in the same physical location. Examples are: SDD-1 by Computer Corporation of America, R* or System R by IBM Research, Distributed Ingres by the University of California at Berkeley.

Definition of logical database, local and global application, global intelligence

Logical database

Logical databases are programs that read data from database tables.

Users access the distributed database through:

• Local applications - applications which do not require data from other sites.

• Global applications - applications which do require data from other sites.

Global Intelligence

This is a DBMS that manages the distributed database. A distributed database works by using database links. A database link is a pointer that defines a one-way communication path from a database server to another database server. The link pointer is actually defined as an entry in a data dictionary table. To access the link, you must be connected to the local database that contains the data dictionary entry.

A database link connection is one-way in the sense that a client connected to local database A can use a link stored in database A to access information in remote database B, but users connected to database B cannot use the same link to access data in database A. If local users on database B want to access data on database A, then they must define a link that is stored in the data dictionary of database B.

A database link connection allows local users to access data on a remote database. For this connection to occur, each database in the distributed system must have a unique global database name in the network domain. The global database name uniquely identifies a database server in a distributed system.

Database server

Database servers are responsible for processing SQL queries that have been generated by the client process, and for returning the results of these queries back to the client process that made the request.

Client-server

A client-server architecture in a distributed database is a network architecture in which each computer or process on the network is either a client or a server or both. Database servers are powerful computers. Clients are PCs or workstations on which users run applications. Clients rely on database servers to process their queries. The user therefore uses a client application to run queries; each query is sent to the database server, which returns the result to the client.

Assessment of a distributed database versus a loose connection of independent sites

1. Data that makes up the logical database is stored at multiple sites connected by a network.

2. At least one application takes a global view of the data.

3. The global application accesses all sites at least once.

4. A global intelligence (i.e. a DBMS) exists over and above all the local intelligence (i.e. DBMSs). Its job is to manage the distributed database as a whole.

Terms and concepts used in distributed databases

Transparency - Does a user access all of the files in a system in the same manner, regardless of where they reside? Care must be taken with a distributed database to ensure that the distribution is transparent. In other words, users must be able to interact with the system as if it were one logical system. This applies to the system's performance and methods of access, amongst other things. The users should not need to know at which site any given piece of data is stored. In other words, a distributed system should look like a centralized system to the user. Transactions are transparent – each transaction must maintain database integrity across multiple databases. Transactions must also be divided into sub-transactions, each sub-transaction affecting one database system.

A DDBMS must provide certain transparency features, which serve to hide the complexities of the distributed database from the end user. In other words, the DDBMS should make the user think that he/she is working with a centralized database. These transparency features are listed below:

• Distribution transparency - this means the user does not need to know that the data is partitioned, that it is replicated or where it is located.

• Transaction transparency - this enables a transaction to update data at several locations. In addition, if not all of the locations can be updated, then the transaction is cancelled and the data reverts to its original state.

• Failure transparency - if one machine fails, the system should still continue to operate without the user being aware that something has gone wrong.

• Performance transparency - the performance of the system should not suffer because of the distributed design (in terms of network congestion etc.).

• Heterogeneity transparency - the system should allow the integration of various DBMS without the user being aware of all these issues.

Homogeneous distributed database – All of the sites use the same DBMS (e.g. Oracle).

Heterogeneous distributed database – Uses multiple DBMSs. In other words, the different sites do not have to use the same DBMS (e.g. Oracle, MS-SQL and PostgreSQL).

The data may be distributed in several ways using the following database concepts:

Fragmentation - Describes how a single table/file is divided among network sites.

There are three types of fragmentation:

a) Horizontal - contains all the attributes/fields and a subset of the tuples/rows/records

b) Vertical - contains a subset of the columns/fields/attributes and all the rows/records

c) Mixed – the table is fragmented both horizontally and vertically (in other words, subsets of rows and columns).
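The three fragmentation types can be sketched in Python with a table held as a list of dictionaries. The sample rows, the column names and the two helper functions are made up for illustration. The sketch also assumes that every vertical fragment keeps the key column (`trn`) so that the fragments can be joined back together.

```python
# Hypothetical employee table (one dict per row).
employees = [
    {"trn": 101, "name": "Mary", "site": "Kingston", "salary": 50000},
    {"trn": 102, "name": "Karen", "site": "Montego Bay", "salary": 62000},
    {"trn": 103, "name": "Brown", "site": "Kingston", "salary": 48000},
]


def horizontal_fragment(rows, predicate):
    """All columns, but only the rows that satisfy the predicate."""
    return [dict(row) for row in rows if predicate(row)]


def vertical_fragment(rows, columns, key="trn"):
    """All rows, but only the chosen columns (plus the key column)."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in rows]


# a) Horizontal: every field, Kingston rows only.
kingston = horizontal_fragment(employees, lambda r: r["site"] == "Kingston")

# b) Vertical: every row, payroll fields only.
payroll = vertical_fragment(employees, ["name", "salary"])

# c) Mixed: a subset of rows AND a subset of columns.
kingston_payroll = vertical_fragment(kingston, ["salary"])
```

Here `kingston` holds two complete rows, `payroll` holds three rows with only `trn`, `name` and `salary`, and `kingston_payroll` holds two rows with only `trn` and `salary`.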

Table Replication - Determines the distribution of tables around the network. Some tables exist at only one site, while others have been duplicated at various sites (e.g. frequently used files that are basically static – such as a code file).

Reasons for replication

a) To maximize local availability of data

b) To provide backup copies of tables in case a particular network fails.

Replication can introduce integrity problems. For example, data can be changed at one site while the duplicate at another site has not yet been changed. For frequently updated tables, replication degrades database performance, as all copies of the table must be updated regularly to maintain integrity. Three replication conditions exist: full replication, partial replication or no replication.

• Full replication - all database fragments are replicated.

• Partial replication - only some of the database fragments are replicated.

• No replication - each database fragment is stored at a single site only; no duplicate copies exist.
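The integrity problem that replication introduces can be shown with a short Python sketch. Each site's copy of a table is modelled as a dictionary. The site names, the key `TRN-1` and the helper functions are hypothetical, and `propagate` simply overwrites the targets with the source copy, which is a simplification of real replication protocols.

```python
# Two sites start with identical copies of the same table.
site_a = {"TRN-1": "Mary Brown"}
site_b = dict(site_a)


def replicas_consistent(*sites):
    """True when every copy holds exactly the same data."""
    return all(site == sites[0] for site in sites)


def propagate(source, *targets):
    """Bring each target copy in line with the source copy."""
    for target in targets:
        target.clear()
        target.update(source)


before = replicas_consistent(site_a, site_b)       # copies agree
site_a["TRN-1"] = "Mary Campbell"                  # update reaches site A only
diverged = not replicas_consistent(site_a, site_b)  # copies now disagree
propagate(site_a, site_b)                          # update is sent to site B
after = replicas_consistent(site_a, site_b)        # copies agree again
```

Between the update and the call to `propagate`, a query at site B would still return the old name. That window is the integrity risk described above.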

Allocation - combines fragmentation and replication.

Advantages and disadvantages of a distributed database

Advantages

• Reflects organizational structure – database fragments are located in the departments they relate to

• Local Processing and Autonomy - Allows local groups (departments) to have control over their own data. Certain processing can go on at one site and other processing at other sites thereby speeding up processing. (Parallel processing).

• Cost Reduction/Economics - Less data needs to be transmitted, so communication costs go down, because data is stored close to the locations where it originates. It also costs less to create a network of smaller computers with the power of a single large computer.

• Data and load sharing – Each site does its own processing rather than overloading one site. This leads to improved performance – data is located near the site of greatest demand, and the database systems themselves are parallelized, allowing load on the databases to be balanced among servers. (A high load on one module of the database won’t affect other modules of the database in a distributed database.)

• Improved Availability and Reliability - If one site fails, data may be on another site. A fault in one database system will only affect one fragment, instead of the entire database.

• Security - If a site is lost to fire or sabotage, the data may still be available at another site.

• Capacity and incremental growth - No single machine may be able to hold all of the data. If it becomes necessary to expand the system, it is easier to add a new computer than to upgrade one computer.

• Efficiency and flexibility – If data is stored close to its normal point of use then response times and communication cost will be reduced.

• Modularity – systems can be modified, added and removed from the distributed database without affecting other modules (systems).

Disadvantages

• Distributed execution - The distributed DBMS needs to synchronize and control processes on the various computers on the network. It is difficult to maintain integrity because enforcing integrity over a network may require too many network resources to be feasible.

• Distributed transaction management is hard to control - Concurrency control and recovery mechanisms are needed to process updates across the network and to restore consistency after a crash. It is harder to recover from backups. A difficulty may arise if one site holding a copy is not available at the time of the update. One solution is to designate one copy as the primary copy; the site holding it is responsible for broadcasting the updates.

• Catalog management is more difficult. The database catalog consists of metadata in which definitions of database objects such as tables, views (virtual tables), indexes, and user groups are stored.

• Distributed DBMS schema management is very difficult - A distributed DBMS needs data about the distributed database to manage it. Such schemas must be stored and managed in a distributed fashion - very difficult.

• Complexity — Extra work must be done by the database administrator (DBA) to ensure that the distributed nature of the system is transparent. Extra work must also be done to maintain multiple disparate systems, instead of one big one. Extra database design work must also be done to account for the disconnected nature of the database — for example, joins become prohibitively expensive when performed across multiple systems.

• Economics — Increased complexity and a more extensive infrastructure means extra labour costs.

• Security — Remote database fragments must be secured, and they are not centralized so the remote sites must be secured as well. The infrastructure must also be secured (e.g. by encrypting the network links between remote sites).

• Inexperience — distributed databases are difficult to work with, and as a young field there is not much readily available experience on proper practice.

Practice Questions

1. Hewlett Limited has a distributed database. One of their sites burnt to the ground. What advantage does Hewlett Limited have in this case?

2. PQHG Limited has millions and millions of records in their database. These records need to be processed. Do you think it is better to place all of the records on one computer to be processed or is it better to let several computers share the load?

3. Geo Systems Limited has a table that contains the fields TRN, name, address, gender and date of birth. The table is duplicated on two different sites. Mary got married and changed her last name. Karen changed her address. Mary’s name change was made at Site A by the site manager, but he could not make the update on Site B because of a network problem. Karen’s address was changed at Site B by the site manager but he could not make the same change on Site A because of the same network problem. That night, both sites did a backup. The next morning both systems crashed. The database administrator now needs to do a restore. Which version of the table is the correct one?

4. Osbourne Inc has 2 sites, one in Kingston and the other in Montego Bay. The distributed database has a table with the fields TRN, name, address, gender, occupation and salary. The fields TRN, name, address and gender are located in Kingston while TRN, name, occupation and salary are located in Montego Bay. The payroll officer, Mr. Brown, who deals with salaries is located in Montego Bay. He needs to create 2 queries. Query 1 shows names and addresses of employees and Query 2 shows names and salaries of employees. Which query does Mr. Brown need to use the network for? Which query allows Mr. Brown to access files locally? Should there be a difference in the way he runs or accesses either query? What is transparency? Mr. Brown executes Query 2 very often and Query 1 very rarely. Would you redistribute the fields or do you feel that the existing location is fine?

5. Which do you think is cheaper, a distributed database or a non-distributed (centralized) database? Give reasons for your answer.

6. What are the advantages of a distributed database?

7. What are the disadvantages of a distributed database?

8. How does a distributed database work?

Data warehouse

The need for data analysis

Organizations tend to grow and prosper as they gain a better understanding of their environment. Typically, business managers must be able to track daily transactions to evaluate how the business is performing. By tapping into the operational database, management can develop strategies to meet organizational goals. In addition, data analysis can provide information about short-term tactical evaluations and strategies such as: Are our sales promotions working? What market percentage are we controlling? Are we attracting new customers?

Tactical and strategic decisions are also shaped by constant pressure from external and internal forces, including globalization, the cultural and legal environment and, perhaps most important, technology. Given the many and varied competitive pressures, managers are always looking for competitive advantages through product development, service, marketing and so on. Managers understand that their business climate is very dynamic, thus mandating their prompt reaction to change in order to remain competitive. In other words, the decision making cycle time is reduced. In addition, the modern business climate requires managers to approach increasingly complex problems based on a rapidly growing number of internal and external variables. There is therefore growing interest in creating support systems dedicated to facilitating quick decision making in a complex environment.

Different managerial levels have different decision support needs. For example, transaction processing systems, based on operational databases, are tailored to serve the information needs of people who deal with short term inventory, accounts payable or purchasing. Middle level managers, general managers, vice-presidents and presidents focus on strategic and tactical decision making. Such managers require detailed information designed to help them make decisions in a complex data and analysis environment.

Data warehousing

Downloading does move data closer to the user and thereby increase its potential utility. Unfortunately, while one or two download sites can be managed without a problem, if every department wants to have its own source of downloaded data, the management problems become immense. Accordingly, organizations began to look for some means of providing a standardized service for moving data to the user and making them more useful. That service is called data warehousing.

What is a data warehouse?

A data warehouse (DW) is a huge database that stores and manages the data required to analyze historical and current transactions. A data warehouse contains a wide variety of data that present a coherent picture of business conditions at a single point in time. A data warehouse includes not only data but also tools, procedures, training, personnel and other resources that make access to the data easier and more relevant to decision makers. The goal of the data warehouse is to increase the value of the organization’s data asset. It typically has a user-friendly interface so users can easily interact with its data. It is designed to support management decision making. Through a data warehouse, managers and other users access transactions and summaries of transactions quickly and efficiently. The databases in a data warehouse usually are quite large. Development of a data warehouse includes development of systems to extract data from operational systems plus installation of a warehouse database system that provides managers flexible access to the data.

Figure 1 – A Data Warehouse (DW)

The role of the data warehouse is to store extracts from operational data and make them available to users in a useful format. The data can be extracts from databases and files, but can also be document images, recordings, photos and other non-scalar data. The source data could also be purchased from other organizations. The data warehouse stores the extracted data and also combines it, aggregates³ it, transforms it and makes it available to users via tools that are designed for analysis and decision making such as OLAP (see section “What is On-line analytical processing (OLAP)?” below).

Evolution of the data warehouse

The origins of today’s Data Warehouses can be traced to the reporting systems that were popular in the 1980s. These reporting systems provided some basic answers to the end user’s questions, although the format wasn’t always the most appropriate. The reporting systems that formed the foundation of basic decision support required direct access to the operational data through a menu interface to yield predefined report structures. Typically, the reporting system was front-ended by a text-only presentation tool.

The next development stage produced a sophisticated form of decision support by supplying lightly summarized data extracted from the operational database. Such lightly summarized data were usually stored in an RDBMS and were accessed through SQL statements via a query tool. The SQL-based query tool provided some predefined reports and, better yet, some ad hoc query capability. Unfortunately, to use the queries the end user had to know the details of the underlying data structure. The presentation tool was similar to the one used by the original reporting system, but it did provide additional customization options for ad hoc reports. A variation on this theme of greater end user empowerment was the use of spreadsheets or statistical packages to analyze operational data. End users used their own desktop tools to access and manipulate data in order to support their decision making process.

3 A collection of, or the total of, disparate elements

Primitive as they were by current standards, these reporting systems and their extensions gave IS departments the first major tools with which to solve decision support problems. Given advances in hardware and software in the late 1980s and early to mid-1990s, the explosion of available operational data, and the growing sophistication of decision support systems, data warehouse developments were almost inevitable.

Differences between data warehouse and operational database

Characteristic | Operational database data | Data warehouse data
Integrated | Similar data can have different representations or meanings. | Provide a unified view of all data elements with a common definition and representation for all departments.
Subject-Oriented | Data are stored with a functional or process orientation (for example, invoices, credits, debits etc). | Data are stored with a subject orientation that facilitates multiple views for data and decision making (e.g. sales, products, sales by products etc.)
Time-Variant | Data represent current transactions (e.g. the sales of a product on a given date). | Data are historic in nature. A time dimension is added to facilitate data analysis and time comparisons.
Non-volatile | Data updates and deletes are very common. | Data cannot be changed. Data are only added periodically from operational systems. Once data are stored, no changes are allowed.

Components of a data warehouse

• Data extraction tools

• Extracted data

• Metadata⁴ of warehouse contents

• Warehouse DBMS(s) and OLAP (online analytical processing) servers

• Warehouse data management tools

• Data delivery programs

• End-user analysis tools

• User training courses and materials

• Warehouse consultants

The source of the warehouse is operational data, that is, data generated from routine transaction processing systems such as sales, registration of a student, payroll, banking deposits/withdrawals etc. The data warehouse therefore needs tools for extracting the data and storing them. These data, however, are not useful without metadata, which describe the nature of the data, their origins, their format, limits on their use and other characteristics of the data that influence the way they can and should be used.

4 Data about the data, such as field names, field types, validation rules etc.

Potentially, the data warehouse contains billions of bytes of data in many different formats. Accordingly, it needs DBMS and OLAP servers of its own to store and process the data. In fact, several DBMS and OLAP products may be used, and the features and functions of these may be augmented by additional in-house developed software that reformats, aggregates⁵, integrates and transfers data from one processor to another within the data warehouse. Programs may also be needed to store and process non-scalar data like graphics and animations.

Because the purpose of the data warehouse is to make organizational data more available, the warehouse must include tools not only to deliver the data to the users but also to transform the data for analysis, query and reporting, and OLAP for user-specified aggregation and disaggregation.

The data warehouse provides an important but complicated set of resources and services. Hence the warehouse needs to include training courses, training materials, on-line help utilities and other similar training products to make it easy for users to take advantage of the warehouse resources. Finally, the data warehouse includes knowledgeable personnel who can serve as consultants.

User requirements for a data warehouse

The requirements for a data warehouse are different from the requirements for a traditional database application. For one, in a typical database application, the structure of reports and queries is standardized. While the data in a report or query may vary from month to month, for instance, the structure of the report or query stays the same. Data warehouse users, on the other hand, often need to change the structure of queries and reports.

Another difference is that users want to do their own data aggregation⁶. For example, a user who wants to investigate the impact of different marketing campaigns may want to aggregate product sales according to package color at one time; according to marketing program at another time; according to package color within marketing program at a third time. The analyst wants the same data in each report, but simply presents it differently.
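User-defined aggregation of this kind can be sketched in Python. The sales rows, the column names and the `aggregate` helper are all hypothetical. The point is that one function regroups the same data by whichever columns the analyst chooses.

```python
from collections import defaultdict

# Hypothetical product sales, one dict per sale.
sales = [
    {"color": "red", "program": "TV ad", "amount": 100},
    {"color": "red", "program": "Flyer", "amount": 40},
    {"color": "blue", "program": "TV ad", "amount": 70},
    {"color": "blue", "program": "Flyer", "amount": 30},
    {"color": "red", "program": "TV ad", "amount": 60},
]


def aggregate(rows, *keys):
    """Total the 'amount' column, grouped by the given key columns."""
    totals = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += row["amount"]
    return dict(totals)


by_color = aggregate(sales, "color")                 # package color
by_program = aggregate(sales, "program")             # marketing program
by_program_color = aggregate(sales, "program", "color")  # color within program
```

All three results are built from the same five rows, so each of them sums to the same grand total of 300. Moving from `by_program` to `by_program_color` is a small example of drilling down from a summary into more detail.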

Data warehouse users also want to disaggregate their data in their own terms, or drill down into their data. For example, a user may be presented with a screen that shows total product sales for a given year. The user may then want to be able to click on the data and have them explode into sales by month; to click again and have the data explode into sales by product by month or sales by region by product by month.

Graphical output is another common requirement. Users want to see results of geographic data in geographic form. Sales by state and province should be shown on a map. A reshuffling of employees and offices should be shown on a diagram of office space. These requirements are more difficult because they vary from user to user and from task to task.

Many users of data warehouse facilities want to import warehouse data into domain-specific programs. For example, financial analysts want to import data into their spreadsheet models and into more sophisticated financial analysis programs. Portfolio managers want to import data into portfolio management programs, and oil drilling engineers want to import data into seismic analysis programs. All of this importing usually means that the warehouse data needs to be formatted in specific ways.

5 A collection of, or the total of, disparate elements

6 To collect or total disparate elements

Rules for defining a data warehouse

The following list is made up of 12 rules that define a data warehouse. This list was created by William H. Inmon and Chuck Kelley in 1994.

1. The data warehouse and operational environments are separated.

2. The data warehouse data are integrated.

3. The data warehouse contains historical data over a long time horizon.

4. The data warehouse data are snapshot data captured at a given point in time.

5. The data warehouse data are subject-oriented.

6. The data warehouse data are mainly read-only, with periodic batch updates from operational data. No online updates are allowed.

7. The data warehouse development life cycle differs from classical systems development. The data warehouse development is data driven; the classical approach is process driven.

8. The data warehouse contains data with several levels of detail: current detail data, old detail data, lightly summarized data, and highly summarized data.

9. The data warehouse environment is characterized by read-only transactions to very large data sets. The operational environment is characterized by numerous update transactions to a few data entities at a time.

10. The data warehouse environment has a system that traces data sources, transformations and storage.

11. The data warehouse’s metadata⁷ are a critical component of this environment. The metadata identify and define all data elements. The metadata provide the source, transformation, integration, storage, usage, relationships, and history of each data element.

12. The data warehouse contains a charge-back mechanism for resource usage that enforces optimal use of the data by end users.

The 12 rules capture the data warehouse life cycle, from its introduction as an entity separate from the operational data store, to its components, functionality, and management processes. The current generation of specialized decision support systems provides a comprehensive infrastructure to design, develop, implement and use decision support systems within an organization.

Data mart

Some organizations decide to limit the scope of the warehouse to more manageable chunks. A data mart is a smaller version of a data warehouse, containing a database that helps a specific group or department make decisions. Marketing and sales departments may have their own separate data marts. Individual groups or departments often extract data from the data warehouse to create their data marts.

7 Data about the data, such as field names, field types, validation rules etc.

Restricting a data mart to a particular type of data makes the management of the data warehouse simpler and probably means that an off-the-shelf DBMS product can be used to manage the data warehouse. Metadata⁸ is also simpler and easier to maintain.

A data mart that is restricted to a particular business function, such as marketing analysis, may have many types of data and metadata to maintain, but all of those data serve the same type of users. Tools for managing the data warehouse and for providing data to the users can be written with an eye toward the requirements that marketing analysts are likely to have.

A data mart that is restricted to a particular business unit or geographical area may have many types of input and many types of users, but the amount of data to be managed is less than for the entire company. There will also be fewer requests for service, so the data warehouse resources can be allocated to fewer users.

The following diagram summarizes the scope of alternatives for sharing data. Data downloading is the smallest and easiest alternative. Data are extracted from operational systems and delivered to particular users for specific purposes. The downloaded data are provided on a regular and recurring basis, so the structure of the application is fixed, the users are well trained, and problems such as timing and domain inconsistencies are unlikely to occur because users gain experience working with the same data. At the other extreme, a data warehouse provides extensive types of data and services for both recurring and ad hoc requests. Data marts fall in the middle. As we move from left to right, the alternatives become more powerful but also more expensive and difficult to create.

Figure 2 - Continuum of Enterprise Data Sharing (from easier to more difficult: Data Downloading; Data Marts restricted to particular data inputs, particular business functions, or a particular business unit or geographical region; Data Warehouse)

On-line analytical processing

What is On-line analytical processing (OLAP)?

OLAP refers to an advanced data analysis environment that supports decision making, business modelling, and operations research activities. OLAP systems share four major characteristics:

1. Use multidimensional data analysis techniques

2. Provide advanced database support

3. Provide easy-to-use end user interfaces

4. Support client/server architecture

OLAP is an approach to quickly answer multi-dimensional analytical queries. OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining. The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas. The term OLAP was created as a slight modification of the traditional database term OLTP (Online Transaction Processing).

8 Data about the data, such as field names, field types, validation rules etc.

Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad-hoc queries with a rapid execution time. They borrow aspects of navigational databases and hierarchical databases that are faster than relational databases. The following shows the difference between the operational view of sales data and the multidimensional view of sales data.

Operational View

INVOICE Table

Number | Date | Customer | Amount
2034 | 15/5/96 | Dartonik | $3500
2035 | 15/5/96 | INC | $1800
2036 | 16/5/96 | Dartonik | $2000
2037 | 16/5/96 | INC | $800

LINE Table

Number | Product | Price | Quantity
2034 | Mouse | $150 | 20
2034 | Diskette | $50 | 10

Multidimensional View

Customer Dimension (rows) by Time Dimension (columns)

Customer | 15/5/96 | 16/5/96 | Totals
Dartonik | $3500 | $2000 | $5500
INC | $1800 | $800 | $2600
Totals | $5300 | $2800 | $8100

Sales figures occur at the intersection of a customer row and time column.
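The step from the operational view to the multidimensional view can be sketched in Python using the four INVOICE rows shown above. The `pivot` function and the variable names are assumptions made for this illustration; a real OLAP server would do this with specialized storage and query engines.

```python
# INVOICE rows from the operational view: (number, date, customer, amount).
invoices = [
    (2034, "15/5/96", "Dartonik", 3500),
    (2035, "15/5/96", "INC", 1800),
    (2036, "16/5/96", "Dartonik", 2000),
    (2037, "16/5/96", "INC", 800),
]


def pivot(rows):
    """Place each sales amount at the (customer, date) intersection."""
    cube = {}
    for _number, date, customer, amount in rows:
        cell = (customer, date)
        cube[cell] = cube.get(cell, 0) + amount
    return cube


cube = pivot(invoices)

# Row totals (per customer) and column totals (per date).
customer_totals = {}
date_totals = {}
for (customer, date), amount in cube.items():
    customer_totals[customer] = customer_totals.get(customer, 0) + amount
    date_totals[date] = date_totals.get(date, 0) + amount

grand_total = sum(cube.values())
```

The totals computed here match the Totals row and column of the multidimensional view: 5500 and 2600 per customer, 5300 and 2800 per date, and 8100 overall.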

Practice Questions

1. What is the difference between operational data and a data warehouse?

2. Explain the components of a data warehouse.

3. What is OLAP?

4. Draw an example of a Multidimensional View of the data in the Education data warehouse.

Data mining

Often, the database is distributed. Data warehouses often use a process called data mining to find patterns and relationships among data. For example, a state government could mine its data to check whether the number of births has a relationship to income level. Many e-commerce sites use data mining to determine customer preferences.

Examples of data mining findings can be:

• 65% of customers who did not use their credit card in the last six months are 88% likely to cancel their account

• 82% of customers who bought a new TV 27” or larger are 90% likely to buy an entertainment center within the next four weeks

• If age < 30 and income <= 25000 and credit rating < 3 and credit amount > 25000 then the minimum loan term is 10 years.
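Findings of the "customers who bought X are likely to buy Y" kind are often expressed as association rules, measured by support and confidence. The Python sketch below computes both over a made-up list of purchases. The basket contents and function names are illustrative only, and this is just one of many data mining techniques.

```python
# Hypothetical purchase histories, one set of items per customer.
baskets = [
    {"tv", "entertainment center"},
    {"tv", "entertainment center"},
    {"tv"},
    {"radio"},
]


def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent. Returns 0.0 if the antecedent never occurs."""
    base = support(baskets, antecedent)
    if base == 0:
        return 0.0
    both = support(baskets, set(antecedent) | set(consequent))
    return both / base


tv_support = support(baskets, {"tv"})
rule_confidence = confidence(baskets, {"tv"}, {"entertainment center"})
```

In this toy data three of the four customers bought a TV, and two of those three also bought an entertainment center, so the rule "TV implies entertainment center" has a confidence of two thirds.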

Transactions – Atomic, Consistent, Isolated, Durable (ACID)

An understanding of transactions is essential to the database designer especially if he/she is designing a multiuser database. A transaction may be defined as being a group of data modifications that must be performed entirely or not at all. All transactions must adhere to the ACID test:

• Atomic - this property states that the transaction must be completed in its entirety or not at all.

• Consistent - this property states that the transaction should never leave the database in an inconsistent state. This property ensures that the integrity rules and business rules are not violated.

• Isolated - this property states that the data that is being used by a transaction is not accessible to other transactions until the transaction has been completed.

• Durable - this property states that the data modification is permanent once the transaction has been completed and if the transaction is not completed then the system should remain in its original state.
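Atomicity can be seen in a small funds transfer. The sketch below uses Python's built-in `sqlite3` module rather than the Oracle system used in this course; the `account` table, the balances and the `transfer` function are assumptions made for illustration. The second transfer fails halfway through, and the DBMS undoes the part that had already been done.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_number TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A-101", 500), ("A-102", 200)])
conn.commit()

def transfer(conn, source, target, amount):
    """Move `amount` between accounts as one transaction: both updates or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? "
                         "WHERE account_number = ?", (amount, target))
            conn.execute("UPDATE account SET balance = balance - ? "
                         "WHERE account_number = ?", (amount, source))
        return True
    except sqlite3.IntegrityError:
        return False  # CHECK constraint failed; the credit above was undone too

def balances(conn):
    """Return {account_number: balance} for every account."""
    return dict(conn.execute("SELECT account_number, balance FROM account"))

ok = transfer(conn, "A-101", "A-102", 100)       # succeeds
failed = transfer(conn, "A-102", "A-101", 1000)  # would overdraw A-102
print(ok, failed, balances(conn))
```

The `CHECK` constraint plays the role of a business rule (consistency), and the rollback shows the "entirely or not at all" behaviour (atomicity).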

Concurrency control

The concept of concurrency control is very important when designing multiuser databases. Concurrency control is the process of coordinating the simultaneous execution of transactions within a multiuser environment. The simultaneous execution of transactions becomes problematic only if the transactions are attempting to access or modify the same data. If concurrency control is not enforced at this point, then data inconsistencies may occur during the process of data modification. The concept of isolation is what makes concurrency control possible. Remember, isolation states that a transaction has exclusive rights to the data being modified.

Conflict Table

Transaction 1    Transaction 2    Result
Read             Read             No conflict
Read             Write            Conflict
Write            Read             Conflict
Write            Write            Conflict
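The conflict table reduces to one rule: two operations on the same data item conflict unless both are reads. A short Python sketch of that rule follows; the function name `conflicts` is an illustrative assumption.

```python
def conflicts(op1, op2):
    """Two operations on the same data item conflict unless both are reads."""
    return "write" in (op1.lower(), op2.lower())

# Reproduce the conflict table row by row.
for op1 in ("Read", "Write"):
    for op2 in ("Read", "Write"):
        result = "Conflict" if conflicts(op1, op2) else "No conflict"
        print(f"{op1:<8}{op2:<8}{result}")
```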

Lock Level

In order to accomplish isolation the DBMS makes it possible to perform a lock on a data item. A lock is a mechanism that guarantees exclusive use of a data item. We can have several types of locks:

• Database locks - all the tables within the database are exclusive to the current transaction.

• Table locks - all the rows and columns within a table are exclusive to the current transaction.

• Row locks - the selected rows are exclusive to the current transaction.

• Column locks - the selected columns are exclusive to the current transaction.

Lock Type

Irrespective of the lock level, the DBMS may impose different lock types on the data item. The most common are exclusive locks and shared locks. The simplest scheme is the binary lock, which has only two states: locked or unlocked. With this method each transaction must impose a lock on the data item being accessed and must release the lock once the transaction has been completed. An exclusive lock used on its own behaves as a binary lock; adding shared locks gives a third state that lets several readers work at the same time.

• An exclusive lock exists when the data item is available only to a single transaction. The problem with an exclusive lock is that the DBMS will not allow two or more transactions to access the same data item at the same time, even for reading.

• A shared lock is one that allows two or more transactions to access the same data item for reading purposes.
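The difference between shared and exclusive locks can be sketched with a very small lock table. The class below is a teaching sketch only: the name `LockManager` and its methods are assumptions, and a real DBMS would also queue waiting transactions and handle deadlocks.

```python
class LockManager:
    """Tracks shared and exclusive locks per data item (no queuing, no deadlock handling)."""

    def __init__(self):
        self._locks = {}  # item -> (mode, set of transaction ids holding the lock)

    def acquire(self, txn, item, mode):
        """Try to lock `item` in 'shared' or 'exclusive' mode; return True if granted."""
        if mode not in ("shared", "exclusive"):
            raise ValueError(f"unknown lock mode: {mode}")
        held = self._locks.get(item)
        if held is None:
            self._locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        if mode == "shared" and held_mode == "shared":
            holders.add(txn)  # any number of readers may share the item
            return True
        return False  # an exclusive lock is involved, so the request must wait

    def release(self, txn, item):
        """Release the lock `txn` holds on `item`; the item unlocks when no holders remain."""
        held = self._locks.get(item)
        if held is None:
            return
        _mode, holders = held
        holders.discard(txn)
        if not holders:
            del self._locks[item]

locks = LockManager()
print(locks.acquire("T1", "row 7", "shared"))
print(locks.acquire("T2", "row 7", "shared"))     # readers can share the item
print(locks.acquire("T3", "row 7", "exclusive"))  # refused while readers hold it
```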

Transaction Logs

The DBMS uses a transaction log to keep track of all the data modifications, which are performed by each transaction. The DBMS will then use this information to ensure that each transaction is durable (made permanent). A typical transaction log will store the following pieces of information:

• The start of the transaction

• The name of the table being modified

• The primary key of the record being modified

• The field that is being modified

• The before and after values of the field being modified

• The end of the transaction

When a system failure occurs, the transaction log is checked to see which transactions were completed and which were not. If a transaction was completed, the DBMS ensures durability by making the after values permanent. If a transaction was not completed, the DBMS restores the before values, so that the system remains in its original state.
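The recovery rule (keep the after values of completed transactions, restore the before values of incomplete ones) can be sketched as follows. The log layout, the `recover` function and the sample values are assumptions for illustration; real DBMS logs are considerably more detailed.

```python
def recover(database, log):
    """Redo completed transactions and undo incomplete ones, using before/after values."""
    completed = {entry["txn"] for entry in log if entry["type"] == "end"}
    # Redo: replay the changes of completed transactions in log order.
    for entry in log:
        if entry["type"] == "update" and entry["txn"] in completed:
            database[entry["table"]][entry["key"]][entry["field"]] = entry["after"]
    # Undo: restore the before values of incomplete transactions, newest first.
    for entry in reversed(log):
        if entry["type"] == "update" and entry["txn"] not in completed:
            database[entry["table"]][entry["key"]][entry["field"]] = entry["before"]
    return database

log = [
    {"type": "start", "txn": "T1"},
    {"type": "update", "txn": "T1", "table": "account", "key": "A-101",
     "field": "balance", "before": 500, "after": 400},
    {"type": "end", "txn": "T1"},
    {"type": "start", "txn": "T2"},
    {"type": "update", "txn": "T2", "table": "account", "key": "A-102",
     "field": "balance", "before": 200, "after": 900},
    # System failure here: T2 never wrote its end record.
]

# State found after the crash: T1's change was lost, T2's change was half-written.
database = {"account": {"A-101": {"balance": 500}, "A-102": {"balance": 900}}}
recover(database, log)
print(database)
```

The sketch assumes that isolation kept the two transactions from touching the same field; without that guarantee the redo and undo passes would need to be ordered more carefully.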

UNIT V: SECURITY ISSUES

The role of the Data Dictionary

The DBMS makes use of descriptions of data items provided by the DDL. This is data about data (metadata). Metadata describes the structure and format of the data and of the overall database. System tables store metadata. Contents include: number of tables and table names, number of fields and field names, field types, field lengths, key fields, field descriptions, files, cross references, error checks (e.g. range checks), etc. The data dictionary (DD) helps a database user in:

• Communicating with other users

• Controlling data elements (add fields, change descriptions, formatting). Maintaining standards.

• Determining the impact of changes to data elements on the total database

• Centralizing the control of data elements as an aid in database design and in expanding the design.

• Data validation
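Most DBMSs let you query the data dictionary like any other table. The sketch below uses SQLite from Python because it runs without a server: `sqlite_master` and `PRAGMA table_info` are SQLite's own system catalogue, while the `student` table is a made-up example. Oracle exposes the same kind of information through dictionary views such as USER_TABLES and USER_TAB_COLUMNS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student ("
             "student_id CHAR(6) PRIMARY KEY, "
             "last_name VARCHAR(20) NOT NULL, "
             "gpa REAL)")

# sqlite_master is SQLite's system table listing every table, index and view.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per field:
# (position, name, declared type, not-null flag, default value, primary-key flag)
fields = conn.execute("PRAGMA table_info(student)").fetchall()

print(tables)
for _cid, name, type_, notnull, _default, pk in fields:
    print(name, type_, "required" if notnull else "optional", "key" if pk else "")
```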

What is data security?

In the computer industry, data security refers to techniques for ensuring that data stored in a computer cannot be read or compromised by any individuals without authorization. Most security measures involve data encryption and passwords. Data encryption is the translation of data into a form that is unintelligible without a deciphering mechanism. A password is a secret word or phrase that gives a user access to a particular program or system. [Research – Protection vs Security]

What are Security Risks?

A computer or data security risk is any event or action that could cause a loss of or damage to computer hardware, software, data, information, or processing capability.

Security risks fall into 6 main categories, as follows:

• Human error

• Technical error

• Virus, worm, Trojan horse

• Natural disasters etc.

• Unauthorized use and access

• Theft and vandalism

Sources of incorrect data:

• Accidents - mistyping input or programming errors

• Malicious use of the database

• System problems - disk crash etc.

Database protection involves:

• Integrity preservation - concerns non-malicious errors and their prevention.

• Security (Access control) - concerned with restricting certain users so they are allowed to access and/or modify only a subset of the database.

Security risks and their effects

1. Human error

Humans make mistakes. Examples of mistakes made include:

• Deleting a file by accident

• Formatting a hard drive

• Adding data twice

• Entering incorrect data

• Misuse of the computer by someone who is not adequately trained/experienced (e.g. a young child)

The effects of human error include:

• Loss of data

• Less data integrity (incorrect data) therefore incorrect information will be retrieved

• Physical damage to computer due to improper use

2. Technical error

A technical error is a system failure. The failure could be because of either hardware, software or both. Examples include:

• Hard disk crashing

• Missing or corrupted files (e.g. due to not shutting down properly etc.)

• Computer not booting

• Drives (diskette, CD), not working (e.g. due to dust)

The effects of technical error include:

• Loss of data

• Loss of time in having to re-enter data

• The inability to use certain devices

3. Virus

A virus is a computer program that is designed to replicate itself by copying itself into the other programs stored in a computer. It may be benign or have a negative effect, such as causing a program to operate incorrectly or corrupting a computer's memory. In addition to replication, some computer viruses share another commonality: a damage routine that delivers the virus payload. A virus's payload is an action it performs on the infected computer.

The effects of viruses include:

• The computer cannot boot because a boot sector virus has corrupted the boot sector

• Files are erased by the virus

• Hard drive is formatted (all files are therefore lost)

• Files are corrupted by the virus

• Consumption of storage space and memory

• Degrading performance of the computer

It's important to remember that most viruses aren't programmed with destructive intentions. Most simply reproduce without any destructive attack. However, these viruses can cause damage to your files, particularly since many of the viruses are poorly written programs that can cause unintended software conflicts. At the very least, viruses are intrusive applications that steal storage and CPU cycles without your permission.

Most people's worst virus fear is having their hard drive erased, but those who regularly create back-up versions of important data could recover within a few hours. Viruses that subtly corrupt data are potentially much more destructive - computer users may not notice their presence until a great deal of data has been ruined. Some viruses insert random numbers in spreadsheet applications or system files, or add typos to word processing documents. One particularly nasty virus posted confidential documents in the user's name to Internet newsgroups. [Research – the different types of viruses]

4. Natural disasters etc

Disasters can cause physical damage to computers, thereby causing loss of the data on the computers.

Examples of disasters (natural and otherwise) include:

• Earthquake

• Hurricane

• Fire

• Flood

• Lightning

• Power surge, low voltage

• Rats, roaches, insects etc.

The effects of disasters include:

• Physical damage to computer

• Loss of data

• Repair bills

5. Unauthorized access and use

Unauthorized access is the use of a computer or network without permission.

Unauthorized access includes:

• Hacker/cracker – Hacker is a slang term for a computer enthusiast, i.e. a person who enjoys learning programming languages and computer systems and can often be considered an expert on the subject(s). Depending on how it is used, the term can be either complimentary or derogatory, although it is developing an increasingly derogatory connotation. The pejorative sense of hacker is becoming more prominent largely because the popular press has co-opted the term to refer to individuals who gain unauthorized access to computer systems for the purpose of stealing and corrupting data. Hackers maintain that the proper term for such individuals is cracker.

• A person accessing someone else’s bank account, email, medical records etc without permission.

Unauthorized use is the use of a computer or its data for unapproved or possibly illegal or unethical activities.

Unauthorized use includes:

• Employees deliberately modifying data, such as giving themselves a raise

• Taking money from someone’s account

• Checking personal email or playing computer games on company time

• Software piracy - the unauthorized copying of software.

The effects of unauthorized access and use are as follows:

• Loss of sales due to piracy. Competing entity could use data against your company

• Loss of time

• Identity theft

• Also leads to theft of intellectual property [9], theft of marketing information (e.g., customer lists, pricing data, or marketing plans), or blackmail based on information gained from computerized files (e.g., medical information, personal history, or sexual preference).

6. Theft and vandalism

A computer can be physically stolen or destroyed. This also causes loss of data.

The effects of theft and vandalism include:

• Loss of computer and data (and time to re-enter etc.)

• Illegal access to files

• Loss of income due to software piracy.

[9] Intellectual property refers to the category of intangible (non-physical) property comprising primarily copyright, moral rights related to copyrighted materials, trademark, patent and industrial design.

Database protection methods - backup and restore methods

Backup is the key – the ultimate safeguard

Regardless of the precautions that you take, things can still go wrong. Backup is therefore the main risk management solution. A backup is a duplicate of a file or disk that can be used if the original is lost, damaged, or destroyed. If your computer fails you can restore from the backup. The following describes the different types of backup.

• Full – backup that copies all of the files in a computer (also called archival backup)

• Incremental – backup that copies only the files that have changed since the last full or last incremental backup

• Differential – backup that copies only the files that have changed since the last full backup

• Selective – backup that allows a user to choose specific files to back up, regardless of whether or not the files have changed since the last backup

• Grandfather, Father, Son (or Three-generation backup) – backup method in which you recycle 3 sets of backups. The oldest backup is called the grandfather, the middle backup is the father and the latest backup is called the son. Each time that you back up, you reuse the oldest backup medium. The father then becomes the grandfather, the son becomes the father and the new backup becomes the son. This method allows you to have the last 3 backups at all times.
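The difference between the backup types comes down to which reference point a file's modification time is compared against. The sketch below is a simplified illustration: the function names, the file names and the use of day numbers as timestamps are all assumptions, and no real backup tool's interface is implied.

```python
def files_to_back_up(files, kind, last_full, last_backup):
    """Pick the files for one backup run.

    files       -- {file name: time of last modification}
    kind        -- 'full', 'incremental' or 'differential'
    last_full   -- time of the last full backup
    last_backup -- time of the last full or incremental backup
    """
    if kind == "full":
        return sorted(files)
    if kind == "incremental":
        return sorted(f for f, changed in files.items() if changed > last_backup)
    if kind == "differential":
        return sorted(f for f, changed in files.items() if changed > last_full)
    raise ValueError(f"unknown backup kind: {kind}")

def rotate_generations(generations, new_backup):
    """Grandfather-father-son rotation: drop the oldest set and add the newest."""
    _grandfather, father, son = generations
    return (father, son, new_backup)  # the old grandfather's medium is reused

# Times are day numbers: full backup on day 1, incremental backup on day 3.
files = {"payroll.db": 0, "students.db": 2, "notes.txt": 4}
print(files_to_back_up(files, "incremental", last_full=1, last_backup=3))
print(files_to_back_up(files, "differential", last_full=1, last_backup=3))
print(rotate_generations(("Mon", "Tue", "Wed"), "Thu"))
```

A differential backup grows until the next full backup but needs only two sets to restore (the full and the latest differential); an incremental backup stays small but a restore needs the full backup plus every incremental taken since.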

Integrity Preservation – keys (primary and foreign), data validation, authority levels

Keys

Since primary keys do not allow null or duplicate values, they prevent the data entry person from entering the same record more than once or from entering a record with no unique identifier. Since the DBMS normally creates an index on the primary key automatically, it can also locate records faster. The power of a database system comes from its ability to quickly find and bring together information stored in separate tables using queries, forms, and reports. In order to do this, each table should include a field or set of fields that uniquely identifies each record stored in the table.

• Uniqueness of key - This prevents duplication. E.g. no two students should have the same id number.

• Referential integrity (foreign key must match) – Ensures that related records in separate tables have a match on the common field.
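Both rules can be watched in action by letting the DBMS reject bad rows. The sketch below uses SQLite from Python; the `student` and `enrolment` tables and the helper `try_insert` are hypothetical examples, not part of the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE student (student_id TEXT PRIMARY KEY NOT NULL, name TEXT)")
conn.execute("CREATE TABLE enrolment ("
             "student_id TEXT REFERENCES student (student_id), "
             "course TEXT, "
             "PRIMARY KEY (student_id, course))")

def try_insert(sql, params):
    """Run an INSERT and report whether the DBMS accepted it."""
    try:
        with conn:
            conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError as error:
        return f"rejected ({error})"

results = [
    try_insert("INSERT INTO student VALUES (?, ?)", ("100001", "Ann")),
    try_insert("INSERT INTO student VALUES (?, ?)", ("100001", "Ben")),  # duplicate key
    try_insert("INSERT INTO student VALUES (?, ?)", (None, "Cal")),      # null key
    try_insert("INSERT INTO enrolment VALUES (?, ?)", ("100001", "CSYS2404")),
    try_insert("INSERT INTO enrolment VALUES (?, ?)", ("999999", "CSYS2404")),  # no such student
]
for line in results:
    print(line)
```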

Data Validation

What is data validation? Data validation is the process of comparing data with a set of rules or values to find out if the data is correct.

What is a validation rule? Validation rules, also called validity checks, are checks performed on the data to ensure that the user is entering the correct data.

What is the purpose of a validation rule? Validation rules reduce data entry errors. They do this by limiting what the user is allowed to enter in a particular field.

The various types of validity checks include:

• Valid values (list) – The data in the field is limited to a certain list of values. For example, sex can only be male or female; marital status can only be single, married, widowed or divorced.

• Range check – Determines whether a number is within a specified range (e.g. 5 to 9).

• Alphabetic/numeric check (data type check) – An alphabetic check ensures that users enter only alphabetic data into a field. A numeric check ensures that users enter only numeric data into a field.

• Field size check – Data that is entered into a field can also be limited by size. For example, your student id number is made up of 6 characters. The user should therefore not be allowed to enter a student id number that has more than 6 characters.

• Consistency check – Tests the data in two or more associated fields to ensure that the relationship is logical. For example, the value in a Training_Date field cannot occur earlier in time than the value in the Date_Joined field.

• Completeness check – Verifies that a required field contains data. For example, every student must have a first and last name entered.

• Check digit – A number or character that is appended to or inserted into a primary key value. A check digit often confirms the accuracy of a primary key value. Bank account, credit card and other identification numbers often include one or more check digits.
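Each validity check can be written as a small function that returns true or false. The sketch below is illustrative only: all function names are assumptions, and the check digit test uses the Luhn algorithm (the scheme used on credit card numbers) as one example of how a check digit can work; other identification numbers use other schemes.

```python
import datetime

def valid_value(value, allowed):
    """Valid values (list) check."""
    return value in allowed

def in_range(number, low, high):
    """Range check, inclusive at both ends."""
    return low <= number <= high

def is_alphabetic(text):
    """Alphabetic check: letters only."""
    return text.isalpha()

def is_numeric(text):
    """Numeric check: digits only."""
    return text.isdigit()

def fits_field(text, max_size):
    """Field size check."""
    return len(text) <= max_size

def dates_consistent(date_joined, training_date):
    """Consistency check: training cannot come before joining."""
    return training_date >= date_joined

def is_complete(record, required_fields):
    """Completeness check: every required field holds a non-blank value."""
    for field in required_fields:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            return False
    return True

def luhn_valid(number):
    """Check digit test using the Luhn algorithm."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:   # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(valid_value("married", {"single", "married", "widowed", "divorced"}))
print(in_range(7, 5, 9), fits_field("1234567", 6))
print(dates_consistent(datetime.date(2010, 1, 4), datetime.date(2010, 2, 1)))
print(luhn_valid("79927398713"), luhn_valid("79927398710"))
```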

Authority Levels

Authority levels are used to limit access (only certain users can perform certain tasks). This is done for example through login ids and passwords. One user may have Add/Change authority while another has Delete authority.
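In its simplest form an authority level is a lookup from a user to the set of operations that user may perform. The user names and the `is_authorised` function below are hypothetical; a real DBMS stores this information in its own security tables.

```python
# Hypothetical users and the operations each one is authorised to perform.
AUTHORITY = {
    "clerk":      {"add", "change"},
    "supervisor": {"add", "change", "delete"},
}

def is_authorised(user, operation):
    """Return True only if `operation` is within the user's authority level."""
    return operation in AUTHORITY.get(user, set())

print(is_authorised("clerk", "add"))
print(is_authorised("clerk", "delete"))
print(is_authorised("guest", "add"))   # unknown users get no access at all
```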

Security Control – unauthorized access and use, encryption, anti-virus, firewall, SQL views

A security control is an action taken either to prevent a data security risk from happening or to reduce its effects. Security controls help to preserve the integrity of data. Recall that unauthorized access is the use of a computer or network without permission, while unauthorized use is the use of a computer or its data for unapproved or possibly illegal activities (e.g. playing games or surfing the net on company time). Security controls include:

• Data validation

• Reduction of human interaction (because humans make mistakes). In other words, automate as many processes as possible. For example, use a bar code reader to scan in the items rather than have the cashier type in the item code.

• Training of users so that human error is reduced.

• Supervision of children and inexperienced users.

• Separation of duties (e.g. one person, such as a cashier, enters the data and another person is needed to change it). This is in order to prevent employees from making mistakes, committing fraud or stealing from the company.

• Backup - in case the hardware fails, or you get a virus or other problem that causes loss of files. An offsite backup protects in cases of disaster. An offsite backup is one that is not at the same location as the computer. You can also use mirrored disks, in which data is saved to more than one disk; if one disk crashes the other takes over. [Research RAID]

• Buy quality hardware from a reputable dealer to reduce likelihood of hardware failure.

• Get a warranty period when purchasing a computer – a computer that has a technical error can therefore be fixed free of cost

• Air conditioning – to keep the computer cool

• Plastic dust covers to keep dust out of diskette drives etc.

• Proper (sturdy) desk on which to store computer

• No magnets/don’t open shutter and other proper diskette care procedures to prevent data from being erased

• Proper maintenance (care) – e.g. defrag, cleaning computer

• Regular testing of hardware and software

• Virus protection - e.g. McAfee, Norton Antivirus. Anti-virus software detects and removes viruses. The software must however be updated regularly as new viruses are invented each day. Write protection of diskettes if not saving (only reading) so as not to get a virus.

• Limit software downloads to reduce the likelihood of getting a virus.

• Use only authorized media for loading data and software.

• Do not open unknown email and attachments to avoid getting a virus.

• Use a firewall - a program and/or hardware that filters the data coming through the internet to prevent unauthorized access. Some firewalls protect systems from viruses, junk email (spam). (e.g. Black Ice, Zone Alarm)

• Place computer site in a good location (e.g. not on a hillside or near the sea)

• Strong, weatherproof facilities (no windows, fireproof)

• No food/drink around the computer – no insects, spills on keyboard etc

• Raised (false) floors – Similar to a false ceiling except this is below your feet. It is used for earthquake protection as it works as a shock absorber. Raised floors also allow you to hide cables below.

• UPS (Uninterruptible Power Supply) – This has a battery which charges while there is power. It gives you time to shut down the computer properly when there is a power cut. This is different from a generator which is used during a power cut and runs on gas. It allows you to continue using the computer for as long as there is gas. The UPS is important because improper shutdown can corrupt files. The UPS also provides protection from power surges.

• Surge protectors to protect against low voltage, power surge/spike, lightning etc.

• Lightning rod to protect the building and all electrical devices within the building from lightning storms.

• Fire extinguishers – of a type suited to electrical equipment (e.g. carbon dioxide or another clean agent rather than water or foam). These will not damage the computers whereas water would cause damage.

• Insurance of equipment in order to re-purchase if your computer is destroyed.

• Access codes, passwords – to prevent unauthorized access and use. Use biometric devices – e.g. retinal scan, fingerprint scan, voice recognition

• Intrusion detection software – detects if you put in the wrong password more than 3 times and kicks you off. (This is what happens when you try to put in a false telephone card number, or the wrong PIN for your debit card at the ATM.)

• Audit trails and logs - audit trails keep track of what a user does while on the system, while log systems keep track of user sign on/off

• Physical security – e.g. locks, guards, grills etc. Physical isolation of data

• Encryption of data - encoding data so that it means nothing to hackers if they get into the system.

• Time and Location controls – User can only use system at certain times and in certain locations (can’t hide and do wrong things)

• Proper distribution and disposal - reports should be distributed to the correct users; this reduces unauthorized access and use. Shred reports and do not just throw them in the garbage (e.g. do not throw away credit card statements; shredding prevents persons from going through your garbage and getting your private information).

• Use reputable web sites so that your credit card number is not stolen. Use secure sites (shown by a lock icon in the browser).

• Copyright and License agreements – so that you have the right to sue persons who steal your software/data. (Patents/Trademarks)

• Auditing the programs that are written, in case an unscrupulous employee has deliberately put in code for his or her own benefit.

• Callback systems – the user can connect to the computer only after the computer calls the user back at a previously established telephone number.

• Metal detectors to prevent hardware theft

• Lock the computer to the desk

• Low profile facilities (no overt disclosure of high-value nature of site, in other words do not display a sign to let persons know where your computer facilities are)

• Mark your computers in a secret place so that you can identify them if the police find them. (Keep the receipt/invoice as proof of purchase and to have a record of the serial number).

• Views/Virtual tables – the user is able to see only certain fields/records. Grant and Revoke – allow users to have only certain types of privileges, e.g. update, select, delete
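A view used as a security control can be sketched as follows. The example uses SQLite from Python, and the `employee` table, its rows and the view name `sales_staff` are invented for illustration. SQLite has no user accounts, so the privilege step is not executed here; in a multi-user DBMS such as Oracle it would be a statement along the lines of GRANT SELECT ON sales_staff TO some_user (with some_user a hypothetical account), later withdrawn with REVOKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, "
             "department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", [
    (1, "Ann", "Sales", 52000),
    (2, "Ben", "Sales", 48000),
    (3, "Cal", "Payroll", 61000),
])

# The view hides the salary field and every record outside the Sales department.
conn.execute("CREATE VIEW sales_staff AS "
             "SELECT emp_id, name FROM employee WHERE department = 'Sales'")

visible = conn.execute("SELECT * FROM sales_staff ORDER BY emp_id").fetchall()
columns = [d[0] for d in conn.execute("SELECT * FROM sales_staff").description]
print(columns, visible)
```

A user who is given access to the view but not to the underlying table can see neither the salaries nor the Payroll records.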

SAMPLE SQL CODE FOR RECREATING DATABASE

-- Remove any existing tables, child tables first so that no foreign key is left dangling.
DROP TABLE dmot_depositor;
DROP TABLE dmot_borrower;
DROP TABLE dmot_account;
DROP TABLE dmot_loan;
DROP TABLE dmot_branch;
DROP TABLE dmot_customer;

-- Create the tables, parent tables first.
CREATE TABLE dmot_branch (
    branch_name      varchar2(20),
    branch_city      varchar2(20),
    assets           number,
    primary key (branch_name));

CREATE TABLE dmot_customer (
    customer_name    varchar2(20),
    customer_street  varchar2(20),
    customer_city    varchar2(20),
    primary key (customer_name));

CREATE TABLE dmot_account (
    account_number   char(5),
    branch_name      varchar2(20),
    balance          number,
    primary key (account_number),
    foreign key (branch_name) references dmot_branch);

CREATE TABLE dmot_loan (
    loan_number      char(5),
    branch_name      varchar2(20),
    amount           number,
    primary key (loan_number),
    foreign key (branch_name) references dmot_branch);

CREATE TABLE dmot_depositor (
    account_number   char(5),
    customer_name    varchar2(20),
    primary key (customer_name, account_number),
    foreign key (customer_name) references dmot_customer,
    foreign key (account_number) references dmot_account);

CREATE TABLE dmot_borrower (
    loan_number      char(5),
    customer_name    varchar2(20),
    primary key (customer_name, loan_number),
    foreign key (customer_name) references dmot_customer,
    foreign key (loan_number) references dmot_loan);

-- Load the sample data, again parent tables first.
INSERT INTO dmot_branch VALUES ('Brooklyn Heights', 'Brooklyn', 200000000);
INSERT INTO dmot_branch VALUES ('Park Slope', 'Brooklyn', 150000000);
INSERT INTO dmot_branch VALUES ('East Village', 'New York', 300000000);
INSERT INTO dmot_branch VALUES ('Jamaica', 'Jamaica', 180000000);
INSERT INTO dmot_branch VALUES ('SOHO', 'New York', 220000000);

INSERT INTO dmot_customer VALUES ('Adams', 'Jay St', 'Brooklyn');
INSERT INTO dmot_customer VALUES ('Bob', '112th St', 'Jamaica');
INSERT INTO dmot_customer VALUES ('Christina', '7th Ave', 'Brooklyn');
INSERT INTO dmot_customer VALUES ('Johnson', 'Broadway', 'New York');
INSERT INTO dmot_customer VALUES ('Joe', 'Park Ave', 'New York');
INSERT INTO dmot_customer VALUES ('Susan', 'Canal St', 'New York');

INSERT INTO dmot_account VALUES ('A-101', 'East Village', 500000);
INSERT INTO dmot_account VALUES ('A-102', 'Jamaica', 200000);
INSERT INTO dmot_account VALUES ('A-103', 'East Village', 150000);
INSERT INTO dmot_account VALUES ('A-104', 'Park Slope', 450000);
INSERT INTO dmot_account VALUES ('A-105', 'East Village', 350000);
INSERT INTO dmot_account VALUES ('A-106', 'Brooklyn Heights', 50000);
INSERT INTO dmot_account VALUES ('A-107', 'Jamaica', 100000);
INSERT INTO dmot_account VALUES ('A-108', 'Park Slope', 220000);

INSERT INTO dmot_loan VALUES ('L-101', 'Park Slope', 120000);
INSERT INTO dmot_loan VALUES ('L-102', 'SOHO', 200000);
INSERT INTO dmot_loan VALUES ('L-103', 'Jamaica', 100000);
INSERT INTO dmot_loan VALUES ('L-104', 'Park Slope', 180000);
INSERT INTO dmot_loan VALUES ('L-105', 'East Village', 100000);
INSERT INTO dmot_loan VALUES ('L-106', 'Jamaica', 150000);

INSERT INTO dmot_depositor VALUES ('A-101', 'Susan');
INSERT INTO dmot_depositor VALUES ('A-102', 'Adams');
INSERT INTO dmot_depositor VALUES ('A-103', 'Joe');
INSERT INTO dmot_depositor VALUES ('A-104', 'Bob');
INSERT INTO dmot_depositor VALUES ('A-105', 'Susan');
INSERT INTO dmot_depositor VALUES ('A-106', 'Johnson');
INSERT INTO dmot_depositor VALUES ('A-107', 'Susan');
INSERT INTO dmot_depositor VALUES ('A-108', 'Bob');

INSERT INTO dmot_borrower VALUES ('L-101', 'Joe');
INSERT INTO dmot_borrower VALUES ('L-102', 'Christina');
INSERT INTO dmot_borrower VALUES ('L-103', 'Johnson');
INSERT INTO dmot_borrower VALUES ('L-104', 'Bob');
INSERT INTO dmot_borrower VALUES ('L-105', 'Adams');
INSERT INTO dmot_borrower VALUES ('L-106', 'Bob');

NB. You will need to create a similar text file and execute it each time you need to recreate your tables and data quickly.

REFERENCES

Date, C. J. Introduction to Database Systems. Addison-Wesley.

Date, C. J. A Guide to The SQL Standard. 4th Ed. Addison-Wesley.

Entity Relationship Model. [On-line]. Available: http://en.wikipedia.org/wiki/Entity-relationship_diagram.

Gertz, Michael. Oracle/SQL Tutorial. Database and Information Systems Group, Department of Computer Science, University of California, Davis. Available: http://www.db.cs.ucdavis.edu.

Helman, Paul. The Science of Database Management. Irwin.

Peter, Hadrian Dr. Database Management Systems Lecture Notes. UWI. Cave Hill.

Rob, Peter, Coronel, Carlos. Database Systems: Design, Implementation and Management. 3rd Ed. Thomson Publishing.

Scarlett, H. (2005). Database Management Lecture Notes.

Shelly, G., Cashman, T. Discovering Computers 2006. Thomson.