DBMS MBA Notes
Bangalore University MBA 3rd Semester DBMS Full Notes

UNIT 1

MODULE 1: INTRODUCTION
Database systems – Definition – Components – Advantages – Objectives – Evolution.

Definition: DBMS

A database management system (DBMS) is a system that allows users to build and maintain databases, to work with their data, and to retrieve information from them. A DBMS defines the database type that it supports, as well as its functionality and operational capabilities. A DBMS provides the internal processes for the external applications built on it.

A DBMS consists of software that operates databases, providing storage, access, security, backup and other facilities to meet the needed requirements.

A database management system (DBMS) is a collection of programs that manages the database structure and controls access to the data stored in the database. In a sense, a database resembles a very well-organized electronic filing cabinet in which powerful software, known as a database management system, helps manage the cabinet’s contents. Well-known DBMSs include Oracle, IBM DB2, Microsoft SQL Server, PostgreSQL, MySQL and SQLite.

Evolution of database and DBMS technology

The database concept has evolved since the 1960s to ease increasing difficulties in designing, building, and maintaining complex information systems (typically with many concurrent end-users, and with a diverse, large amount of data). It has evolved together with database management systems, which enable the effective handling of databases. Though the terms database and DBMS define different entities, they are inseparable: a database's properties are determined by its supporting DBMS and vice versa. The Oxford English Dictionary cites a 1962 technical report as the first to use the term "data-base." With the progress in technology in the areas of processors, computer memory, computer storage, and computer networks, the sizes, capabilities, and performance of databases and their respective DBMSs have grown by orders of magnitude. For decades it has been unlikely that a complex information system could be built effectively without a proper database supported by a DBMS. The use of databases is now so widespread that virtually every technology and product relies on databases and DBMSs for its development and commercialization, or may even have them embedded in it. Also, organizations and companies, from small to large, depend heavily on databases for their operations.

The introduction of the term database coincided with the availability of direct-access storage (disks and drums) from the mid-1960s onwards. The term represented a contrast with the tape-based systems of the past, allowing shared interactive use rather than daily batch processing.


In the earliest database systems, efficiency was perhaps the primary concern, but it was already recognized that there were other important objectives. One of the key aims was to make the data independent of the logic of application programs, so that the same data could be made available to different applications.

The first generation of database systems were navigational: applications typically accessed data by following pointers from one record to another. The two main data models at this time were the hierarchical model, epitomized by IBM's IMS system, and the CODASYL (network) model, implemented in a number of products such as IDMS.

The relational model, first proposed in 1970 by Edgar F. Codd, departed from this tradition by insisting that applications should search for data by content rather than by following links. This was considered necessary to allow the content of the database to evolve without constant rewriting of applications. Relational systems placed heavy demands on processing resources, and it was not until the mid-1980s that computing hardware became powerful enough to allow them to be widely deployed. By the early 1990s, however, relational systems were dominant for all large-scale data processing applications, and they remain dominant today (as of 2012) except in niche areas. The dominant database language is standard SQL for the relational model, which has also influenced database languages for other data models.

Because the relational model emphasizes search rather than navigation, it does not make relationships between entities explicit in the form of pointers, but represents them instead using primary keys and foreign keys. While this is a good basis for a query language, it is less well suited as a modelling language. For this reason a different model, the entity-relationship model, which emerged shortly afterwards (1976), gained popularity for database design.

In the period since the 1970s database technology has kept pace with the increasing resources becoming available from the computing platform: notably the rapid increase in the capacity and speed (and reduction in price) of disk storage, and the increasing capacity of main memory. This has enabled ever larger databases and higher throughputs to be achieved.

The rigidity of the relational model, in which all data is held in tables with a fixed structure of rows and columns, has increasingly been seen as a limitation when handling information that is richer or more varied in structure than the traditional 'ledger-book' data of corporate information systems: for example, document databases, engineering databases, multimedia databases, or databases used in the molecular sciences. Various attempts have been made to address this problem, many of them gathering under banners such as post-relational or NoSQL. Two developments of note are the Object database and the XML database. The vendors of relational databases have fought off competition from these newer models by extending the capabilities of their own products to support a wider variety of data types.

General-purpose DBMS


A DBMS has evolved into a complex software system, and its development typically requires thousands of person-years of effort. Some general-purpose DBMSs, such as Oracle, Microsoft SQL Server, and IBM DB2, have been undergoing upgrades for thirty years or more. General-purpose DBMSs aim to satisfy as many applications as possible, which typically makes them even more complex than special-purpose databases. However, the fact that they can be used "off the shelf", as well as their cost being amortized over many applications and instances, makes them an attractive alternative (vs. one-time development) whenever they meet an application's requirements.

Though attractive in many cases, a general-purpose DBMS is not always the optimal solution. When certain applications are pervasive, with many operating instances each serving many users, a general-purpose DBMS may introduce unnecessary overhead and too large a "footprint" (a large amount of unnecessary, unutilized software code). Such applications usually justify dedicated development. Typical examples are email systems: although they need certain DBMS properties, they are built in a way that optimizes the handling and management of email messages and do not need significant portions of general-purpose DBMS functionality.

ADVANTAGES OF DBMS

A DBMS (Database Management System) is preferred over the conventional file processing system due to the following advantages:

1. Controlling Data Redundancy - In the conventional file processing system, every user group maintains its own files for handling its data. This may lead to:

• Duplication of same data in different files.
• Wastage of storage space, since duplicated data is stored.
• Errors may be generated due to updating of the same data in different files.
• Time in entering data again and again is wasted.
• Computer resources are needlessly used.
• It is very difficult to combine information.

2. Elimination of Inconsistency - In the file processing system, information is duplicated throughout the system, so changes made in one file may need to be carried over to another file. Failure to do so may lead to inconsistent data. We therefore need to remove this duplication of data in multiple files to eliminate inconsistency.

For example, consider a student result system. Suppose the STUDENT file indicates that Roll No. 10 has opted for the 'Computer' course, but the RESULT file indicates that Roll No. 10 has opted for the 'Accounts' course. The two entries for a particular student do not agree with each other, and the database is said to be in an inconsistent state. To eliminate this conflicting information we need to centralize the database. On centralizing the database the duplication will be controlled and hence inconsistency will be removed.


Data inconsistencies are often encountered in everyday life. Consider another example: we have all come across situations when a new address is communicated to an organization that we deal with (e.g. a telecom operator, gas company or bank), and we find that some of the communications from that organization are received at the new address while others continue to be mailed to the old address. Combining all the data in a database reduces redundancy as well as inconsistency, and so is likely to reduce the costs of collecting, storing and updating data.

Let us again consider the example of Result system.

Suppose that a student with Roll No. 201 changes his course from 'Computer' to 'Arts'. The change is made in the SUBJECT file but not in the RESULT file. This may lead to inconsistency of the data. So we need to centralize the database so that changes, once made, are reflected in all the tables where a particular field is stored. The update is then propagated automatically; this is known as propagating updates.

3. Better service to the users - A DBMS is often used to provide better services to the users. In a conventional system, availability of information is often poor, since it is normally difficult to obtain information that the existing systems were not designed for. Once several conventional systems are combined to form one centralized database, the availability of information and its currency are likely to improve, since the data can now be shared and the DBMS makes it easy to respond to unanticipated information requests.

Centralizing the data in the database also means that users can easily obtain new and combined information that would have been impossible to obtain otherwise. Also, use of a DBMS should allow users who do not know programming to interact with the data more easily, unlike a file processing system where the programmer may need to write new programs to meet every new demand.

4. Flexibility of the System is Improved - Since changes are often necessary to the contents of the data stored in any system, these changes are made more easily in a centralized database than in a conventional system. Application programs need not be changed when data in the database changes.

5. Integrity can be improved - Since data in a database approach is centralized and is used by a number of users at a time, it is essential to enforce integrity constraints.

In conventional systems, because the data is duplicated in multiple files, updates or changes may sometimes lead to entry of incorrect data in some of the files where it exists.

For example, consider the result system already discussed. Since multiple files are maintained, you may sometimes enter a value for course which does not exist. Suppose course can have the values (Computer, Accounts, Economics, and Arts) but we enter the value 'Hindi' for it; this leads to inconsistent data and a lack of integrity.

Even if we centralize the database, it may still contain incorrect data. For example:

• The salary of a full-time employee may be entered as Rs. 500 rather than Rs. 5000.
• A student may be shown to have borrowed books but has no enrolment.
• A list of employee numbers for a given department may include the numbers of nonexistent employees.

These problems can be avoided by defining validation procedures that are applied whenever any update operation is attempted.

6. Standards can be enforced - Since all access to the database must be through the DBMS, standards are easier to enforce. Standards may relate to the naming of data, the format of data, the structure of the data, etc. Standardizing stored data formats is usually desirable for the purpose of data interchange or migration between systems.

7. Security can be improved - In conventional systems, applications are developed in an ad hoc manner. Often different systems of an organization access different components of the operational data, and in such an environment enforcing security can be quite difficult. Setting up a database makes it easier to enforce security restrictions, since the data is now centralized. It is easier to control who has access to which parts of the database, and different checks can be established for each type of access (retrieve, modify, delete, etc.) to each piece of information in the database.

Consider an example from banking, in which employees at different levels may be given access to different types of data in the database. A clerk may be given the authority to know only the names of all the customers who have a loan in the bank, but not the details of each loan a customer may have. This can be accomplished by granting appropriate privileges to each employee.
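
In SQL terms, such restrictions are usually implemented with views and privileges. A minimal sketch, assuming illustrative table, view and user names:

-- A view exposing only the names of loan-holding customers, not loan details.
CREATE VIEW loan_customer_names AS
    SELECT DISTINCT customer_name
    FROM borrower;

-- The clerk can query the view but receives no privilege on the loan table itself.
GRANT SELECT ON loan_customer_names TO clerk;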

8. Organization's requirements can be identified - All organizations have sections and departments, and each of these units often considers its own work, and therefore its own needs, as the most important. Once a database has been set up with centralized control, it becomes necessary to identify the organization's requirements and to balance the needs of the competing units. It may even become necessary to ignore some requests for information if they conflict with higher-priority needs of the organization.

It is the responsibility of the DBA (Database Administrator) to structure the database system to provide the overall service that is best for an organization.

For example, a DBA must choose the best file structure and access method to give fast response to highly critical applications, as compared to less critical applications.


9. Overall cost of developing and maintaining systems is lower - It is much easier to respond to unanticipated requests when data is centralized in a database than when it is stored in a conventional file system. Although the initial cost of setting up a database can be large, one normally expects the overall cost of setting up the database and developing and maintaining application programs to be far lower than for a similar service using conventional systems, since the productivity of programmers can be higher when using the non-procedural languages that come with a DBMS than when using procedural languages.

10. Data Model must be developed - Perhaps the most important advantage of setting up a database system is the requirement that an overall data model for the organization be built. In conventional systems, it is more likely that files will be designed as the needs of particular applications demand, and the overall view is often not considered. Building an overall view of an organization's data is usually cost-effective in the long term.

11. Provides backup and Recovery - Centralizing a database makes it possible to provide schemes for backup and recovery from failures, including disk crashes, power failures and software errors, which can return the database from an inconsistent state to the state that existed prior to the failure, though the methods involved are complex.

Disadvantages of Database: Although the database system yields considerable advantages over previous data management approaches, database systems do carry significant disadvantages. For example:

1. Increased costs. Database systems require sophisticated hardware and software and highly skilled personnel. The cost of maintaining the hardware, software, and personnel required to operate and manage a database system can be substantial. Training, licensing, and regulation compliance costs are often overlooked when database systems are implemented.

2. Management complexity. Database systems interface with many different technologies and have a significant impact on a company’s resources and culture. The changes introduced by the adoption of a database system must be properly managed to ensure that they help advance the company’s objectives. Given the fact that database systems hold crucial company data that are accessed from multiple sources, security issues must be assessed constantly.

3. Maintaining currency. To maximize the efficiency of the database system, you must keep your system current. Therefore, you must perform frequent updates and apply the latest patches and security measures to all components. Because database technology advances rapidly, personnel training costs tend to be significant.

4. Vendor dependence. Given the heavy investment in technology and personnel training, companies might be reluctant to change database vendors. As a consequence, vendors are less likely to offer pricing point advantages to existing customers, and those customers might be limited in their choice of database system components.


5. Frequent upgrade/replacement cycles. DBMS vendors frequently upgrade their products by adding new functionality. Such new features often come bundled in new upgrade versions of the software. Some of these versions require hardware upgrades. Not only do the upgrades themselves cost money, but it also costs money to train database users and administrators to properly use and manage the new features.

The major characteristics of database approach are:

• Self-describing Nature of a Database System

• Insulation between Programs and Data, and Data Abstraction

• Support of Multiple Views of the Data

• Sharing of Data and Multi user Transaction Processing

Self-contained nature

A DBMS contains data plus a full description of that data, called metadata. Metadata is data about data: data formats, record structures, locations, how to access the data, indexes, and so on. Metadata is stored in a catalogue and is used by the DBMS software to know how to access the data. Contrast this with the file processing approach, where application programs need to know the structure and format of records and data.

Program-data independence

Data independence is immunity of application programs to changes in storage structures and access techniques, e.g. adding a new field, changing an index structure, or changing the data format. In a DBMS environment these changes are reflected in the catalogue, so applications are not affected. Traditional file processing programs would all have to change, possibly substantially.

Data abstraction

A DBMS provides users with a conceptual representation of data (for example, as objects with properties and inter-relationships). Storage details are hidden. Conceptual representation is provided in terms of a data model.

Support for multiple views

A DBMS may allow different users to see different "views" of the database, according to the perspective each one requires. A view may be a subset of the data: for example, the people using the payroll system need not (and should not) see data about students and class schedules. A view may also present data in a different form from the way it is stored: for example, someone interested in student transcripts might get a view formed by combining information from separate files or tables.
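
As a sketch (table, column and view names are illustrative), each of these views can be defined over the stored tables without duplicating the data:

-- Payroll users see only employee pay data, not student or schedule tables.
CREATE VIEW payroll_view AS
    SELECT employee_id, employee_name, salary
    FROM employee;

-- A transcript view combines information from two tables into one presentation.
CREATE VIEW transcript_view AS
    SELECT s.matric_no, s.name, e.course_code, e.grade
    FROM student s
    JOIN enrolment e ON e.matric_no = s.matric_no;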


Centralized control of the data resource

The DBMS provides centralized control of data in an organization.

This brings a number of advantages:

(a) Reduces redundancy

(b) Avoids inconsistencies

(c) Data can be shared

(d) Standards can be enforced

(e) Security restrictions can be applied

(f) Integrity can be maintained

Redundancy and Inconsistencies

Redundancy is unnecessary duplication of data. For example if accounts department and registration department both keep student name, number and address.

Redundancy wastes space and duplicates effort in maintaining the data.


Redundancy also leads to inconsistency.

Inconsistent data is data which contradicts itself - e.g. two different addresses for a given student number. Inconsistency cannot occur if data is represented by a single entry (i.e. if there is no redundancy).

Controlled redundancy: Some redundancy may be desirable (for efficiency). A DBMS should be aware of it, and take care of propagating updates to all copies of a data item.

This is an objective, not yet currently supported.

Sharing

• Need concurrency control

• Multiple user views

Standards

E.g. data formats, record structures, naming, documentation

International, organizational, departmental ... standards

Security

- restricting unauthorized access

DBMS should perform security checks on all accesses.

Integrity

Maintaining validity of data;

E.g. employee numbers must be in some range

E.g. every course must have an instructor

E.g. student number must be unique

E.g. hours worked cannot be more than 150

These things are expressed as constraints.

DBMS should perform integrity checks on all updates. Currently DBMSs provide limited integrity checks.
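
The example constraints above map directly onto declarative SQL constraints; a minimal sketch with assumed table and column names:

CREATE TABLE employee (
    emp_no       INT PRIMARY KEY CHECK (emp_no BETWEEN 1000 AND 9999),  -- numbers in some range
    emp_name     VARCHAR(50) NOT NULL,
    hours_worked INT CHECK (hours_worked <= 150)                        -- hours cannot exceed 150
);

CREATE TABLE course (
    course_code   VARCHAR(10) PRIMARY KEY,
    instructor_no INT NOT NULL REFERENCES employee(emp_no)              -- every course must have an instructor
);

CREATE TABLE student (
    student_no   INT PRIMARY KEY,                                       -- student number must be unique
    student_name VARCHAR(50)
);

The DBMS then rejects any insert or update that violates these declarations.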


MODULE 2: DBMS Architecture – Associations – Relationships – Mappings between Abstractions – Generalisation – Classifications – Conceptual Data Modeling – File Organization – Data Structure – Data Models: HDBMS, NDBMS, RDBMS, OODBMS.

DBMS Architecture

Components of a DBMS

Database Management System (DBMS)

DBMSs are the technology tools that directly support managing organizational data. With a DBMS you can create a database including its logical structure and constraints, you can manipulate the data and information it contains, or you can directly create a simple database application or reporting tool. Human administrators, through a user interface, perform certain tasks with the tool such as creating a database, converting an existing database, or archiving a large and growing database. Business applications, which perform the higher level tasks of managing business processes, interact with end users and other applications and, to store and manage data, rely on and directly operate their own underlying database through a standard programming interface like ODBC.

The following diagram illustrates the five components of a DBMS.

Database Engine:


The Database Engine is the core service for storing, processing, and securing data. The Database Engine provides controlled access and rapid transaction processing to meet the requirements of the most demanding data consuming applications within your enterprise. Use the Database Engine to create relational databases for online transaction processing or online analytical processing data. This includes creating tables for storing data, and database objects such as indexes, views, and stored procedures for viewing, managing, and securing data. You can use SQL Server Management Studio to manage the database objects, and SQL Server Profiler for capturing server events.

Data dictionary:

A data dictionary is a reserved space within a database which is used to store information about the database itself. A data dictionary is a set of tables and views which can only be read and never altered. Most data dictionaries contain different information about the data used in the enterprise. In terms of the database representation of the data, the data dictionary defines all schema objects, including views, tables, clusters, indexes, sequences, synonyms, procedures, packages, functions, triggers and many more. This ensures that all these objects follow one standard defined in the dictionary. The data dictionary also records how much space has been allocated for, and/or is currently in use by, each schema object. A data dictionary is used when finding information about users, objects, schemas and storage structures. Every time a data definition language (DDL) statement is issued, the data dictionary is modified.

A data dictionary may contain information such as:

• Database design information
• Stored SQL procedures
• User permissions
• User statistics
• Database process information
• Database growth statistics
• Database performance statistics
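
Many DBMSs expose parts of the data dictionary through queryable catalogue views; the standard INFORMATION_SCHEMA is one common form (Oracle uses its own dictionary views instead). A sketch, with the table name assumed:

SELECT table_name, column_name, data_type
FROM   information_schema.columns
WHERE  table_name = 'employee';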

Query Processor:

A relational database consists of many parts, but at its heart are two major components: the storage engine and the query processor.

The storage engine writes data to and reads data from the disk. It manages records, controls concurrency, and maintains log files.

The query processor accepts SQL syntax, selects a plan for executing the syntax, and then executes the chosen plan. The user or program interacts with the query processor, and the query processor in turn interacts with the storage engine. The query processor isolates the user from the details of execution: the user specifies the result, and the query processor determines how this result is obtained. The query processor components include the DDL interpreter, the DML compiler and the query evaluation engine.
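
Most products let you inspect the plan the query processor has chosen; the statement name varies by vendor (EXPLAIN in MySQL and PostgreSQL, for example), and the table used here is assumed:

EXPLAIN
SELECT customer_name
FROM   customer
WHERE  customer_id = 1042;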

Mappings between Abstractions

The conceptual/internal mapping:

o defines conceptual and internal view correspondence
o specifies mapping from conceptual records to their stored counterparts

An external/conceptual mapping:
o defines a particular external and conceptual view correspondence

A change to the storage structure definition means that the conceptual/internal mapping must be changed accordingly, so that the conceptual schema may remain invariant, achieving physical data independence.

A change to the conceptual definition means that the conceptual/external mapping must be changed accordingly, so that the external schema may remain invariant, achieving logical data independence.

The architecture for DBMSs is divided into three general levels:

• external
• conceptual
• internal


External View

A user is anyone who needs to access some portion of the data. Users may range from application programmers to casual users with ad hoc queries. Each user has a language at his/her disposal.

The application programmer may use a high level language ( e.g. COBOL) while the casual user will probably use a query language.

Regardless of the language used, it will include a data sublanguage (DSL), which is that subset of the language concerned with the storage and retrieval of information in the database, and which may or may not be apparent to the user.

A DSL is a combination of two languages:

a data definition language (DDL) - provides for the definition or description of database objects

a data manipulation language (DML) - supports the manipulation or processing of database objects.
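
A brief contrast, with an illustrative table: the first statement is DDL (it defines a database object), and the remaining statements are DML (they manipulate its contents):

CREATE TABLE department (             -- DDL: defines the object
    dept_no   INT PRIMARY KEY,
    dept_name VARCHAR(40)
);

INSERT INTO department VALUES (10, 'Accounts');                    -- DML: adds data
UPDATE department SET dept_name = 'Finance' WHERE dept_no = 10;    -- DML: changes data
SELECT dept_no, dept_name FROM department;                         -- DML: retrieves data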

Each user sees the data in terms of an external view, defined by an external schema consisting basically of descriptions of each of the various types of external record in that external view, together with a definition of the mapping between the external schema and the underlying conceptual schema.

Conceptual View

An abstract representation of the entire information content of the database. It is in general a view of the data as it actually is; that is, it is a 'model' of the 'real world'. It consists of multiple occurrences of multiple types of conceptual record, defined in the conceptual schema.


To achieve data independence, the definitions of conceptual records must involve information content only:

• storage structure is ignored
• access strategy is ignored

In addition to definitions, the conceptual schema contains authorisation and validation procedures.

Internal View

The internal view is a low-level representation of the entire database consisting of multiple occurrences of multiple types of internal (stored) records.

It is, however, at one remove from the physical level, since it does not deal in terms of physical records or blocks, nor with any device-specific constraints such as cylinder or track sizes. Details of the mapping to physical storage are highly implementation-specific and are not expressed in the three-level architecture.

The internal view is described by the internal schema, which defines:

• the various types of stored record
• what indices exist
• how stored fields are represented
• what physical sequence the stored records are in

In effect the internal schema is the storage structure definition.

Here we compare the three levels of data models (conceptual, logical and physical). The table below compares their features:

Feature                Conceptual   Logical   Physical
Entity Names               ✓           ✓
Entity Relationships       ✓           ✓
Attributes                             ✓
Primary Keys                           ✓          ✓
Foreign Keys                           ✓          ✓
Table Names                                       ✓
Column Names                                      ✓
Column Data Types                                 ✓

We can see that the complexity increases from conceptual to logical to physical. This is why we always start with the conceptual data model (so we understand at a high level what the different entities in our data are and how they relate to one another), then move on to the logical data model (so we understand the details of our data without worrying about how they will actually be implemented), and finally the physical data model (so we know exactly how to implement our data model in the database of choice). In a data warehousing project, sometimes the conceptual data model and the logical data model are considered a single deliverable.

DATA MODELS

Hierarchical Model

The hierarchical data model organizes data in a tree structure. There is a hierarchy of parent and child data segments. This structure implies that a record can have repeating information, generally in the child data segments. Data is stored in a series of records, each of which has a set of field values attached to it. All the instances of a specific record are collected together as a record type. These record types are the equivalent of tables in the relational model, with the individual records being the equivalent of rows. To create links between these record types, the hierarchical model uses parent-child relationships, which are 1:N mappings between record types, implemented using trees (much as the relational model borrowed set theory from mathematics). For example, an organization might store information about an employee, such as name, employee number, department and salary, and also information about the employee's children, such as name and date of birth. The employee and children data form a hierarchy, where the employee data represents the parent segment and the children data represents the child segment. If an employee has three children, then there would be three child segments associated with one employee segment. In a hierarchical database the parent-child relationship is one to many, which restricts a child segment to having only one parent segment. Hierarchical DBMSs were popular from the late 1960s, with the introduction of IBM's Information Management System (IMS), through the 1970s.

Network Model


The popularity of the network data model coincided with the popularity of the hierarchical data model. Some data were more naturally modelled with more than one parent per child. So, the network model permitted the modelling of many-to-many relationships in data. In 1971, the Conference on Data Systems Languages (CODASYL) formally defined the network model. The basic data modelling construct in the network model is the set construct. A set consists of an owner record type, a set name, and a member record type. A member record type can have that role in more than one set; hence the multiparent concept is supported. An owner record type can also be a member or owner in another set. The data model is a simple network, and link and intersection record types (called junction records by IDMS) may exist, as well as sets between them. Thus, the complete network of relationships is represented by several pair wise sets; in each set some (one) record type is owner (at the tail of the network arrow) and one or more record types are members (at the head of the relationship arrow). Usually, a set defines a 1: M relationship, although 1:1 is permitted. The CODASYL network model is based on mathematical set theory.

Relational Model

(RDBMS - relational database management system) A database based on the relational model developed by E.F. Codd. A relational database allows the definition of data structures, storage and retrieval operations and integrity constraints. In such a database the data and relations between them are organised in tables. A table is a collection of records and each record in a table contains the same fields.

Properties of Relational Tables:
• Values Are Atomic
• Each Row is Unique
• Column Values Are of the Same Kind
• The Sequence of Columns is Insignificant
• The Sequence of Rows is Insignificant
• Each Column Has a Unique Name

Certain fields may be designated as keys, which means that searches for specific values of that field will use indexing to speed them up. Where fields in two different tables take values from the same set, a join operation can be performed to select related records in the two tables by matching values in those fields. Often, but not always, the fields will have the same name in both tables. For example, an "orders" table might contain (customer-ID, product-code) pairs and a "products" table might contain (product-code, price) pairs, so to calculate a given customer's bill you would sum the prices of all products ordered by that customer by joining on the product-code fields of the two tables. This can be extended to joining multiple tables on multiple fields. Because these relationships are only specified at retrieval time, relational databases are classed as dynamic database management systems. The relational database model is based on relational algebra.
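
The bill calculation described above is a join; a minimal sketch using the tables from the example (column names adapted to SQL style, customer 1001 standing in for "a given customer"):

SELECT SUM(p.price) AS bill
FROM   orders   o
JOIN   products p ON p.product_code = o.product_code
WHERE  o.customer_id = 1001;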


Object/Relational Model

Object/relational database management systems (ORDBMSs) add new object storage capabilities to the relational systems at the core of modern information systems. These new facilities integrate management of traditional fielded data, complex objects such as time-series and geospatial data and diverse binary media such as audio, video, images, and applets. By encapsulating methods with data structures, an ORDBMS server can execute complex analytical and data manipulation operations to search and transform multimedia and other complex objects.

As an evolutionary technology, the object/relational (OR) approach has inherited the robust transaction- and performance-management features of its relational ancestor and the flexibility of its object-oriented cousin. Database designers can work with familiar tabular structures and data definition languages (DDLs) while assimilating new object-management possibilities. Query and procedural languages and call interfaces in ORDBMSs are familiar: SQL3, vendor procedural languages, and ODBC, JDBC, and proprietary call interfaces are all extensions of RDBMS languages and interfaces. And the leading vendors are, of course, quite well known: IBM, Informix, and Oracle.

Object-Oriented Model

Object DBMSs add database functionality to object programming languages. They bring much more than persistent storage of programming language objects. Object DBMSs extend the semantics of the C++, Smalltalk and Java object programming languages to provide full-featured database programming capability, while retaining native language compatibility. A major benefit of this approach is the unification of the application and database development into a seamless data model and language environment. As a result, applications require less code, use more natural data modelling, and code bases are easier to maintain. Object developers can write complete database applications with a modest amount of additional effort.

According to Rao (1994), "The object-oriented database (OODB) paradigm is the combination of object-oriented programming language (OOPL) systems and persistent systems. The power of the OODB comes from the seamless treatment of both persistent data, as found in databases, and transient data, as found in executing programs."

In contrast to a relational DBMS, where a complex data structure must be flattened out to fit into tables or joined together from those tables to form the in-memory structure, object DBMSs have no performance overhead to store or retrieve a web or hierarchy of interrelated objects. This one-to-one mapping of object programming language objects to database objects has two benefits over other storage approaches: it provides higher performance management of objects, and it enables better management of the complex interrelationships between objects. This makes object DBMSs better suited to support applications such as financial portfolio risk analysis systems, telecommunications service applications, World Wide Web document structures, design and manufacturing systems, and hospital patient record systems, which have complex relationships between data.

Data Model: A data model is a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.

Entity: An entity is a “thing” or “object” in the real world that is distinguishable from all other objects. For example, each person in an enterprise is an entity.

Entity set: An entity set is a set of entities of the same type that share the same properties, or attributes. The set of all persons who are customers at a given bank, for example, can be defined as the entity set customer. Similarly, the entity set loan might represent the set of all loans awarded by a particular bank. An entity is represented by a set of attributes. Attributes are descriptive properties possessed by each member of an entity set. The designation of an attribute for an entity set expresses that the database stores similar information concerning each entity in the entity set; however, each entity may have its own value for each attribute.

Simple and composite attributes: Attributes that are not divided into subparts are called simple attributes; attributes that can be divided into subparts are called composite attributes. For example, an attribute name could be structured as a composite attribute consisting of first-name, middle-initial, and last-name.

Single-valued and multivalued attributes: The loan-number attribute for a specific loan entity, for instance, refers to only one loan number; such attributes are said to be single-valued. There may be instances where an attribute has a set of values for a specific entity. Consider an employee entity set with the attribute phone-number: an employee may have zero, one, or several phone numbers, and different employees may have different numbers of phones. This type of attribute is said to be multivalued.

Derived attribute: The value for this type of attribute can be derived from the values of other related attributes or entities. For instance, let us say that the customer entity set has an attribute loans-held, which represents how many loans a customer has from the bank. We can derive the value for this attribute by counting the number of loan entities associated with that customer.

Relationship Sets: A relationship is an association among several entities. A relationship set is a set of relationships of the same type.


Mapping Cardinalities: Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set. Mapping cardinalities are most useful in describing binary relationship sets, although they can contribute to the description of relationship sets that involve more than two entity sets.

One to one. An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A.

One to many. An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can be associated with at most one entity in A.

Many to one. An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with any number (zero or more) of entities in A.

Many to many. An entity in A is associated with any number (zero or more) of entities in B, and an entity in B is associated with any number (zero or more) of entities in A.

Keys: A key allows us to identify a set of attributes that suffice to distinguish entities from each other. Keys also help uniquely identify relationships, and thus distinguish relationships from each other.

superkey: A superkey is a set of one or more attributes that, taken collectively, allow us to identify uniquely an entity in the entity set. For example, the customer-id attribute of the entity set customer is sufficient to distinguish one customer entity from another. Thus, customer-id is a superkey. Similarly, the combination of customer-name and customer-id is a superkey for the entity set customer. The customer-name attribute of customer is not a superkey, because several people might have the same name.

candidate key: Minimal superkeys are called candidate keys. If K is a superkey, then so is any superset of K; we are often interested in superkeys for which no proper subset is a superkey. It is possible that several distinct sets of attributes could serve as a candidate key. Suppose that a combination of customer-name and customer-street is sufficient to distinguish among members of the customer entity set. Then both {customer-id} and {customer-name, customer-street} are candidate keys. Although the attributes customer-id and customer-name together can distinguish customer entities, their combination does not form a candidate key, since the attribute customer-id alone is a candidate key.

primary key: The candidate key chosen by the database designer as the principal means of identifying entities within an entity set, and hence denoting their unique identity, is called the primary key. A key (primary, candidate, or super) is a property of the entity set, rather than of the individual entities: any two individual entities in the set are prohibited from having the same value on the key attributes at the same time. The designation of a key represents a constraint in the real-world enterprise being modeled.
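
In SQL, the chosen candidate key is declared as the PRIMARY KEY and any remaining candidate keys can be declared UNIQUE; a sketch for the customer example above:

CREATE TABLE customer (
    customer_id     INT PRIMARY KEY,           -- the primary key (chosen candidate key)
    customer_name   VARCHAR(50) NOT NULL,
    customer_street VARCHAR(80),
    UNIQUE (customer_name, customer_street)    -- another candidate key, assuming name plus street is unique
);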


Weak Entity Sets: An entity set may not have sufficient attributes to form a primary key. Such an entity set is termed a weak entity set. An entity set that has a primary key is termed a strong entity set.

For a weak entity set to be meaningful, it must be associated with another entity set, called the identifying or owner entity set. Every weak entity must be associated with an identifying entity; that is, the weak entity set is said to be existence dependent on the identifying entity set. The identifying entity set is said to own the weak entity set that it identifies. The relationship associating the weak entity set with the identifying entity set is called the identifying relationship. The identifying relationship is many to one from the weak entity set to the identifying entity set, and the participation of the weak entity set in the relationship is total. In our example, the identifying entity set for payment is loan, and a relationship loan-payment that associates payment entities with their corresponding loan entities is the identifying relationship. Although a weak entity set does not have a primary key, we nevertheless need a means of distinguishing among all those entities in the weak entity set that depend on one particular strong entity. The discriminator of a weak entity set is a set of attributes that allows this distinction to be made.
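
When translated to tables, a weak entity set is usually given the primary key of its identifying entity set plus its own discriminator; a sketch for the loan/payment example, with the discriminator assumed to be a payment number:

CREATE TABLE loan (
    loan_number INT PRIMARY KEY,
    amount      DECIMAL(10,2)
);

CREATE TABLE payment (
    loan_number    INT REFERENCES loan(loan_number),   -- key of the identifying entity set
    payment_number INT,                                -- discriminator of the weak entity set
    payment_amount DECIMAL(10,2),
    PRIMARY KEY (loan_number, payment_number)          -- together they identify one payment
);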

In E-R diagrams, a doubly outlined box indicates a weak entity set, and a doubly outlined diamond indicates the corresponding identifying relationship. In the figure, the weak entity set payment depends on the strong entity set loan via the relationship set loan-payment.

The figure also illustrates the use of double lines to indicate total participation: the participation of the (weak) entity set payment in the relationship loan-payment is total, meaning that every payment must be related via loan-payment to some loan. Finally, the arrow from loan-payment to loan indicates that each payment is for a single loan. The discriminator of a weak entity set is also underlined, but with a dashed, rather than a solid, line.

Specialization: An entity set may include subgroupings of entities that are distinct in some way from other entities in the set. For instance, a subset of entities within an entity set may have attributes that are not shared by all the entities in the entity set. The E-R model provides a means for representing these distinctive entity groupings. Consider an entity set person, with attributes name, street, and city. A person may be further classified as one of the following:

• customer
• employee

Each of these person types is described by a set of attributes that includes all the attributes of entity set person plus possibly additional attributes. For example, customer entities may be described further by the attribute customer-id, whereas employee entities may be described further by the attributes employee-id and salary. The process of designating subgroupings within an entity set is called specialization. The specialization of person allows us to distinguish among persons according to whether they are employees or customers.

Generalization: The design process may also proceed in a bottom-up manner, in which multiple entity sets are synthesized into a higher-level entity set on the basis of common features. The database designer may have first identified a customer entity set with the attributes name, street, city, and customer-id, and an employee entity set with the attributes name, street, city, employee-id, and salary. There are similarities between the customer entity set and the employee entity set in the sense that they have several attributes in common. This commonality can be expressed by generalization, which is a containment relationship that exists between a higher-level entity set and one or more lower-level entity sets. In our example, person is the higher-level entity set and customer and employee are lower-level entity sets. Higher- and lower-level entity sets may also be designated by the terms superclass and subclass, respectively. The person entity set is the superclass of the customer and employee subclasses. For all practical purposes, generalization is a simple inversion of specialization. We will apply both processes, in combination, in the course of designing the E-R schema for an enterprise. In terms of the E-R diagram itself, we do not distinguish between specialization and generalization. New levels of entity representation will be distinguished (specialization) or synthesized (generalization) as the design schema comes to express fully the database application and the user requirements of the database. Differences in the two approaches may be characterized by their starting point and overall goal. Generalization proceeds from the recognition that a number of entity sets share some common features (namely, they are described by the same attributes and participate in the same relationship sets).

Aggregation: Aggregation is an abstraction in which relationship sets (along with their associated entity sets) are treated as higher-level entity sets, and can participate in relationships.

Symbols used in the E-R notation:


File Organization

Database needs to be stored on some computer storage medium. Two categories of storage are:

• Primary storage: Includes storage media that can be operated on directly by CPU. Fast access to data. Limited storage capacity.

• Secondary storage: Have large capacity, costs less, slower access to data. Can’t be processed directly by CPU.

A file is a sequence of records. If every record has exactly the same size, the file is said to be made of fixed-length records; if different records in the file have different sizes, the file is said to be made of variable-length records.

Records of a file must be allocated to disk blocks because a block is the unit of data transfer between disk and memory. When the block size is larger than the record size, each block will contain numerous records, although some records may not fit in a single block. Suppose the block size is B bytes. For a file of fixed-length records of size R bytes, with B >= R, the blocking factor for the file is bfr = ⌊B/R⌋ (rounding down to a whole number of records per block).

Unused space in each block is B – (bfr x R) bytes

• To utilize the unused space, we can store part of a record on one block and the rest on another. A pointer at the end of the first block points to the block containing the remainder of the record, in case it is not the next consecutive block on disk. This organisation is called spanned.

• If records are not allowed to cross block boundaries, the organisation is called unspanned. This is used for fixed-length records, having B > R because it makes each record start at a known location in the block.

• For variable-length records, either a spanned or unspanned organization can be used. To calculate the number of blocks b needed for a file of r records, b = ⌈r/bfr⌉ blocks (rounding up); see the worked example below.
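
Worked example (assumed figures): with a block size B = 512 bytes and fixed-length records of R = 100 bytes, bfr = ⌊512/100⌋ = 5 records per block, leaving 512 – (5 x 100) = 12 bytes unused per block; an unspanned file of r = 1000 such records then needs b = ⌈1000/5⌉ = 200 blocks.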

Techniques for allocating file blocks on disk:

• Contiguous allocation : File blocks are allocated to consecutive disk blocks. This makes the reading the file faster using double buffering, but file expansion is difficult.

• Linked allocation : Each file block contains a pointer to the next file block. Expansion is easy but reading of file is slower.

• Combination of consecutive disk blocks is called cluster. Clusters are called file segments or extents.

File header has information about a file that is needed by the system programs that access the file records. Gives address of file blocks, field length, order of fields.

To search for a record on disk, one or more blocks are copied into main memory buffers. Programs search for the desired record or records within the buffers, using the information in the file header. If the address of the block that contains the desired record is not known, the search programs must do a linear search through the file blocks.

Each file block is copied into a buffer and searched, either until the record is located or until all file blocks have been searched unsuccessfully.

Operations on files

• Include retrieval operations and update operations

• Retrieval operations locate records for processing and examining values.

• Update operations change the data in the file

• E.g. select, insert, delete, update

Can use equality or comparison operator

• Open: Prepare files for reading or writing.

• Reset: Set the file pointer of an open file to the beginning of the file.

• Find: Searches for the first record that satisfies a search condition.

• Read: Copies the current record from the buffer to a program variable in the user program.

• FindNext: Searches for the next record in the file.

• Delete: Delete current record

• Modify: Modify some field values for current record.

• Insert: Inserts a new record in the file by locating the block where it is to be inserted

• Close: Releases buffers and completes file access

• Findall : locate all records in file satisfying a condition

• Find n : Searches for the first record that satisfies a search condition and continues to locate the next n-1 records

• FindOrdered: Retrieves all records in some specified order.

• Reorganize: Reordering records

File organization: Organization of data of a file into records, blocks and access structures.

Access method: Group of operations that can be applied to a file.

Buffering of blocks.

When several blocks need to be transferred from the disk to the main memory and all the block addresses are known, several buffers can be reserved in main memory to speed up the transfer.

While a buffer is being read or written, the CPU can process data in the other buffer. Buffering is useful when processes run concurrently in a parallel fashion (i.e. when a separate I/O processor or multiple CPU processors exist).


Double buffering: The CPU starts processing a block once its transfer to main memory is complete; at the same time, the disk I/O processor can be reading and transferring the next block into a different buffer. This reduces waiting time.

INDEXES

An index is a data structure that organizes data records on disk to optimize certain kinds of retrieval operations. An index allows us to efficiently retrieve all records that satisfy search conditions on the search key fields of the index. We can build additional indexes on a given collection of data records, each with a different search key, to speed up search operations that are not efficiently supported by the file organization used to store the data records.

• E.g. store the records in a file organized as an index on employee age; this is an alternative to sorting the file by age.

When the ordering of data records is the same as, or close to, the ordering of data entries in some index, we say that the index is clustered; otherwise it is an unclustered index.

An index on a set of fields that includes the primary key is called a primary index; other indexes are called secondary indexes. Two data entries are said to be duplicates if they have the same value for the search key field associated with the index. A primary index is guaranteed not to contain duplicates, but a secondary index may contain them.
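
Secondary indexes are normally created explicitly in SQL; the CREATE INDEX statement is not part of the SQL standard, but its basic form is common to most products. A sketch for the employee-age example above:

CREATE INDEX emp_age_idx ON employee (age);
-- Queries such as SELECT ... FROM employee WHERE age = 30 can now use the index
-- instead of scanning the whole file.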

Clustered Field

If the records of a file are physically ordered on a non-key field that does not have a distinct value for each record, that field is called the clustering field of the file. The index created on it is a clustering index, which speeds retrieval of records. A clustering index is an ordered file with two fields: the first field is of the same type as the clustering field of the data file, and the second is a block pointer.

Ordered Indices

A file may have several indices, on different search keys. If the file containing the records is sequentially ordered, a primary index is an index whose search key also defines the sequential order of the file. Such indices are also called clustering indices. The search key of a primary index is usually the primary key.

Indices whose search key specifies an order different from the sequential order of the file are called secondary indices.

Primary index: All files are arranged sequentially on some search key. Files with a primary index on the search key are called index-sequential files. They are designed for applications that require both sequential processing and random processing.

Dense and Sparse index


• Index record or index entry consists of a search-key value, and pointers to one or more records with that value as their search-key value. Pointer to a record consists of the identifier of a disk block and an offset within the disk block to identify record within the block.

• Two types :

• Dense index and

• Sparse index.

Dense index: An index record appears for every search key value in the file.

Index record has the search key value and a pointer to the first data record with that search key value.

The rest of the records with the same search-key value are stored sequentially after the first record.

Some implementations store a list of pointers to all records with the same search-key value.

Sparse indices

An index record appears for only some of the search-key values. Each index record contains a search-key value and a pointer to the first data record with that search-key value. To locate a record, we find the index entry with the largest search-key value that is less than or equal to the search-key value we are looking for.

Start at the record pointed to by that index entry and follow the pointers in the file until we find the desired record.

A secondary index, in contrast, must be dense, with an index entry for every search-key value and a pointer to every record in the file.

A primary index may be sparse, storing only some of the search-key values, since it is always possible to find records with intermediate search-key values by a sequential access to a part of the file.

If a secondary index stored only some of the search-key values, records with intermediate search-key values could be anywhere in the file and, in general, we could not find them without searching the entire file.


MODULE 3: DATABASE DESIGN
Relational Data Model – Relational Algebra – ER Diagrams – Data Dictionary – Normalisation – Boyce Codd Normal Forms – Integrity – Relational Database Languages – Database Administration – File Structures and Indexing.

A relation is a table that holds the data we are interested in. It is two-dimensional and has rows and columns.

Each entity type in the ER model is mapped into a relation.

The attributes become the columns. The individual entities become the rows.

Figure : a relation

Relations can be represented textually as:

tablename(primary key, attribute 1, attribute 2, ... , foreign key)

If matric_no was the primary key, and there were no foreign keys, then the table above could be represented as:

student(matric_no, name, address, date_of_birth)

When referring to relations or tables, cardinality is considered to be the number of rows in the relation or table, and arity is the number of columns in a table or attributes in a relation.

Foreign keys

A foreign key is an attribute (or group of attributes) that is the primary key to another relation.

Roughly, each foreign key represents a relationship between two entity types. They are added to relations as we go through the mapping process. They allow the relations to be linked together. A relation can have several foreign keys.
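As a minimal SQL sketch of a foreign key (the course table and the course_code column are hypothetical additions for illustration; the student attributes follow the example above):

CREATE TABLE course (
    course_code VARCHAR(10) PRIMARY KEY,   -- primary key of the parent relation
    title       VARCHAR(60)
);

CREATE TABLE student (
    matric_no     INTEGER PRIMARY KEY,     -- primary key of this relation
    name          VARCHAR(50),
    address       VARCHAR(100),
    date_of_birth DATE,
    course_code   VARCHAR(10) REFERENCES course(course_code)   -- foreign key: primary key of another relation
);

Here course_code in student is a foreign key; it links each student row to a row in course, implementing a relationship between the two entity types.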


The Relational Data Model

The Relation

The Relation is the basic element in a relational data model.

Relations in the Relational Data Model

A relation is subject to the following rules:

1. Relation (file, table) is a two-dimensional table.

2. Attribute (i.e. field or data item) is a column in the table.

3. Each column in the table has a unique name within that table.

4. Each column is homogeneous. Thus the entries in any column are all of the same type (e.g. age, name, employee-number, etc).

5. Each column has a domain, the set of possible values that can appear in that column.

6. A Tuple (i.e. record) is a row in the table.

7. The order of the rows and columns is not important.

8. Values of a row all relate to some thing or portion of a thing.

9. Repeating groups (collections of logically related attributes that occur multiple times within one record occurrence) are not allowed.

10. Duplicate rows are not allowed (candidate keys are designed to prevent this).

11. Cells must be single-valued (but can be variable length). Single valued means the following:

o Cannot contain multiple values such as 'A1,B2,C3'.
o Cannot contain combined values such as 'ABC-XYZ' where 'ABC' means one thing and 'XYZ' another.

A relation may be expressed using the notation R(A,B,C, ...) where:

R = the name of the relation.

(A,B,C, ...) = the attributes within the relation.


A = the attribute(s) which form the primary key.

Keys

1. A simple key contains a single attribute.

2. A composite key is a key that contains more than one attribute.

3. A candidate key is an attribute (or set of attributes) that uniquely identifies a row. A candidate key must possess the following properties:

o Unique identification - For every row the value of the key must uniquely identify that row.

o Non redundancy - No attribute in the key can be discarded without destroying the property of unique identification.

4. A primary key is the candidate key which is selected as the principal unique identifier. Every relation must contain a primary key. The primary key is usually the key selected to identify a row when the database is physically implemented. For example, a part number is selected instead of a part description.

5. A superkey is any set of attributes that uniquely identifies a row. A superkey differs from a candidate key in that it does not require the non redundancy property.

6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non key attribute in one relation and as a primary key attribute in another relation. I say usually because it is possible for a foreign key to also be the whole or part of a primary key:

o A many-to-many relationship can only be implemented by introducing an intersection or link table which then becomes the child in two one-to-many relationships. The intersection table therefore has a foreign key for each of its parents, and its primary key is a composite of both foreign keys.

o A one-to-one relationship requires that the child table has no more than one occurrence for each parent, which can only be enforced by letting the foreign key also serve as the primary key.

7. A semantic or natural key is a key for which the possible values have an obvious meaning to the user or the data. For example, a semantic primary key for a COUNTRY entity might contain the value 'USA' for the occurrence describing the United States of America. The value 'USA' has meaning to the user.

8. A technical or surrogate or artificial key is a key for which the possible values have no obvious meaning to the user or the data. These are used instead of semantic keys for any of the following reasons:

o When the value in a semantic key is likely to be changed by the user, or can have duplicates. For example, on a PERSON table it is unwise to use PERSON_NAME as the key as it is possible to have more than one person with the same name, or the name may change such as through marriage.


o When none of the existing attributes can be used to guarantee uniqueness. In this case adding an attribute whose value is generated by the system, e.g from a sequence of numbers, is the only way to provide a unique value. Typical examples would be ORDER_ID and INVOICE_ID. The value '12345' has no meaning to the user as it conveys nothing about the entity to which it relates.

9. A key functionally determines the other attributes in the row, thus it is always a determinant.

10. Note that the term 'key' in most DBMS engines is implemented as an index which does not allow duplicate entries.

Relationships

One table (relation) may be linked with another in what is known as a relationship. Relationships may be built into the database structure to facilitate the operation of relational joins at runtime.

1. A relationship is between two tables in what is known as a one-to-many or parent-child or master-detail relationship where an occurrence on the 'one' or 'parent' or 'master' table may have any number of associated occurrences on the 'many' or 'child' or 'detail' table. To achieve this the child table must contain fields which link back to the primary key on the parent table. These fields on the child table are known as a foreign key, and the parent table is referred to as the foreign table (from the viewpoint of the child).

2. It is possible for a record on the parent table to exist without corresponding records on the child table, but it should not be possible for an entry on the child table to exist without a corresponding entry on the parent table.

3. A child record without a corresponding parent record is known as an orphan.

4. It is possible for a table to be related to itself. For this to be possible it needs a foreign key which points back to the primary key. Note that these two keys cannot be comprised of exactly the same fields otherwise the record could only ever point to itself.

5. A table may be the subject of any number of relationships, and it may be the parent in some and the child in others.

6. Some database engines allow a parent table to be linked via a candidate key, but if this were changed it could result in the link to the child table being broken.

7. Some database engines allow relationships to be managed by rules known as referential integrity or foreign key constraints. These will prevent entries on child tables from being created if the foreign key does not exist on the parent table, or will deal with entries on child tables when the entry on the parent table is updated or deleted.

Relational Joins

The join operator is used to combine data from two or more relations (tables) in order to satisfy a particular query. Two relations may be joined when they share at least one common attribute. The join is implemented by considering each row in an instance of each relation. A row in relation R1 is joined to a row in relation R2 when the value of the common attribute(s) is equal in the two relations. The join of two relations is often called a binary join.

The join of two relations creates a new relation. The notation R1 x R2 indicates the join of relations R1 and R2. For example, consider the following:

Relation R1

A B C

1 5 3

2 4 5

8 3 5

9 3 3

1 6 5

5 4 3

2 7 5

Relation R2

B D E

4 7 4

6 2 3

5 7 8

7 2 3

3 2 2

Note that the instances of relations R1 and R2 contain the same data values for attribute B. Data normalisation is concerned with decomposing a relation (e.g. R(A,B,C,D,E)) into smaller relations (e.g. R1 and R2). The data values for attribute B in this context will be identical in R1 and R2. The instances of R1 and R2 are projections of the instances of R(A,B,C,D,E) onto the attributes (A,B,C) and (B,D,E) respectively. A projection will not eliminate data values - duplicate rows are removed, but this will not remove a data value from any attribute.

The join of relations R1 and R2 is possible because B is a common attribute. The result of the join is:

Relation R1 x R2

A B C D E

1 5 3 7 8

2 4 5 7 4

8 3 5 2 2

9 3 3 2 2

1 6 5 2 3

5 4 3 7 4

2 7 5 2 3

The row (2 4 5 7 4) was formed by joining the row (2 4 5) from relation R1 to the row (4 7 4) from relation R2. The two rows were joined since each contained the same value for the common attribute B. The row (2 4 5) was not joined to the row (6 2 3) since the values of the common attribute (4 and 6) are not the same.
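Expressed in SQL, the join of R1 and R2 on the common attribute B could be sketched as follows (generic table and column names as in the example above):

-- Pair each row of R1 with every row of R2 that has the same value for B.
SELECT R1.A, R1.B, R1.C, R2.D, R2.E
FROM R1
JOIN R2 ON R1.B = R2.B;

Rows of R1 whose B value does not occur in R2 (and vice versa) simply contribute nothing to the result, exactly as in the worked result above.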

The relations joined in the preceding example shared exactly one common attribute. However, relations may share multiple common attributes. All of these common attributes must be used in creating a join. For example, the instances of relations R1 and R2 in the following example are joined using the common attributes B and C:

Before the join:

Relation R1

A B C

6 1 4

8 1 4

5 1 2


2 7 1

 

Relation R2

B C D

1 4 9

1 4 2

1 2 1

7 1 2

7 1 3

After the join:

Relation R1 x R2

A B C D

6 1 4 9

6 1 4 2

8 1 4 9

8 1 4 2

5 1 2 1

2 7 1 2

2 7 1 3

The row (6 1 4 9) was formed by joining the row (6 1 4) from relation R1 to the row (1 4 9) from relation R2. The join was created since the common set of attributes (B and C) contained identical values (1 and 4). The row (6 1 4) from R1 was not joined to the row (1 2 1) from R2 since the common attributes did not share identical values - (1 4) in R1 and (1 2) in R2.


The join operation provides a method for reconstructing a relation that was decomposed into two relations during the normalisation process. The join of two rows, however, can create a new row that was not a member of the original relation. Thus invalid information can be created during the join process.

Lossless Joins

A set of relations satisfies the lossless join property if the instances can be joined without creating invalid data (i.e. new rows). The term lossless join may be somewhat confusing. A join that is not lossless will contain extra, invalid rows. A join that is lossless will not contain extra, invalid rows. Thus the term gainless join might be more appropriate.

To give an example of incorrect information created by an invalid join let us take the following data structure:

R(student, course, instructor, hour, room, grade)

Assuming that only one section of a class is offered during a semester we can define the following functional dependencies:

1. (HOUR, ROOM) -> COURSE

2. (COURSE, STUDENT) -> GRADE

3. (INSTRUCTOR, HOUR) -> ROOM

4. (COURSE) -> INSTRUCTOR

5. (HOUR, STUDENT) -> ROOM

Take the following sample data:

STUDENT COURSE INSTRUCTOR HOUR ROOM GRADE

Smith Math 1 Jenkins 8:00 100 A

Jones English Goldman 8:00 200 B

Brown English Goldman 8:00 200 C

Green Algebra Jenkins 9:00 400 A

The following four relations, each in 4th normal form, can be generated from the given and implied dependencies:

R1(STUDENT, HOUR, COURSE)

R2(STUDENT, COURSE, GRADE)

R3(COURSE, INSTRUCTOR)


R4(INSTRUCTOR, HOUR, ROOM)

Note that the dependencies (HOUR, ROOM) -> COURSE and (HOUR, STUDENT) -> ROOM are not explicitly represented in the preceding decomposition. The goal is to develop relations in 4th normal form that can be joined to answer any ad hoc inquiries correctly. This goal can be achieved without representing every functional dependency as a relation. Furthermore, several sets of relations may satisfy the goal.

The preceding sets of relations can be populated as follows:

R1

STUDENT HOUR COURSE

Smith 8:00 Math 1

Jones 8:00 English

Brown 8:00 English

Green 9:00 Algebra

R2

STUDENT COURSE GRADE

Smith Math 1 A

Jones English B

Brown English C

Green Algebra A

R3

COURSE INSTRUCTOR

Math 1 Jenkins

English Goldman

Algebra Jenkins

R4

INSTRUCTOR HOUR ROOM


Jenkins 8:00 100

Goldman 8:00 200

Jenkins 9:00 400

Now suppose that a list of courses with their corresponding room numbers is required. Relations R1 and R4 contain the necessary information and can be joined using the attribute HOUR.

The result of this join is:

R1 x R4

STUDENT COURSE INSTRUCTOR HOUR ROOM

Smith Math 1 Jenkins 8:00 100

Smith Math 1 Goldman 8:00 200

Jones English Jenkins 8:00 100

Jones English Goldman 8:00 200

Brown English Jenkins 8:00 100

Brown English Goldman 8:00 200

Green Algebra Jenkins 9:00 400

This join creates the following invalid information:

Smith, Jones, and Brown take the same class at the same time from two different instructors in two different rooms.

Jenkins (the Maths teacher) teaches English.

Goldman (the English teacher) teaches Maths.

Both instructors teach different courses at the same time.

Another possibility for a join is R3 and R4 (joined on INSTRUCTOR). The result would be:

R3 x R4


COURSE INSTRUCTOR HOUR ROOM

Math 1 Jenkins 8:00 100

Math 1 Jenkins 9:00 400

English Goldman 8:00 200

Algebra Jenkins 8:00 100

Algebra Jenkins 9:00 400

This join creates the following invalid information:

Jenkins teaches Math 1 and Algebra simultaneously at both 8:00 and 9:00.

A correct sequence is to join R1 and R3 (using COURSE) and then join the resulting relation with R4 (using both INSTRUCTOR and HOUR). The result would be:

R1 x R3

STUDENT COURSE INSTRUCTOR HOUR

Smith Math 1 Jenkins 8:00

Jones English Goldman 8:00

Brown English Goldman 8:00

Green Algebra Jenkins 9:00

(R1 x R3) x R4

STUDENT COURSE INSTRUCTOR HOUR ROOM

Smith Math 1 Jenkins 8:00 100

Jones English Goldman 8:00 200

Brown English Goldman 8:00 200

Green Algebra Jenkins 9:00 400

Extracting the COURSE and ROOM attributes (and eliminating the duplicate row produced for the English course) would yield the desired result:

COURSE ROOM

Math 1 100


English 200

Algebra 400

The correct result is obtained since the sequence (R1 x R3) x R4 satisfies the lossless (gainless?) join property.
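The same correct sequence can be sketched in SQL (table and column names as in the example; DISTINCT removes the duplicate row produced for the English course):

-- Join R1 to R3 on COURSE, then the result to R4 on INSTRUCTOR and HOUR,
-- and finally project COURSE and ROOM.
SELECT DISTINCT R1.COURSE, R4.ROOM
FROM R1
JOIN R3 ON R1.COURSE = R3.COURSE
JOIN R4 ON R3.INSTRUCTOR = R4.INSTRUCTOR AND R1.HOUR = R4.HOUR;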

A relational database is in 4th normal form when the lossless join property can be used to answer unanticipated queries. However, the choice of joins must be evaluated carefully. Many different sequences of joins will recreate an instance of a relation. Some sequences are more desirable since they result in the creation of less invalid data during the join operation.

Suppose that a relation is decomposed using functional dependencies and multi-valued dependencies. Then at least one sequence of joins on the resulting relations exists that recreates the original instance with no invalid data created during any of the join operations.

For example, suppose that a list of grades by room number is desired. This question, which was probably not anticipated during database design, can be answered without creating invalid data by either of the following two join sequences:

R1 x R3

(R1 x R3) x R2

((R1 x R3) x R2) x R4

or

 

R1 x R3

(R1 x R3) x R4

((R1 x R3) x R4) x R2

The required information is contained within relations R2 and R4, but these relations cannot be joined directly. In this case the solution requires joining all 4 relations.

The database may require a 'lossless join' relation, which is constructed to assure that any ad hoc inquiry can be answered with relational operators. This relation may contain attributes that are not logically related to each other. This occurs because the relation must serve as a bridge between the other relations in the database. For example, the lossless join relation will contain all attributes that appear only on the left side of a functional dependency. Other attributes may also be required, however, in developing the lossless join relation.


Consider relational schema R(A, B, C, D) with A -> B and C -> D. Relations R1(A, B) and R2(C, D) are in 4th normal form. A third relation R3(A, C), however, is required to satisfy the lossless join property. This relation can be used to join attributes B and D. This is accomplished by joining relations R1 and R3 and then joining the result to relation R2. No invalid data is created during these joins. The relation R3(A, C) is the lossless join relation for this database design.
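A SQL sketch of this join sequence, using the generic names from the example, might be:

-- R1(A,B), R2(C,D) and the lossless join relation R3(A,C):
-- join R1 to R3 on A, then the result to R2 on C, bringing B and D together.
SELECT R1.A, R1.B, R2.C, R2.D
FROM R1
JOIN R3 ON R1.A = R3.A
JOIN R2 ON R3.C = R2.C;

Because R3 records which A values are associated with which C values, no invalid rows are created by these joins.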

A relation is usually developed by combining attributes about a particular subject or entity. The lossless join relation, however, is developed to represent a relationship among various relations. The lossless join relation may be difficult to populate initially and difficult to maintain - a result of including attributes that are not logically associated with each other.

The attributes within a lossless join relation often contain multi-valued dependencies. Consideration of 4th normal form is important in this situation. The lossless join relation can sometimes be decomposed into smaller relations by eliminating the multi-valued dependencies. These smaller relations are easier to populate and maintain.

Determinant and Dependent

The terms determinant and dependent can be described as follows:

1. The expression X -> Y means 'if I know the value of X, then I can obtain the value of Y' (in a table or somewhere).

2. In the expression X -> Y, X is the determinant and Y is the dependent attribute.

3. The value X determines the value of Y.

4. The value Y depends on the value of X.

Functional Dependencies (FD)

A functional dependency can be described as follows:

1. An attribute is functionally dependent if its value is determined by another attribute which is a key.

2. That is, if we know the value of one (or several) data items, then we can find the value of another (or several).

3. Functional dependencies are expressed as X -> Y, where X is the determinant and Y is the functionally dependent attribute.

4. If A -> (B,C) then A -> B and A -> C.

5. If (A,B) -> C, then it is not necessarily true that A -> C and B -> C.

6. If A -> B and B -> A, then A and B are in a 1-1 relationship.

7. If A -> B then for A there can only ever be one value for B.


Transitive Dependencies (TD)

A transitive dependency can be described as follows:

1. An attribute is transitively dependent if its value is determined by another attribute which is not a key.

2. If X -> Y and X is not a key then this is a transitive dependency.

3. A transitive dependency exists when A -> B -> C but NOT A -> C.

Multi-Valued Dependencies (MVD)

A multi-valued dependency can be described as follows:

1. A table involves a multi-valued dependency if it may contain multiple values for an entity.

2. A multi-valued dependency may arise as a result of enforcing 1st normal form.

3. X ->> Y, i.e. X multi-determines Y, when for each value of X we can have more than one value of Y.

4. If A ->> B and A ->> C then we have a single attribute A which multi-determines two other independent attributes, B and C.

5. If A ->> (B,C) then we have an attribute A which multi-determines a set of associated attributes, B and C.

Join Dependencies (JD)

A join dependency can be described as follows:

1. If a table can be decomposed into three or more smaller tables, it must be capable of being joined again on common keys to form the original table.

Modification Anomalies

A major objective of data normalisation is to avoid modification anomalies. These come in two flavours:

1. An insertion anomaly is a failure to place information about a new database entry into all the places in the database where information about that new entry needs to be stored. In a properly normalized database, information about a new entry needs to be inserted into only one place in the database. In an inadequately normalized database, information about a new entry may need to be inserted into more than one place, and, human fallibility being what it is, some of the needed additional insertions may be missed.

2. A deletion anomaly is a failure to remove information about an existing database entry when it is time to remove that entry. In a properly normalized database, information about an old, to-be-gotten-rid-of entry needs to be deleted from only one place in the database. In an inadequately normalized database, information about that old entry may need to be deleted from more than one place, and, human fallibility being what it is, some of the needed additional deletions may be missed.

An update of a database involves modifications that may be additions, deletions, or both. Thus 'update anomalies' can be either of the kinds of anomalies discussed above.

All three kinds of anomalies are highly undesirable, since their occurrence constitutes corruption of the database. Properly normalised databases are much less susceptible to corruption than are unnormalised databases.

Types of Relational Join

A JOIN is a method of creating a result set that combines rows from two or more tables (relations). When comparing the contents of two tables the following conditions may occur:

Every row in one relation has a match in the other relation.

Relation R1 contains rows that have no match in relation R2.

Relation R2 contains rows that have no match in relation R1.

INNER joins contain only matches. OUTER joins may contain mismatches as well.

Inner Join

This is sometimes known as a simple join. It returns all rows from both tables where there is a match. If there are rows in R1 which do not have matches in R2, those rows will not be listed. There are two possible ways of specifying this type of join:

SELECT * FROM R1, R2 WHERE R1.r1_field = R2.r2_field;
SELECT * FROM R1 INNER JOIN R2 ON R1.r1_field = R2.r2_field;

If the fields to be matched have the same names in both tables then the ON condition, as in:

ON R1.fieldname = R2.fieldname
ON (R1.field1 = R2.field1 AND R1.field2 = R2.field2)

can be replaced by the shorter USING condition, as in:

USING fieldname
USING (field1, field2)


Natural Join

A natural join is based on all columns in the two tables that have the same name. It is semantically equivalent to an INNER JOIN or a LEFT JOIN with a USING clause that names all columns that exist in both tables.

SELECT * FROM R1 NATURAL JOIN R2

The alternative is a keyed join which includes an ON or USING condition.

Left [Outer] Join

Returns all the rows from R1 even if there are no matches in R2. If there are no matches in R2 then the R2 values will be shown as null.

SELECT * FROM R1 LEFT [OUTER] JOIN R2 ON R1.field = R2.field

Right [Outer] Join

Returns all the rows from R2 even if there are no matches in R1. If there are no matches in R1 then the R1 values will be shown as null.

SELECT * FROM R1 RIGHT [OUTER] JOIN R2 ON R1.field = R2.field

Full [Outer] Join

Returns all the rows from both tables even if there are no matches in one of the tables. If there are no matches in one of the tables then its values will be shown as null.

SELECT * FROM R1 FULL [OUTER] JOIN R2 ON R1.field = R2.field

Self Join

This joins a table to itself. This table appears twice in the FROM clause and is followed by table aliases that qualify column names in the join condition.

SELECT a.field1, b.field2 FROM R1 a, R1 b WHERE a.field = b.field

Cross Join

This type of join is rarely used as it does not have a join condition, so every row of R1 is joined to every row of R2. For example, if both tables contain 100 rows the result will be 10,000 rows. This is sometimes known as a cartesian product and can be specified in either one of the following ways:

SELECT * FROM R1 CROSS JOIN R2
SELECT * FROM R1, R2


Entities

An entity is any object in the system that we want to model and store information about.

Individual objects are called entities. Groups of the same type of objects are called entity types or entity sets. Entities are represented by rectangles (either with round or square corners).

There are two types of entities; weak and strong entity types.

Attribute

All the data relating to an entity is held in its attributes. An attribute is a property of an entity. Each attribute can have any value from its domain. Each entity within an entity type:

o May have any number of attributes.
o Can have different attribute values than those in any other entity.
o Has the same number of attributes as every other entity in its entity type.

Attributes can be simple or composite, single-valued or multi-valued. Attributes can be shown on ER models; they appear inside ovals and are attached to their entity. Note that entity types can have a large number of attributes; if all were shown the diagrams would be confusing, so only show an attribute if it adds information to the ER diagram or clarifies a point.

Figure : Attributes

Keys

A key is a data item that allows us to uniquely identify individual occurrences of an entity type.

A candidate key is an attribute or set of attributes that uniquely identifies individual occurrences of an entity type.


An entity type may have one or more possible candidate keys, the one which is selected is known as the primary key.

A composite key is a candidate key that consists of two or more attributes. The name of each primary key attribute is underlined.

Relationships

A relationship type is a meaningful association between entity types. A relationship is an association of entities where the association includes one entity from each participating entity type. Relationship types are represented on the ER diagram by a series of lines. As always, there are many notations in use today. In the original Chen notation, the relationship is placed inside a diamond, e.g. managers manage employees:

Figure : Chen's notation for relationships

For this module, we will use an alternative notation, where the relationship is a label on the line. The meaning is identical

Figure : Relationships used in this document

Degree of a Relationship

The number of participating entities in a relationship is known as the degree of the relationship.

If there are two entity types involved it is a binary relationship type

Figure : Binary Relationships

If there are three entity types involved it is a ternary relationship type

Figure : Ternary relationship

It is possible to have an n-ary relationship (e.g. quaternary or unary). Unary relationships are also known as recursive relationships.


It is a relationship where the same entity participates more than once in different roles.

In the example above we are saying that employees are managed by employees.

If we wanted more information about who manages whom, we could introduce a second entity type called manager.

Degree of a Relationship

It is also possible to have entities associated through two or more distinct relationships.

Figure : Multiple relationships

In the representation we use it is not possible to have attributes as part of a relationship. To support this other entity types need to be developed.

Replacing ternary relationships

When ternary relationships occur in an ER model they should always be removed before finishing the model. Sometimes the relationships can be replaced by a series of binary relationships that link pairs of the original ternary relationship.

Figure : A ternary relationship example

This can result in the loss of some information - It is no longer clear which sales assistant sold a customer a particular product.

Try replacing the ternary relationship with an entity type and a set of binary relationships.

Relationships are usually verbs, so name the new entity type by the relationship verb rewritten as a noun.


The relationship sells can become the entity type sale.

Figure : Replacing a ternary relationship

So a sales assistant can be linked to a specific customer and both of them to the sale of a particular product.

This process also works for higher order relationships.

Cardinality

Relationships are rarely one-to-one. For example, a manager usually manages more than one employee. This is described by the cardinality of the relationship, for which there are four possible categories:

One to one (1:1) relationship
One to many (1:m) relationship
Many to one (m:1) relationship
Many to many (m:n) relationship

On an ER diagram, if the end of a relationship is straight, it represents 1, while a "crow's foot" end represents many.

A one to one relationship - a man can only marry one woman, and a woman can only marry one man, so it is a one to one (1:1) relationship.

Figure : One to One relationship example

A one to many relationship - one manager manages many employees, but each employee has only one manager, so it is a one to many (1:m) relationship.

Figure : One to Many relationship example

A many to one relationship - many students study one course. They do not study more than one course, so it is a many to one (m:1) relationship

Figure : Many to One relationship example


A many to many relationship - One lecturer teaches many students and a student is taught by many lecturers, so it is a many to many (m:n) relationship

Figure : Many to Many relationship example

Optionality

A relationship can be optional or mandatory.

If the relationship is mandatory, an entity at one end of the relationship must be related to an entity at the other end. The optionality can be different at each end of the relationship.

For example, a student must be on a course. This is mandatory, so the relationship `student studies course' is mandatory. But a course can exist before any students have enrolled, thus the relationship `course is_studied_by student' is optional.

To show optionality, put a circle or `O' at the `optional end' of the relationship. As the optional relationship is `course is_studied_by student', and the optional part of this is the student, the `O' goes at the student end of the relationship connection.

Figure : Optionality example

It is important to know the optionality because you must ensure that whenever you create a new entity it has the required mandatory links.

Entity Sets

Sometimes it is useful to try out various examples of entities from an ER model. One reason for this is to confirm the correct cardinality and optionality of a relationship. We use an `entity set diagram' to show entity examples graphically. Consider the example of `course is_studied_by student'.


Figure : Entity set example

Confirming Correctness

Figure : Entity set confirming errors

Use the diagram to show all possible relationship scenarios. Go back to the requirements specification and check to see if they are allowed. If not, then put a cross through the forbidden relationships. This allows you to show the cardinality and optionality of the relationship.

Deriving the relationship parameters

To check we have the correct parameters (sometimes also known as the degree) of a relationship, ask two questions:

1. One course is studied by how many students? Answer = `zero or more'.
o This gives us the degree at the `student' end.
o The answer `zero or more' needs to be split into two parts.
o The `more' part means that the cardinality is `many'.
o The `zero' part means that the relationship is `optional'.
o If the answer was `one or more', then the relationship would be `mandatory'.

2. One student studies how many courses? Answer = `one'.
o This gives us the degree at the `course' end of the relationship.
o The answer `one' means that the cardinality of this relationship is 1, and is `mandatory'.
o If the answer had been `zero or one', then the cardinality of the relationship would have been 1, and be `optional'.

Redundant relationships

Some ER diagrams end up with a relationship loop.

Check to see if it is possible to break the loop without losing information. Given three entities A, B, C, where there are relations A-B, B-C, and C-A, check if it is possible to navigate between A and C via B. If it is possible, then A-C was a redundant relationship.


Always check carefully for ways to simplify your ER diagram. It makes it easier to read the remaining information.

Redundant relationships example

Consider entities `customer' (customer details), `address' (the address of a customer) and `distance' (distance from the company to the customer address).

Figure : Redundant relationship

Splitting n:m Relationships

A many to many relationship in an ER model is not necessarily incorrect. They can be replaced using an intermediate entity. This should only be done where:

o the m:n relationship hides an entity
o the resulting ER diagram is easier to understand.

Splitting n:m Relationships - Example

Consider the case of a car hire company. Customers hire cars; one customer hires many cars, and a car is hired by many customers.

Figure : Many to Many example

The many to many relationship can be broken down to reveal a `hire' entity, which contains an attribute `date of hire'.

Figure : Splitting the Many to Many example
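A minimal SQL sketch of the split (the column names, such as customer_no and registration_no, are assumptions made for this example):

CREATE TABLE customer (
    customer_no INTEGER PRIMARY KEY,
    name        VARCHAR(50)
);

CREATE TABLE car (
    registration_no VARCHAR(10) PRIMARY KEY,
    model           VARCHAR(30)
);

-- The intermediate 'hire' entity: a child of both customer and car,
-- carrying the date_of_hire attribute and a composite primary key.
CREATE TABLE hire (
    customer_no     INTEGER REFERENCES customer(customer_no),
    registration_no VARCHAR(10) REFERENCES car(registration_no),
    date_of_hire    DATE,
    PRIMARY KEY (customer_no, registration_no, date_of_hire)
);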

Constructing an ER model

Before beginning to draw the ER model, read the requirements specification carefully. Document any assumptions you need to make.

1. Identify entities - list all potential entity types. These are the objects of interest in the system. It is better to put too many entities in at this stage and then discard them later if necessary.


2. Remove duplicate entities - ensure that they really are separate entity types and not just two names for the same thing.

o Also do not include the system as an entity type.
o e.g. if modelling a library, the entity types might be books, borrowers, etc.
o The library is the system, and thus should not be an entity type.

3. List the attributes of each entity (all properties to describe the entity which are relevant to the application).

o Ensure that the entity types are really needed.
o Are any of them just attributes of another entity type?
o If so, keep them as attributes and cross them off the entity list.
o Do not have attributes of one entity as attributes of another entity!

4. Mark the primary keys.
o Which attributes uniquely identify instances of that entity type?
o This may not be possible for some weak entities.

5. Define the relationships.
o Examine each entity type to see its relationship to the others.

6. Describe the cardinality and optionality of the relationships.
o Examine the constraints between participating entities.

7. Remove redundant relationships.
o Examine the ER model for redundant relationships.

ER modelling is an iterative process, so draw several versions, refining each one until you are happy with it.

Normalisation is the process of taking data from a problem and reducing it to a set of relations while ensuring data integrity and eliminating data redundancy

Data integrity - all of the data in the database are consistent, and satisfy all integrity constraints.

Data redundancy – if data in the database can be found in two different locations (direct redundancy) or if data can be calculated from other data items (indirect redundancy) then the data is said to contain redundancy.

Data should only be stored once and avoid storing data that can be calculated from other data already held in the database. During the process of normalisation redundancy must be removed, but not at the expense of breaking data integrity rules.

If redundancy exists in the database then problems can arise when the database is in normal operation:

When data is inserted the data must be duplicated correctly in all places where there is redundancy. For instance, if two tables in a database both contain the employee name, then creating a new employee entry requires that both tables be updated with the employee name.


When data is modified in the database, if the data being changed has redundancy, then all versions of the redundant data must be updated simultaneously. So in the employee example a change to the employee name must happen in both tables simultaneously.

The removal of redundancy helps to prevent insertion, deletion, and update errors, since the data is only available in one attribute of one table in the database.

The data in the database can be considered to be in one of a number of `normal forms'. Basically the normal form of the data indicates how much redundancy is in that data. The normal forms have a strict ordering:

1. 1st Normal Form
2. 2nd Normal Form
3. 3rd Normal Form
4. BCNF

There are other normal forms, such as 4th and 5th normal forms. They are rarely utilised in system design and are not considered further here.

To be in a particular form requires that the data meets the criteria to also be in all normal forms before that form. Thus to be in 2nd normal form the data must meet the criteria for both 2nd normal form and 1st normal form. The higher the form the more redundancy has been eliminated.

Integrity Constraints

An integrity constraint is a rule that restricts the values that may be present in the database. The relational data model includes constraints that are used to verify the validity of the data as well as adding meaningful structure to it:

entity integrity :

The rows (or tuples) in a relation represent entities, and each one must be uniquely identified. Hence we have the primary key that must have a unique non-null value for each row.

referential integrity :

This constraint involves the foreign keys. Foreign keys tie the relations together, so it is vitally important that the links are correct. Every foreign key must either be null or its value must be the actual value of a key in another relation.
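Both constraints can be declared directly in SQL. The sketch below uses illustrative table and column names; the primary keys enforce entity integrity and the FOREIGN KEY clause enforces referential integrity:

CREATE TABLE department (
    dept_no   INTEGER PRIMARY KEY,    -- entity integrity: a unique, non-null identifier for each row
    dept_name VARCHAR(40)
);

CREATE TABLE employee (
    emp_no  INTEGER PRIMARY KEY,
    name    VARCHAR(50),
    dept_no INTEGER,                  -- referential integrity: must be NULL or match an existing department
    FOREIGN KEY (dept_no) REFERENCES department(dept_no)
);

With these declarations the DBMS itself rejects any row that would break the rules, rather than relying on application code to check them.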

Understanding Data

Sometimes the starting point for understanding data is given in the form of relations and functional dependencies. This would be the case where the starting point in the process was a detailed specification of the problem. We already know what relations are. Functional dependencies are rules stating that a certain set of attributes (the determinant) determines a second set of attributes.

The definition of a functional dependency looks like A->B. In this case B is a single attribute but it can be as many attributes as required (for instance, X->J,K,L,M). In the functional dependency, the determinant (the left hand side of the -> sign) can determine the set of attributes on the right hand side of the -> sign. This basically means that A selects a particular value for B, and that A is unique. In the second example X is unique and selects a particular set of values for J,K,L, and M. It can also be said that B is functionally dependent on A. In addition, a particular value of A ALWAYS gives you a particular value for B, but not vice-versa.

Consider this example:

R(matric_no, firstname, surname, tutor_number, tutor_name)

tutor_number -> tutor_name

Here there is a relation R, and a functional dependency that indicates that:

instances of tutor_number are unique in the data; from the data, given a tutor_number, it is always possible to work out the tutor_name.

As an example, tutor number 1 may be “Mr Smith”, but tutor number 10 may also be “Mr Smith”. Given a tutor number of 1, this is ALWAYS “Mr Smith”. However, given the name “Mr Smith” it is not possible to work out if we are talking about tutor 1 or tutor 10.

There is actually a second functional dependency for this relation, which can be worked out from the relation itself. As the relation has a primary key, then given this attribute you can determine all the other attributes in R. This is an implied functional dependency and is not normally listed in the list of functional dependencies.
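As a sketch of how this dependency could be handled in SQL (the table split shown here is an illustration, anticipating the normalisation process described later), the tutor attributes can be placed in their own relation and referenced from the student relation:

CREATE TABLE tutor (
    tutor_number INTEGER PRIMARY KEY,   -- the determinant
    tutor_name   VARCHAR(50)            -- functionally dependent on tutor_number
);

CREATE TABLE student (
    matric_no    INTEGER PRIMARY KEY,
    firstname    VARCHAR(30),
    surname      VARCHAR(30),
    tutor_number INTEGER REFERENCES tutor(tutor_number)
);

Each tutor_name is now stored once against its tutor_number, rather than being repeated for every student who has that tutor.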

Extracting understanding

It is possible that the relations and the determinants have not yet been defined for a problem, and therefore must be calculated from examples of the data. Consider the following Student table.

Student - an unnormalised table with repeating groups

matric_no  name      date_of_birth  subject    grade
960100     Smith, J  14/11/1977     Databases  C
                                    Soft_Dev   A
                                    ISDE       D
960105     White, A  10/05/1975     Soft_Dev   B
                                    ISDE       B
960120     Moore, T  11/03/1970     Databases  A
                                    Soft_Dev   B
                                    Workshop   C
960145     Smith, J  09/01/1972     Databases  B
960150     Black, D  21/08/1973     Databases  B
                                    Soft_Dev   D
                                    ISDE       C
                                    Workshop   D

The subject/grade pair is repeated for each student. 960145 has 1 pair while 960150 has four. Repeating groups are placed inside another set of parentheses. From the table the following relation is generated:

 Student(matric_no, name, date_of_birth, ( subject, grade ) )

The repeating group needs a key in order that the relation can be correctly defined. Looking at the data one can see that grade repeats within matric_no (for instance, for 960150 the student has two D grades). However, subject never seems to repeat for a single matric_no, and is therefore a candidate key in the repeating group.

Whenever keys or dependencies are extracted from example data, the information extracted is only as good as the data sample examined. It could be that another data sample disproves some of the key selections made or dependencies extracted. What is important, however, is that the information extracted during these exercises is correct for the data being examined.

Looking at the data itself, we can see that the same name appears more than once in the name column. The name in conjunction with the date_of_birth seems to be unique, suggesting a functional dependency of:

 name, date_of_birth -> matric_no

This implies that not only is the matric_no sufficient to uniquely identify a student, the student’s name combined with the date of birth is also sufficient to uniquely identify a student. It is therefore possible to have the relation Student written as:

 Student(matric_no, name, date_of_birth, ( subject, grade ) )

As guidance in cases where a variety of keys could be selected one should try to select the relation with the least number of attributes defined as primary keys.

Flattened Tables

Note that the student table shown above explicitly identifies the repeating group. It is also possible that the table presented will be what is called a flat table, where the repeating group is not explicitly shown:


Student #2 - Flattened Table

matric_no name date_of_birth subject grade

960100 Smith, J 14/11/1977 Databases C

960100 Smith, J 14/11/1977 Soft_Dev A

960100 Smith, J 14/11/1977 ISDE D

960105 White, A 10/05/1975 Soft_Dev B

960105 White, A 10/05/1975 ISDE B

960120 Moore, T 11/03/1970 Databases A

960120 Moore, T 11/03/1970 Soft_Dev B

960120 Moore, T 11/03/1970 Workshop C

960145 Smith, J 09/01/1972 Databases B

960150 Black, D 21/08/1973 Databases B

960150 Black, D 21/08/1973 Soft_Dev D

960150 Black, D 21/08/1973 ISDE C

960150 Black, D 21/08/1973 Workshop B

The table still shows the same data as the previous example, but the format is different. We have removed the repeating group (which is good) but we have introduced redundancy (which is bad).

Sometimes you will miss spotting the repeating group, so you may produce something like the following relation for the Student data.

Student(matric_no, name, date_of_birth, subject, grade)

matric_no -> name, date_of_birth
name, date_of_birth -> matric_no

This data does not explicitly identify the repeating group, but as you will see the result of the normalisation process on this relation produces exactly the same relations as the normalisation of the version that explicitly does have a repeating group.

First Normal Form

First normal form (1NF) deals with the `shape' of the record type. A relation is in 1NF if, and only if, it contains no repeating attributes or groups of attributes.

Example: the Student table with the repeating group is not in 1NF. It has repeating groups, and it is called an `unnormalised table'.

Relational databases require that each row only has a single value per attribute, and so a repeating group in a row is not allowed.


To remove the repeating group, one of two things can be done:

either flatten the table and extend the key, or decompose the relation - leading to First Normal Form.

Flatten table and Extend Primary Key

The Student table with the repeating group can be written as:

Student(matric_no, name, date_of_birth, ( subject, grade ) )

If the repeating group was flattened, as in the Student #2 data table, it would look something like:

Student(matric_no, name, date_of_birth, subject, grade )

Although this is an improvement, we still have a problem. matric_no can no longer be the primary key - it does not have a unique value for each row. So we have to find a new primary key - in this case it has to be a compound key since no single attribute can uniquely identify a row. The new primary key is a compound key (matric_no + subject).

We have now solved the repeating groups problem, but we have created other complications. Every repetition of the matric_no, name, and date_of_birth is redundant and liable to produce errors.

With the relation in its flattened form, strange anomalies appear in the system. Redundant data is the main cause of insertion, deletion, and updating anomalies.

Insertion anomaly:

With the primary key including subject, we cannot enter a new student until they have at least one subject to study. We are not allowed NULLs in the primary key so we must have an entry in both matric_no and subject before we can create a new record.

This is known as the insertion anomaly. It is difficult to insert new records into the database.

On a practical level, it also means that it is difficult to keep the data up to date.

Update anomaly

If the name of a student were changed, for example Smith, J. was changed to Green, J., this would require not one change but many - one for every subject that Smith, J. studied.


Deletion anomaly

If all of the records for the `Databases' subject were deleted from the table, we would inadvertently lose all of the information on the student with matric_no 960145. This would be the same for any student who was studying only one subject and the subject was deleted. Again this problem arises from the need to have a compound primary key.

Decomposing the relation

The alternative approach is to split the table into two parts, one for the repeating groups and one for the non-repeating groups.

The primary key for the original relation is included in both of the new relations.

Record

matric_no subject grade

960100 Databases C

960100 Soft_Dev A

960100 ISDE D

960105 Soft_Dev B

960105 ISDE B

... ... ...

960150 Workshop B

Student

matric_no name date_of_birth

960100 Smith,J 14/11/1977

960105 White,A 10/05/1975

960120 Moore,T 11/03/1970

960145 Smith,J 09/01/1972

960150 Black,D 21/08/1973

We now have two relations, Student and Record. Student contains the original non-repeating groups


Record has the original repeating groups and the matric_no

Student(matric_no, name, date_of_birth)
Record(matric_no, subject, grade)

Matric_no remains the key to the Student relation. It cannot be the complete key to the new Record relation - we end up with a compound primary key consisting of matric_no and subject. The matric_no is the link between the two tables - it will allow us to find out which subjects a student is studying. So in the Record relation, matric_no is the foreign key.
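A SQL sketch of these two relations (the data types are assumptions for illustration):

CREATE TABLE Student (
    matric_no     INTEGER PRIMARY KEY,
    name          VARCHAR(50),
    date_of_birth DATE
);

CREATE TABLE Record (
    matric_no INTEGER REFERENCES Student(matric_no),   -- foreign key back to Student
    subject   VARCHAR(30),
    grade     CHAR(1),
    PRIMARY KEY (matric_no, subject)                   -- compound primary key
);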

This method has eliminated some of the anomalies. It does not always do so - it depends on the example chosen.

In this case we no longer have the insertion anomaly. It is now possible to enter new students without knowing the subjects that they will be studying; they will exist only in the Student table, and will not be entered in the Record table until they are studying at least one subject. We have also removed the deletion anomaly: if all of the `Databases' subject records are removed, student 960145 still exists in the Student table. We have also removed the update anomaly.

Student and Record are now in First Normal Form.

Second Normal Form

Second normal form (or 2NF) is a more stringent normal form defined as:

A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute is fully functionally dependent on the whole key.

Thus the relation is in 1NF with no repeating groups, and all non-key attributes must depend on the whole key, not just some part of it. Another way of saying this is that there must be no partial key dependencies (PKDs).

The problems arise when there is a compound key, e.g. the key to the Record relation - matric_no, subject. In this case it is possible for non-key attributes to depend on only part of the key - i.e. on only one of the two key attributes. This is what 2NF tries to prevent.

Consider again the Student relation from the flattened Student #2 table:

 Student(matric_no, name, date_of_birth, subject, grade ) 


There are no repeating groups, so the relation is already in 1NF. However, we have a compound primary key, so we must check all of the non-key attributes against each part of the key to ensure they are functionally dependent on it.

matric_no determines name and date_of_birth, but not grade. subject together with matric_no determines grade, but not name or date_of_birth. So there is a problem with potential redundancies.

A dependency diagram is used to show how non-key attributes relate to each part or combination of parts in the primary key.

Figure : Dependency Diagram

This relation is not in 2NF - it appears to be two tables squashed into one. The solution is to split the relation up into its component parts:

o separate out all the attributes that are solely dependent on matric_no and put them in a new Student_details relation, with matric_no as the primary key;
o separate out all the attributes that are solely dependent on subject - in this case no attributes are solely dependent on subject;
o separate out all the attributes that are solely dependent on matric_no + subject and put them into a separate Student relation, keyed on matric_no + subject.

   

All attributes in each relation are fully functionally dependent upon its primary key

These relations are now in 2NF

Figure : Dependencies after splitting

Interestingly this is the same set of relations as when we recognized that there were repeating terms in the table and directly removed the repeating terms. It should not really matter what process you followed when normalizing, as the end result should be similar relations.


Third Normal Form

3NF is an even stricter normal form and removes virtually all the redundant data:

A relation is in 3NF if, and only if, it is in 2NF and there are no transitive functional dependencies

Transitive functional dependencies arise when one non-key attribute is functionally dependent on another non-key attribute (FD: non-key attribute -> non-key attribute) and when there is redundancy in the database.

By definition transitive functional dependency can only occur if there is more than one non-key field, so we can say that a relation in 2NF with zero or one non-key field must automatically be in 3NF.

project_no manager address

p1 Black,B 32 High Street

p2 Smith,J 11 New Street

p3 Black,B 32 High Street

p4 Black,B 32 High Street

Project has more than one non-key field so we must check for transitive dependency:

address depends on the value in the manager column: every time Black,B is listed in the manager column, the address column has the value `32 High Street'. From this the relation and functional dependency can be implied as:

Project(project_no, manager, address)
manager -> address

In this case address is transitively dependent on manager. Manager is the determinant - it determines the value of address. It is a transitive functional dependency only if all attributes on the left of the “->” are not in the key but are all in the relation, and all attributes to the right of the “->” are not in the key, with at least one actually being in the relation.

Data redundancy arises from this: we duplicate the address if a manager is in charge of more than one project. This causes problems if we have to change the address - we have to change several entries, and this could lead to errors.

The solution is to eliminate the transitive functional dependency by splitting the table: create two relations - one with the transitive dependency in it, and another for all of the remaining attributes - i.e. split Project into Project and Manager.


o the determinant attribute becomes the primary key in the new relation
o manager becomes the primary key to the Manager relation
o the original key is the primary key to the remaining non-transitive attributes
o in this case, project_no remains the key to the new Projects table.

Project

project_no manager

p1 Black,B

p2 Smith,J

p3 Black,B

p4 Black,B

Manager

manager address

Black,B 32 High Street

Smith,J 11 New Street

Now we need to store the address only once. If we need to know a manager's address we can look it up in the Manager relation. The manager attribute is the link between the two tables, and in the Projects table it is now a foreign key. These relations are now in third normal form.

Summary: 1NF

A relation is in 1NF if it contains no repeating groups.

To convert an unnormalised relation to 1NF either:
o flatten the table and change the primary key, or
o decompose the relation into smaller relations, one for the repeating groups and one for the non-repeating groups. Remember to put the primary key from the original relation into both new relations. This option is liable to give the best results.

Summary: 2NF

A relation is in 2NF if it contains no repeating groups and no partial key functional dependencies.

Rule: a relation in 1NF with a single key field must be in 2NF.

To convert a relation with partial functional dependencies to 2NF, create a set of new relations:
o one relation for the attributes that are fully dependent upon the key;
o one relation for each part of the key that has partially dependent attributes.


Summary: 3NF

A relation is in 3NF if it contains no repeating groups, no partial functional dependencies, and no transitive functional dependencies.

To convert a relation with transitive functional dependencies to 3NF, remove the attributes involved in the transitive dependency and put them in a new relation.

Rule: a relation in 2NF with only one non-key attribute must be in 3NF. In a normalised relation a non-key field must provide a fact about the key, the whole key and nothing but the key. Relations in 3NF are sufficient for most practical database design problems. However, 3NF does not guarantee that all anomalies have been removed.

Boyce-Codd Normal Form (BCNF)

When a relation has more than one candidate key, anomalies may result even though the relation is in 3NF.

3NF does not deal satisfactorily with the case of a relation with overlapping candidate keys, i.e. composite candidate keys with at least one attribute in common. BCNF is based on the concept of a determinant. A determinant is any attribute (simple or composite) on which some other attribute is fully functionally dependent. A relation is in BCNF if, and only if, every determinant is a candidate key.

Consider the following relation and determinants.

R(a,b,c,d)
a,c -> b,d
a,d -> b

Here, the first determinant suggests that the primary key of R could be changed from a,b to a,c. If this change were made, all of the non-key attributes present in R could still be determined, and therefore the change is legal. However, the second determinant indicates that a,d determines b, but a,d could not be the key of R as a,d does not determine all of the non-key attributes of R (it does not determine c). We would say that the first determinant is a candidate key, but the second determinant is not a candidate key, and thus this relation is not in BCNF (but it is in third normal form).
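A determinant can be tested mechanically by computing the closure of its attributes under the functional dependencies: if the closure contains every attribute of the relation, the determinant is a candidate key. The Python sketch below is a small illustration of that check, not part of the original notes; the FD list mirrors the R(a,b,c,d) example above.

def closure(attrs, fds):
    # Return the set of attributes functionally determined by attrs,
    # given fds as a list of (lhs_set, rhs_set) pairs.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"a", "c"}, {"b", "d"}),   # a,c -> b,d
       ({"a", "d"}, {"b"})]        # a,d -> b
R = {"a", "b", "c", "d"}

print(closure({"a", "c"}, fds) == R)   # True: a,c determines everything, so it is a candidate key
print(closure({"a", "d"}, fds) == R)   # False: a,d is not a candidate key, so R is not in BCNF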

Normalisation to BCNF - Example 1

 

Patient No   Patient Name   Appointment Id   Time    Doctor
1            John           0                09:00   Zorro
2            Kerr           0                09:00   Killer
3            Adam           1                10:00   Zorro
4            Robert         0                13:00   Killer
5            Zane           1                14:00   Zorro

Let's consider the database extract shown above. This depicts a special dieting clinic where each patient has 4 appointments. On the first they are weighed, on the second they are exercised, on the third their fat is removed by surgery, and on the fourth their mouth is stitched closed… Not all patients need all four appointments! If the Patient Name begins with a letter before "P" they get a morning appointment, otherwise they get an afternoon appointment. Appointment 1 is either 09:00 or 13:00, appointment 2 is 10:00 or 14:00, and so on. From this (hopefully) make-believe scenario we can extract the following determinants:

DB(Patno,PatName,appNo,time,doctor)

Patno -> PatName
Patno,appNo -> Time,doctor
Time -> appNo

Now we have to decide what the primary key of DB is going to be. From the information we have, we could choose either:

DB(Patno,PatName,appNo,time,doctor) with key Patno,appNo   (example 1a), or
DB(Patno,PatName,appNo,time,doctor) with key Patno,time    (example 1b).

Example 1a - DB(Patno,PatName,appNo,time,doctor), key Patno,appNo

1NF Eliminate repeating groups.

None:

DB(Patno,PatName,appNo,time,doctor)

2NF Eliminate partial key dependencies

DB(Patno,appNo,time,doctor)
R1(Patno,PatName)

3NF Eliminate transitive dependencies

None: so just as 2NF


BCNF Every determinant is a candidate key

DB(Patno,appNo,time,doctor)
R1(Patno,PatName)

Go through all determinants where ALL of the left-hand attributes are present in a relation and at least ONE of the right-hand attributes is also present in that relation.

Patno -> PatName: Patno is present in DB, but PatName is not, so this determinant is not relevant to DB.

Patno,appNo -> Time,doctor: all of the LHS attributes are present, and time and doctor are also present, so it is relevant. Is this a candidate key? Patno,appNo IS the key, so it is a candidate key, and this determinant is acceptable for BCNF.

Time -> appNo: Time is present, and so is appNo, so it is relevant. Is Time a candidate key? If it were, time on its own would have to identify a row of DB(Patno,appNo,time,doctor). This does not work, as you need both time and Patno together to form a unique key. Thus this determinant is not a candidate key, and therefore DB is not in BCNF. We need to fix this.

BCNF: rewrite to
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)

time is enough to work out the appointment number of a patient. Now BCNF is satisfied, and the final relations shown are in BCNF.

Example 1b - DB(Patno,PatName,appNo,time,doctor), key Patno,time

1NF Eliminate repeating groups.

None:

DB(Patno,PatName,appNo,time,doctor)

2NF Eliminate partial key dependencies

DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)

3NF Eliminate transitive dependencies

None: so just as 2NF

BCNF Every determinant is a candidate key

DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)

Go through all determinants where ALL of the left-hand attributes are present in a relation and at least ONE of the right-hand attributes is also present in that relation.

Patno -> PatName: Patno is present in DB, but PatName is not, so this determinant is not relevant.

Patno,appNo -> Time,doctor: not all of the LHS attributes are present together in any one relation, so it is not relevant.

Time -> appNo: Time is present, and so is appNo, so it is relevant. Time is a candidate key here: it is the key of R2, so the rules for BCNF are satisfied.

BCNF: as 3NF
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)

Summary - Example 1

This example has demonstrated three things:

BCNF is stronger than 3NF: relations that are in 3NF are not necessarily in BCNF.

BCNF is needed in certain situations to obtain a full understanding of the data model.

There are several routes by which to arrive at the same set of relations in BCNF. Unfortunately there are no rules as to which route will be the easiest one to take.

Example 2

Grade_report(StudNo,StudName,(Major,Advisor, (CourseNo,Ctitle,InstrucName,InstrucLocn,Grade)))

Functional dependencies

StudNo -> StudName
CourseNo -> Ctitle,InstrucName
InstrucName -> InstrucLocn
StudNo,CourseNo,Major -> Grade
StudNo,Major -> Advisor
Advisor -> Major

Unnormalised

Grade_report(StudNo,StudName,(Major,Advisor, (CourseNo,Ctitle,InstrucName,InstrucLocn,Grade)))


1NF Remove repeating groups

Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Ctitle,InstrucName,InstrucLocn,Grade)

2NF Remove partial key dependencies

Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName,InstrucLocn)

3NF Remove transitive dependencies

Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstrucName,InstrucLocn)

BCNF Every determinant is a candidate key

Student: the only determinant is StudNo.
StudCourse: the only determinant is StudNo,Major.
Course: the only determinant is CourseNo.
Instructor: the only determinant is InstrucName.
StudMajor: the determinants are StudNo,Major and Advisor. Only StudNo,Major is a candidate key, so StudMajor must be split.

BCNF

Student(StudNo,StudName)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstrucName,InstrucLocn)
StudMajor(StudNo,Advisor)
Advisor(Advisor,Major)

Problems BCNF overcomes

STUDENT   MAJOR     ADVISOR
123       PHYSICS   EINSTEIN
123       MUSIC     MOZART
456       BIOLOGY   DARWIN
789       PHYSICS   BOHR
999       PHYSICS   EINSTEIN

If the record for student 456 is deleted, we lose not only the information on student 456 but also the fact that DARWIN advises in BIOLOGY. Likewise, we cannot record the fact that WATSON can advise on COMPUTING until we have a student majoring in COMPUTING to whom we can assign WATSON as an advisor.

In BCNF we have two tables:

STUDENT   ADVISOR
123       EINSTEIN
123       MOZART
456       DARWIN
789       BOHR
999       EINSTEIN

ADVISOR    MAJOR
EINSTEIN   PHYSICS
MOZART     MUSIC
DARWIN     BIOLOGY
BOHR       PHYSICS

Relational Database Language - Structured Query Language

SQL is a popular relational database language first standardized in 1986 by the American National Standards Institute (ANSI). Since then, it has been formally adopted as an International Standard by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). It has also been adopted as a Federal Information Processing Standard (FIPS) for the U.S. federal government.

Database Language SQL is under continual development by the above mentioned standardization bodies. The most recent published version was in 1992, a 580 page specification published by ANSI as American National Standard X3.135-1992 and by ISO/IEC as International Standard 9075:1992. The two specifications are word-for-word identical. Both versions are available in hard copy only from ANSI (Sales telephone: +1-212-642-4900). Further extensions and enhancements are under development


DESCRIPTION of SQL

The basic structure of the relational model is a table, consisting of rows and columns. Data definition includes declaring the name of each table to be included in a database, the names and data types of all columns of each table, constraints on the values in and among columns, and the granting of table-manipulation privileges to prospective users. Tables can be accessed by inserting new rows, deleting or updating existing rows, or selecting rows that satisfy a given search condition for output. Tables can be manipulated to produce new tables by Cartesian products, unions, intersections, joins on matching columns, or projections on given columns.

SQL data manipulation operations may be invoked through a cursor or through a general query specification. The language includes all arithmetic operations, predicates for comparison and string matching, universal and existential quantifiers, summary operations for max/min or count/sum, and GROUP BY and HAVING clauses to partition tables into groups. Transaction management is achieved through COMMIT and ROLLBACK statements.

The standard provides language facilities for defining application specific views of the data. Each view is the specification of database operations that would produce a desired table. The viewed table is then materialized at application execution time.

The SQL standard provides a Module Language for interface to other languages. Each SQL statement may be packaged as a procedure that can be called and have parameters passed to it from an external language. A cursor mechanism provides row-at-a-time access from languages that can only handle one row of a table at one time.

Access control is provided by GRANT and REVOKE statements. Each prospective user must be explicitly granted the privilege to access a specific table or view using a specific statement.

The SQL Integrity Enhancement facility offers additional tools for referential integrity, CHECK constraint clauses, and DEFAULT clauses. Referential integrity allows specification of primary and foreign keys with the requirement that no foreign key row may be inserted or updated unless a matching primary key row exists. Check clauses allow specification of inter-column constraints to be maintained by the database system. Default clauses provide optional default values for missing data.
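As a small illustration of these facilities, the sketch below uses Python's standard sqlite3 module; the branch/account tables, column names and sample values are invented for this example rather than taken from the notes, and other DBMSs express the same constraints with the same SQL keywords.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only on request

conn.execute("""CREATE TABLE branch (
    bname TEXT PRIMARY KEY,
    city  TEXT NOT NULL)""")

conn.execute("""CREATE TABLE account (
    account_no TEXT PRIMARY KEY,
    bname      TEXT NOT NULL REFERENCES branch(bname),   -- referential integrity
    balance    REAL NOT NULL DEFAULT 0.0                 -- DEFAULT clause
               CHECK (balance >= 0.0)                    -- CHECK constraint
)""")

conn.execute("INSERT INTO branch VALUES ('Perryridge', 'Horseneck')")
conn.execute("INSERT INTO account (account_no, bname) VALUES ('A-101', 'Perryridge')")  # balance defaults to 0.0

try:
    # Rejected: no matching primary key row exists in branch.
    conn.execute("INSERT INTO account (account_no, bname) VALUES ('A-102', 'Nowhere')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# GROUP BY / HAVING partition the table into groups, as described above.
for row in conn.execute("""SELECT bname, COUNT(*), SUM(balance)
                           FROM account GROUP BY bname HAVING COUNT(*) >= 1"""):
    print(row)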

The Embedded SQL specification provides SQL interface to programming languages, specifically Ada, C, COBOL, FORTRAN, MUMPS, Pascal, and PL/I. Applications may thereby integrate program control structures with SQL data manipulation capabilities. The Embedded SQL syntax is just a shorthand for an explicit SQL Module accessed from a standard conforming programming language.

SQL-92 significantly increases the size of the original 1986 standard to include a schema manipulation language for modifying or altering schemas, schema information tables to make schema definitions accessible to users, new facilities for dynamic creation of SQL statements, and new data types and domains. Other new SQL-92


features include outer join, cascade update and delete referential actions, set algebra on tables, transaction consistency levels, scrolled cursors, deferred constraint checking, and greatly expanded exception reporting. SQL-92 also removes a number of restrictions in order to make the language more flexible and orthogonal.

Why is it needed?

When data is stored on disk based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

Because a set of records can be physically sorted on only one field, searching on a field that isn't sorted requires a linear search, which needs an average of N/2 block accesses, where N is the number of blocks that the table spans. If that field is a non-key field (i.e. it doesn't contain unique entries) then the entire table space must be searched, at N block accesses.

With a sorted field, a binary search may be used instead; this requires only log2 N block accesses. Also, since the data is sorted, with a non-key field the rest of the table doesn't need to be searched for duplicate values once a higher value is found. Thus the performance increase is substantial.

What is indexing?

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.

The downside to indexing is that these indexes require additional space on the disk, since the indexes are stored together in a MyISAM database, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

How does it work?

Firstly, let’s outline a sample database table schema;

Field name         Data type      Size on disk
id (Primary key)   Unsigned INT   4 bytes
firstName          Char(50)       50 bytes
lastName           Char(50)       50 bytes
emailAddress       Char(100)      100 bytes

Note: char was used in place of varchar to allow for an accurate size on disk value. This sample database contains five million rows, and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).

Example 1

Given our sample database of r = 5,000,000 records of a fixed size giving a record length of R = 204 bytes and they are stored in a MyISAM database which is using the default block size B = 1,024 bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.

A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted, requiring an average of log2 1,000,000 ≈ 19.93, i.e. 20 block accesses. Instantly we can see this is a drastic improvement.

The firstName field, however, is neither sorted (so a binary search is impossible) nor unique, and thus the table may need to be searched to the very end: exactly N = 1,000,000 block accesses. It is this situation that indexing aims to correct.

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, and therefore fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below:

Field name         Data type   Size on disk
firstName          Char(50)    50 bytes
(record pointer)   Special     4 bytes

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

Example 2

Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes, the blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5,000,000/18 = 277,778 blocks.

Now a search using the firstName field can utilise the index to increase performance. A binary search of the index takes an average of log2 277,778 ≈ 18.08, i.e. 19 block accesses, and finding the address of the actual record requires one further block access to read it, bringing the total to 19 + 1 = 20 block accesses - a far cry from the 1,000,000 block accesses required to search the non-indexed table.
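The arithmetic in both examples can be reproduced with a few lines of Python; this is a small illustration of the same formulas (record sizes and block size as assumed above), not code from the notes.

import math

def blocks_needed(num_records, record_bytes, block_bytes=1024):
    # Blocking factor and total blocks, as in the worked examples above.
    bfr = block_bytes // record_bytes            # records per disk block
    return bfr, math.ceil(num_records / bfr)     # blocks needed to hold the file

r = 5_000_000

# Example 1: the full table, 204-byte records.
bfr, n_table = blocks_needed(r, 204)             # bfr = 5, n_table = 1,000,000
linear = n_table // 2                            # average linear search on a key field
binary = math.ceil(math.log2(n_table))           # binary search on the sorted id field

# Example 2: the firstName index, 54-byte entries.
_, n_index = blocks_needed(r, 54)                # n_index = 277,778
indexed = math.ceil(math.log2(n_index)) + 1      # index search + 1 access for the record

print(n_table, linear, binary, n_index, indexed)
# 1000000 500000 20 277778 20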

When should it be used?

Given that creating an index requires additional disk space (277,778 blocks extra in the above example), and that too many indexes can cause issues arising from the file system's size limits, careful thought must be used to select the correct fields to index.

Since indexes are only used to speed up the search for a matching field within the records, it stands to reason that indexing fields used only for output would simply waste disk space and processing time on every insert or delete operation, and thus should be avoided. Also, given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data into only two halves, and any low-cardinality field similarly yields large groups of matching records for each value. With such a low cardinality the effectiveness is reduced to little better than a linear search, and the query optimizer will typically avoid using the index if the cardinality is less than about 30% of the record number, effectively making the index a waste of space.

Overview of Physical Storage Media

1. Several types of data storage exist in most computer systems. They vary in speed of access, cost per unit of data, and reliability.

o Cache: the most costly and fastest form of storage. Usually very small, and managed by the operating system.

o Main memory (MM): the storage area for the data available to be operated on. General-purpose machine instructions operate on main memory. Contents of main memory are usually lost in a power failure or ``crash''. Main memory is usually too small (even with megabytes) and too expensive to store the entire database.

o Flash memory: EEPROM (electrically erasable programmable read-only memory). Data in flash memory survives power failure. Reading data from flash memory takes about 10 nanoseconds (roughly as fast as from main memory), but writing is more complicated: a write takes about 4-10 microseconds, and to overwrite what has been written one has to first erase an entire bank of the memory, which supports only a limited number of erase cycles. Flash memory has found popularity as a replacement for disks for storing small volumes of data (5-10 megabytes).

o Magnetic-disk storage: the primary medium for long-term storage. Typically the entire database is stored on disk. Data must be moved from disk to main memory in order to be operated on; after operations are performed, data must be copied back to disk if any changes were made. Disk storage is called direct-access storage because it is possible to read data on the disk in any order (unlike sequential access). Disk storage usually survives power failures and system crashes.

o Optical storage: CD-ROM (compact-disk read-only memory), WORM (write-once read-many) disks (for archival storage of data), and juke boxes (containing a few drives and numerous disks loaded on demand).

o Tape storage: used primarily for backup and archival data. Cheaper, but much slower access, since tape must be read sequentially from the beginning. Used as protection from disk failures!

2. The storage-device hierarchy is presented in Figure 10.1, where the higher levels are more expensive (cost per bit) and faster (access time), but have smaller capacity.

Figure 10.1: Storage-device hierarchy

3. Another classification: primary, secondary, and tertiary storage.

1. Primary storage: the fastest storage media, such as cache and main memory.

2. Secondary (or on-line) storage: the next level of the hierarchy, e.g., magnetic disks.

3. Tertiary (or off-line) storage: magnetic tapes and optical disk juke boxes.

Volatility of storage. Volatile storage loses its contents when the power is removed. Without power backup, data in the volatile storage (the part of the hierarchy from main memory up) must be written to nonvolatile storage for safekeeping.

Physical Characteristics of Disks

1. The storage capacity of a single disk ranges from 10MB to 10GB. A typical commercial database may require hundreds of disks.

2. Figure 10.2 shows a moving-head disk mechanism.
o Each disk platter has a flat circular shape. Its two surfaces are covered with a magnetic material, and information is recorded on the surfaces. The platters of hard disks are made from rigid metal or glass, while floppy disks are made from flexible material.
o The disk surface is logically divided into tracks, which are subdivided into sectors. A sector (varying from 32 bytes to 4096 bytes, usually 512 bytes) is the smallest unit of information that can be read from or written to the disk. There are 4-32 sectors per track and 20-1500 tracks per disk surface.
o The arm can be positioned over any one of the tracks.
o The platter is spun at high speed.
o To read information, the arm is positioned over the correct track.
o When the data to be accessed passes under the head, the read or write operation is performed.
3. A disk typically contains multiple platters (see Figure 10.2). The read-write heads of all the platters are mounted on a single assembly called a disk arm, and move together.
o Multiple disk arms are moved as a unit by the actuator.
o Each arm has two heads, to read the surfaces above and below it.
o The set of tracks over which the heads are located forms a cylinder.
o This cylinder holds the data that is accessible within the disk latency time.
o It is clearly sensible to store related data in the same or adjacent cylinders.
4. Disk platters range from 1.8" to 14" in diameter. 5 1/4" and 3 1/2" disks dominate because they have lower cost and faster seek times than larger disks, yet still provide high storage capacity.
5. A disk controller interfaces between the computer system and the actual hardware of the disk drive. It accepts commands to read or write a sector and initiates the actions. Disk controllers also attach checksums to each sector to check for read errors.
6. Remapping of bad sectors: if a controller detects that a sector is damaged when the disk is initially formatted, or when an attempt is made to write the sector, it can logically map the sector to a different physical location.
7. SCSI (Small Computer System Interconnect) is commonly used to connect disks to PCs and workstations. Mainframe and server systems usually have a faster and more expensive bus to connect to the disks.
8. Head crash: the read-write head touches the platter surface and damages the recording medium, usually causing the entire disk to fail.
9. A fixed-head disk has a separate head for each track -- very many heads, very expensive. Multiple disk arms allow more than one track to be accessed at a time. Both were used in high-performance mainframe systems but are relatively rare today.

Performance Measures of Disks

The main measures of the quality of a disk are capacity, access time, data-transfer rate, and reliability.

1. Access time: the time from when a read or write request is issued to when the data transfer begins. To access data on a given sector of a disk, the arm first must move so that it is positioned over the correct track, and then must wait for the sector to appear under it as the disk rotates. The time for repositioning the arm is called the seek time, and it increases with the distance the arm must move. Typical seek times range from 2 to 30 milliseconds.

Average seek time is the average of the seek times, measured over a sequence of (uniformly distributed) random requests, and it is about one third of the worst-case seek time.

Once the seek has occurred, the time spent waiting for the sector to be accessed to appear under the head is called the rotational latency time. Average rotational latency time is about half of the time for a full rotation of the disk. (Typical rotational speeds of disks range from 60 to 120 rotations per second.)

The access time is then the sum of the seek time and the latency, and ranges from 10 to 40 milliseconds.

2. Data-transfer rate: the rate at which data can be retrieved from or stored to the disk. Current disk systems support transfer rates from 1 to 5 megabytes per second.

3. Reliability: measured by the mean time to failure. The typical mean time to failure of disks today ranges from 30,000 to 800,000 hours (about 3.4 to 91 years).
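As a quick worked example (the numbers are chosen from the ranges quoted above, not from any specific drive), the average access time can be estimated as the average seek time plus the average rotational latency:

def avg_access_time_ms(avg_seek_ms, rotations_per_sec):
    # Average rotational latency is half of one full rotation.
    latency_ms = 0.5 * (1000.0 / rotations_per_sec)
    return avg_seek_ms + latency_ms

# e.g. a 10 ms average seek and 90 rotations/second (within the 60-120 range above)
print(avg_access_time_ms(10, 90))   # about 15.6 ms, inside the 10-40 ms range quoted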


Optimization of Disk-Block Access

1. Data is transferred between disk and main memory in units called blocks.
2. A block is a contiguous sequence of bytes from a single track of one platter.
3. Block sizes range from 512 bytes to several thousand.
4. The lower levels of the file-system manager convert block addresses into the hardware-level cylinder, surface, and sector numbers.
5. Access to data on disk is several orders of magnitude slower than access to data in main memory. Several optimization techniques are used, besides buffering of blocks in main memory:

o Scheduling: if several blocks from a cylinder need to be transferred, we may save time by requesting them in the order in which they pass under the heads. A commonly used disk-arm scheduling algorithm is the elevator algorithm (a small sketch follows this list).

o File organization: organize blocks on disk in a way that corresponds closely to the manner in which we expect data to be accessed. For example, store related information on the same track, on physically close tracks, or on adjacent cylinders in order to minimize seek time. IBM mainframe OSs give programmers fine control over the placement of files, but this increases the programmer's burden. UNIX and PC OSs hide disk organization from users. Over time, a sequential file may become fragmented. To reduce fragmentation, the system can make a back-up copy of the data on disk and restore the entire disk; the restore operation writes back the blocks of each file contiguously (or nearly so). Some systems, such as MS-DOS, have utilities that scan the disk and then move blocks to decrease the fragmentation.

o Nonvolatile write buffers: use nonvolatile RAM (such as battery-backed-up RAM) to speed up disk writes drastically (first write to the nonvolatile RAM buffer and inform the OS that the write has completed).

o Log disk: another approach to reducing write latency is to use a log disk, a disk devoted to writing a sequential log. All access to the log disk is sequential, essentially eliminating seek time, and several consecutive blocks can be written at once, making writes to the log disk several times faster than random writes.
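The elevator algorithm mentioned in the scheduling bullet above can be sketched in a few lines of Python; the track numbers and head position below are invented for illustration, and this is a simplified one-sweep version rather than the notes' own code.

def elevator_order(requests, head):
    # Serve track requests in elevator (SCAN) order: first every request at or
    # above the current head position while sweeping outward, then the rest
    # while sweeping back the other way.
    up = sorted(t for t in requests if t >= head)                    # served on the way out
    down = sorted((t for t in requests if t < head), reverse=True)   # served on the way back
    return up + down

print(elevator_order([98, 183, 37, 122, 14, 124, 65, 67], head=53))
# [65, 67, 98, 122, 124, 183, 37, 14]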

Optical Disks

1. CD-ROM has become a popular medium for distributing software, multimedia data, and other electronic published information.

2. Capacity of CD-ROM: about 500 MB. Disks are cheap to mass-produce, and so are the drives.
3. CD-ROM has a much longer seek time (about 250 ms) and a lower rotation speed (about 400 rpm), leading to higher latency and a lower data-transfer rate (about 150 KB/sec at the standard audio-CD speed); drives that spin at multiples of the audio-CD speed are available.
4. Recently a new optical format, the digital video disk (DVD), has become standard. These disks hold between 4.7 and 17 GB of data.
5. WORM (write-once, read-many) disks are popular for archival storage of data since they have a high capacity (about 500 MB), a longer lifetime than hard disks, and can be removed from the drive -- good for audit trails (hard to tamper with).

Magnetic Tapes

1. Tapes have a long history, are slow, and are limited to sequential access; they are therefore used for backup, for storage of infrequently accessed data, and as an off-line medium for system transfer.
2. Moving to the correct spot on a tape may take minutes, but once positioned, tape drives can write data at densities and speeds approaching those of disk drives.
3. An 8mm tape drive has the highest density; a 350-foot tape can store 5 GB of data.
4. Tapes are popularly used for storage of large volumes of data, such as video, image, or remote-sensing data.

Storage Access

1. Each file is partitioned into fixed-length storage units, called blocks, which are the units of both storage allocation and data transfer.

2. It is desirable to keep as many blocks as possible in main memory. Usually, we cannot keep all blocks in main memory, so we need to manage the allocation of available main memory space.

3. We need to use disk storage for the database, and to transfer blocks of data between main memory and disk. We also want to minimize the number of such transfers, as they are time-consuming.

4. The buffer is the part of main memory available for storage of copies of disk blocks.

Buffer manager

1. The subsystem responsible for the allocation of buffer space is called the buffer manager.
o The buffer manager handles all requests for blocks of the database.
o If the block is already in main memory, its main-memory address is given to the requester.
o If not, the buffer manager must read the block in from disk (possibly displacing some other block if the buffer is full) and then pass the address in main memory to the requester.
2. The buffer manager must use some sophisticated techniques in order to provide good service:
o Replacement strategy -- when there is no room left in the buffer, some block must be removed to make way for the new one. Typical operating-system memory-management schemes use a ``least recently used'' (LRU) method: simply remove the block least recently referenced. This can be improved upon for database applications.
o Pinned blocks -- for the database to be able to recover from crashes, we need to restrict the times when a block may be written back to disk. A block not allowed to be written is said to be pinned. Many operating systems do not provide support for pinned blocks, yet such a feature is essential if a database is to be ``crash resistant''.
o Forced output of blocks -- sometimes it is necessary to write a block back to disk even though its buffer space is not needed (this is called the forced output of a block). This is because main-memory contents (and thus the buffer) are lost in a crash, while disk data usually survives.

Buffer replacement policies

1. Replacement strategy: the goal is minimization of accesses to disk. Generally it is hard to predict which blocks will be referenced, so operating systems use the history of past references as a guide to prediction.
o General assumption: blocks referenced recently are likely to be referenced again.
o Therefore, if we need space, throw out the least recently referenced block (the LRU replacement scheme).
2. LRU is acceptable in operating systems; however, a database system is often able to predict future references more accurately.
3. Consider processing of the relational algebra expression borrower ⋈ customer (the join of borrower and customer).
4. Further, assume the strategy to process this request is given by the following pseudo-code:

for each tuple b of borrower do
    for each tuple c of customer do
        if b[cname] = c[cname]
        then begin
            let x be a tuple defined as follows:
                x[cname]  := b[cname]
                x[loan#]  := b[loan#]
                x[street] := c[street]
                x[city]   := c[city]
            include tuple x as part of the result of borrower ⋈ customer
        end
    end
end

5. Assume that the two relations in this example are stored in different files.
o Once a tuple of borrower has been processed, it is not needed again. Therefore, once processing of an entire block of borrower tuples is finished, that block is no longer needed in main memory, even though it has been used very recently.
o The buffer manager should free the space occupied by a borrower block as soon as it has been processed. This strategy is called toss-immediate.
o Now consider the blocks containing customer tuples.
o Every block of customer tuples must be examined once for every tuple of the borrower relation. When processing of a customer block is completed, it will not be used again until all other customer blocks have been processed. This means the most recently used (MRU) block will be the last block to be re-referenced, and the least recently used block will be referenced next.
o This is the opposite of the LRU assumptions. So for the inner relation's blocks, use an MRU strategy -- if a customer block must be removed from the buffer, choose the MRU block.
o For the MRU strategy, the system must pin the customer block currently being processed until its last tuple has been processed. Then it is unpinned, becoming the most recently used block.
6. The buffer manager may also use statistical information regarding the probability that a request will reference a particular relation.
o The data dictionary is the most frequently used part of the database. It should, therefore, not be removed from main memory unless necessary.
o File indices are also frequently used, and should generally be kept in main memory.
o No single strategy is known that handles all possible scenarios well.
o Many database systems use LRU, despite its faults.
o Concurrency and recovery may need other buffer-management strategies, such as delayed buffer-out or forced output.
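The replacement policies discussed above are easy to prototype. The following Python sketch of a tiny buffer pool (block reads simulated by a callback; class, method and block names are invented for illustration) shows LRU eviction with support for pinned blocks:

from collections import OrderedDict

class BufferPool:
    """Toy buffer manager: LRU eviction, and pinned blocks are never evicted."""
    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk   # callback: block_id -> block contents
        self.frames = OrderedDict()            # block_id -> contents, least recently used first
        self.pinned = set()

    def pin(self, block_id):   self.pinned.add(block_id)
    def unpin(self, block_id): self.pinned.discard(block_id)

    def get(self, block_id):
        if block_id in self.frames:                  # hit: mark as most recently used
            self.frames.move_to_end(block_id)
            return self.frames[block_id]
        if len(self.frames) >= self.capacity:        # miss with full buffer: evict LRU unpinned block
            for victim in self.frames:
                if victim not in self.pinned:
                    del self.frames[victim]
                    break
            else:
                raise RuntimeError("all buffer frames are pinned")
        self.frames[block_id] = self.read_from_disk(block_id)
        return self.frames[block_id]

pool = BufferPool(capacity=2, read_from_disk=lambda b: f"contents of {b}")
pool.get("customer-1"); pool.get("customer-2"); pool.get("borrower-1")
print(list(pool.frames))   # ['customer-2', 'borrower-1'] -- customer-1 was evicted (LRU)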

File Organization

1. A file is organized logically as a sequence of records.
2. Records are mapped onto disk blocks.
3. Files are provided as a basic construct in operating systems, so we assume the existence of an underlying file system.
4. Blocks are of a fixed size determined by the operating system.
5. Record sizes vary.
6. In a relational database, tuples of distinct relations may be of different sizes.
7. One approach to mapping the database to files is to store records of a single length in a given file.
8. An alternative is to structure files to accommodate variable-length records. (Fixed-length is easier to implement.)

Fixed-Length Records

Consider a file of deposit records of the form:

type deposit = record
    bname    : char(22);
    account# : char(10);
    balance  : real;
end

o If we assume that each character occupies one byte, an integer occupies 4 bytes, and a real 8 bytes, our deposit record is 40 bytes long.
o The simplest approach is to use the first 40 bytes for the first record, the next 40 bytes for the second, and so on.
o However, there are two problems with this approach:
o It is difficult to delete a record from this structure. The space it occupied must somehow be reclaimed, or we need to mark deleted records so that they can be ignored.
o Unless the block size is a multiple of 40, some records will cross block boundaries; it would then require two block accesses to read or write such a record.

2. When a record is deleted, we could move all successive records up one position (Figure 10.7), which may require moving a lot of records.
o We could instead move the last record into the ``hole'' created by the deleted record (Figure 10.8).
o This changes the order the records are in.
o It turns out to be undesirable to move records to occupy freed space, as moving requires block accesses.
o Also, insertions tend to be more frequent than deletions, so it is acceptable to leave the space open and wait for a subsequent insertion.
o This leads to a need for additional structure in our file design.

3. So one solution is:
o At the beginning of the file, allocate some bytes as a file header.
o For now this header need only store the address of the first record whose contents have been deleted.
o That first deleted record can then store the address of the second available record, and so on (Figure 10.9), forming a free list. (A small sketch of this free list follows below.)
o To insert a new record, we use the record pointed to by the header, and change the header pointer to the next available record.
o If no deleted records exist, we add our new record to the end of the file.
4. Note: the use of pointers requires careful programming. If a record pointed to is moved or deleted, and the pointer is not corrected, the pointer becomes a dangling pointer. Records pointed to are called pinned.
5. Fixed-length file insertions and deletions are relatively simple because ``one size fits all''. For variable-length records, this is not the case.
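The free-list idea can be illustrated with a small in-memory model in Python (record slots held in a list, with the "header" holding the index of the first free slot). This is only a sketch of the mechanism, with invented class and method names, not the notes' actual file format:

class FixedLengthFile:
    """Toy model of fixed-length record storage with a deleted-record free list."""
    def __init__(self):
        self.slots = []        # each slot holds a record, or the index of the next free slot
        self.free_head = None  # the "file header": first free slot, or None

    def insert(self, record):
        if self.free_head is not None:          # reuse a deleted slot
            slot = self.free_head
            self.free_head = self.slots[slot]   # header now points to the next free slot
            self.slots[slot] = record
        else:                                   # no free slot: append at the end of the file
            slot = len(self.slots)
            self.slots.append(record)
        return slot

    def delete(self, slot):
        self.slots[slot] = self.free_head       # the deleted slot stores the old head
        self.free_head = slot                   # header points at the newly freed slot

f = FixedLengthFile()
a = f.insert(("Perryridge", "A-102", 400.0))
b = f.insert(("Round Hill", "A-305", 350.0))
f.delete(a)
c = f.insert(("Mianus", "A-215", 700.0))   # reuses slot a instead of growing the file
print(a == c, f.slots)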

Variable-Length Records

1. Variable-length records arise in a database in several ways:
o Storage of multiple record types in a file.
o Record types allowing variable field sizes.
o Record types allowing repeating fields.

We'll look at several techniques, using one example with a variable-length record:

type account-list = record
    bname        : char(22);
    account-info : array of record
        account# : char(10);
        balance  : real;
    end
end

Account-information is an array with an arbitrary number of elements.


Byte string representation

1. Attach a special end-of-record symbol to the end of each record. Each record is then stored as a string of successive bytes (see Figure 10.10).

Byte-string representation has several disadvantages:
o It is not easy to re-use the space left by a deleted record.
o In general, there is no space for records to grow longer. (A record must be moved to expand, and the record may be pinned.)

So this method is not usually used.

2. An interesting alternative structure is the slotted page structure. There is a header at the beginning of each block, containing:
o the number of record entries in the header,
o the end of free space in the block,
o an array whose entries contain the location and size of each record.

3. The slotted page structure requires that there be no pointers that point directly to records. Instead, pointers must point to the entry in the header that contains the actual location of the record. This level of indirection allows records to be moved to prevent fragmentation of space inside a block, while still supporting indirect pointers to the record.
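A minimal model of such a slotted page is sketched below in Python (the block size, record layout and method names are invented for illustration); the header array of (offset, length) entries is what lets records move inside the block without invalidating external pointers:

class SlottedPage:
    """Toy slotted page: record bytes grow from the end of the block,
    while the slot directory of (offset, length) entries grows from the front."""
    def __init__(self, block_size=1024):
        self.block = bytearray(block_size)
        self.free_end = block_size      # end of free space; records sit after this point
        self.slots = []                 # slot directory: (offset, length) for each record

    def insert(self, record: bytes) -> int:
        self.free_end -= len(record)
        self.block[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1      # external pointers store this slot number, not the offset

    def get(self, slot_no: int) -> bytes:
        offset, length = self.slots[slot_no]
        return bytes(self.block[offset:offset + length])

page = SlottedPage()
s = page.insert(b"Perryridge|A-102|400")
# If compaction later moved the record inside the block, only the (offset, length)
# entry in the slot directory would change; slot number s would stay valid.
print(page.get(s))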

Fixed-length representation

1. This approach uses one or more fixed-length records to represent one variable-length record.
2. Two techniques:
o Reserved space - use fixed-length records large enough to accommodate the largest variable-length record (unused space is filled with an end-of-record symbol).
o Pointers - represent the record by a list of fixed-length records, chained together.
3. The reserved-space method requires the selection of some maximum record length (Figure 10.12). If most records are of near-maximum length this method is useful; otherwise, space is wasted.

4. Otherwise the pointer method may be used (Figure 10.13). Its disadvantage is that space is wasted in all records after the first in a chain, because the non-repeating fields are still present but unused.
5. To overcome this disadvantage we can split the records between two kinds of block (see Figure 10.14):
o Anchor block - contains the first record of each chain.
o Overflow block - contains records other than the first in a chain.
Now all records in a block have the same length, and there is no wasted space.

Organization of Records in Files

There are several ways of organizing records in files.

heap file organization. Any record can be placed anywhere in the file where there is space for the record. There is no ordering of records.

sequential file organization. Records are stored in sequential order, based on the value of the search key of each record.

hashing file organization. A hash function is computed on some attribute of each record. The result of the function specifies in which block of the file the record should be placed -- to be discussed in chapter 11 since it is closely related to the indexing structure.

clustering file organization. Records of several different relations can be stored in the same file. Related records of the different relations are stored on the same block so that one I/O operation fetches related records from all the relations.

Sequential File Organization

1. A sequential file is designed for efficient processing of records in sorted order on some search key.
o Records are chained together by pointers to permit fast retrieval in search-key order.
o Each pointer points to the next record in order.
o Records are stored physically in search-key order (or as close to this as possible).
o This minimizes the number of block accesses.
o Figure 10.15 shows an example, with bname as the search key.
2. It is difficult to maintain physical sequential order as records are inserted and deleted.
o Deletion can be managed with the pointer chains.
o Insertion poses problems if there is no space where the new record should go.
o If there is space, use it; otherwise put the new record in an overflow block.
o Adjust pointers accordingly.
o Figure 10.16 shows the previous example after an insertion.
o Problem: we now have some records out of physical sequential order.
o If very few records are in overflow blocks, this will still work well.
o If sequential order is lost, reorganize the file.
o Reorganizations are expensive and are done when the system load is low.
3. If insertions rarely occur, we could keep the file in physically sorted order and reorganize it when an insertion occurs. In this case, the pointer fields are no longer required.

Clustering File Organization

1. One relation per file, with fixed-length records, is good for small databases, and it also reduces the code size.
2. Many large-scale DB systems do not rely directly on the underlying operating system for file management. One large OS file is allocated to the DB system and all relations are stored in that one file.
3. To efficiently execute queries involving the join of customer and depositor, one may store the depositor tuples for each cname near the customer tuple for the corresponding cname, as shown in Figure 10.19.
4. This structure mixes together tuples from two relations, but allows for efficient processing of the join.
5. If a customer has many accounts which cannot fit in one block, the remaining records appear in nearby blocks. This file structure, called clustering, allows us to read many of the required records using one block read.
6. Our use of clustering enhances the processing of this particular join but may result in slow processing of other types of queries, such as a selection on customer.

For example, the query

select * from customer

now requires more block accesses, because our customer relation is interspersed with the depositor relation.

7. Thus it is a trade-off, depending on the types of query that the database designer believes to be most frequent. Careful use of clustering may produce significant performance gain.

Data Dictionary Storage

1. The database also needs to store information about the relations, known as the data dictionary. This includes:
o Names of relations.
o Names of attributes of relations.
o Domains and lengths of attributes.
o Names and definitions of views.
o Integrity constraints (e.g., key constraints).

plus data on the system users:
o Names of authorized users.
o Accounting information about users.

plus (possibly) statistical and descriptive data:
o Number of tuples in each relation.
o Method of storage used for each relation (e.g., clustered or non-clustered).

2. When we look at indices (Chapter 11), we'll also see a need to store information about each index on each relation:
o Name of the index.
o Name of the relation being indexed.
o Attributes the index is on.
o Type of index.

3. This information is, in itself, a miniature database. We can use the database to store data about itself, simplifying the overall structure of the system and allowing the full power of the database to be used to permit fast access to system data.

4. The exact choice of how to represent system data using relations must be made by the system designer. One possible representation follows.

System-catalog-schema = (relation-name, number-attrs)
Attr-schema = (attr-name, rel-name, domain-type, position, length)
User-schema = (user-name, encrypted-password, group)
Index-schema = (index-name, rel-name, index-type, index-attr)
View-schema = (view-name, definition)

INDEXES

1. An index for a file works like a catalogue in a library: cards in alphabetic order tell us where to find books by a particular author.
2. In real-world databases, indices like this might be too large to be handled efficiently, so we will look at more sophisticated indexing techniques.
3. There are two kinds of indices.
o Ordered indices: based on a sorted ordering of the values.
o Hash indices: based on the values being distributed uniformly across a range of buckets. The bucket to which a value is assigned is determined by a function, called a hash function.
4. We will consider several indexing techniques. No one technique is the best; each technique is best suited to a particular database application.
5. Methods will be evaluated on:

1. Access types -- the types of access that are supported efficiently, e.g., value-based search or range search.
2. Access time -- the time to find a particular data item or set of items.
3. Insertion time -- the time taken to insert a new data item (including the time to find the right place to insert).
4. Deletion time -- the time to delete an item (including the time taken to find the item, as well as to update the index structure).
5. Space overhead -- the additional space occupied by an index structure.

We may have more than one index or hash function for a file. (The library may have card catalogues by author, subject or title.) The attribute or set of attributes used to look up records in a file is called the search key (not to be confused with primary key, etc.).

Ordered Indices

1. In order to allow fast random access, an index structure may be used.
2. A file may have several indices on different search keys.
3. If the file containing the records is sequentially ordered, the index whose search key specifies the sequential order of the file is the primary index, or clustering index. Note: the search key of a primary index is usually the primary key, but it is not necessarily so.
4. Indices whose search key specifies an order different from the sequential order of the file are called secondary indices, or nonclustering indices.

Primary Index

1. Index-sequential files: Files are ordered sequentially on some search key, and a primary index is associated with it.

Figure 11.1: Sequential file for deposit records.

Dense and Sparse Indices

1. There are two types of ordered indices:


Dense index:
o An index record appears for every search-key value in the file.
o This record contains the search-key value and a pointer to the actual record.

Sparse index:
o Index records are created only for some of the records.
o To locate a record, we find the index record with the largest search-key value less than or equal to the search-key value we are looking for.
o We start at the record pointed to by that index record, and proceed along the pointers in the file (that is, sequentially) until we find the desired record.

2. Figures 11.2 and 11.3 show dense and sparse indices for the deposit file.

Figure 11.2: Dense index.

3. Notice how we would find records for Perryridge branch using both methods. (Do it!)

Figure 11.3: Sparse index.


4. Dense indices are faster in general, but sparse indices require less space and impose less maintenance overhead for insertions and deletions. (Why?)
5. A good compromise is to have a sparse index with one entry per block.

Why is this good?
o The biggest cost is in bringing a block into main memory.
o We are guaranteed to have the correct block with this method, unless the record is on an overflow block (in which case it could actually be several blocks).
o The index size is still small.
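The sparse-index lookup just described (largest index entry less than or equal to the search key, then a scan of the block it points to) can be sketched in Python; the block contents and branch names below are invented for illustration only:

import bisect

# One sparse index entry per block: (first search-key value in block, block number).
sparse_index = [("Brighton", 0), ("Mianus", 1), ("Redwood", 2)]
blocks = {0: ["Brighton", "Downtown"],
          1: ["Mianus", "Perryridge"],
          2: ["Redwood", "Round Hill"]}

def lookup(key):
    keys = [k for k, _ in sparse_index]
    # Largest index entry whose key is <= the key we are looking for.
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None
    block_no = sparse_index[pos][1]
    # Scan the block it points to (with one entry per block, the record,
    # if present at all, lives in this block).
    return key if key in blocks[block_no] else None

print(lookup("Perryridge"))   # found via the 'Mianus' index entry, block 1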

Multi-Level Indices

1. Even with a sparse index, the index may still grow too large. For 100,000 records with 10 per block, at one index record per block, that is 10,000 index records! Even if we can fit 100 index records per block, this is 100 blocks.
2. If the index is too large to be kept in main memory, a search results in several disk reads.
o If there are no overflow blocks in the index, we can use binary search.
o This will read as many as ⌈log2(b)⌉ blocks, where b is the number of index blocks (as many as 7 for our 100 blocks).
o If the index has overflow blocks, then a sequential search is typically used, reading all b index blocks.
3. Solution: construct a sparse index on the index (Figure 11.4).

Figure 11.4: Two-level sparse index.

4. Use binary search on the outer index. Scan the index block found until the correct index record is found. Use that index record as before - scan the block it points to for the desired record.
5. For very large files, additional levels of indexing may be required.
6. Indices must be updated at all levels when insertions or deletions require it.
7. Frequently, each level of index corresponds to a unit of physical storage (e.g. indices at the level of track, cylinder and disk).

Index Update

Regardless of what form of index is used, every index must be updated whenever a record is either inserted into or deleted from the file.

1. Deletion:
o Find (look up) the record.
o If it is the last record with its particular search-key value, delete that search-key value from the index.
o For dense indices, this is like deleting a record in a file.
o For sparse indices, delete a key value by replacing the key value's entry in the index with the next search-key value. If that value already has an index entry, simply delete the entry.
2. Insertion:
o Find the place to insert.
o Dense index: insert the search-key value if it is not already present.
o Sparse index: no change unless a new block is created. (In this case, the first search-key value appearing in the new block is inserted into the index.)

Secondary Indices

1. If the search key of a secondary index is not a candidate key, it is not enough to point to just the first record with each search-key value because the remaining records with the same search-key value could be anywhere in the file. Therefore, a secondary index must contain pointers to all the records.

Figure 11.5: Sparse secondary index on cname.

2. We can use an extra level of indirection to implement secondary indices on search keys that are not candidate keys. A pointer does not point directly to the file but to a bucket that contains pointers to the file.
o See Figure 11.5, a secondary index on cname.
o To perform a lookup on Peterson, we must read all three records pointed to by entries in bucket 2.
o Only one entry points to a Peterson record, but three records need to be read.
o As the file is not ordered physically by cname, this may take 3 block accesses.
3. Secondary indices must be dense, with an index entry for every search-key value and a pointer to every record in the file.
4. Secondary indices improve the performance of queries on non-primary keys.
5. They also impose serious overhead on database modification: whenever the file is updated, every index must be updated.
6. The designer must decide whether to use secondary indices or not.

B+-Tree Index Files

1. The primary disadvantage of index-sequential file organization is that performance degrades as the file grows. This can be remedied by costly re-organizations.
2. The B+-tree file structure maintains its efficiency despite frequent insertions and deletions. It imposes some acceptable update and space overheads.
3. A B+-tree index is a balanced tree in which every path from the root to a leaf is of the same length.
4. Each nonleaf node in the tree must have between ⌈n/2⌉ and n children, where n is fixed for a particular tree.

Structure of a B+-Tree

1. A B+-tree index is a multilevel index, but it is structured differently from the multi-level index on index-sequential files.
2. A typical node (Figure 11.6) contains up to n-1 search-key values K1, K2, ..., Kn-1, and n pointers P1, P2, ..., Pn. The search-key values within a node are kept in sorted order.

Figure 11.6: Typical node of a B+-tree.

3. For leaf nodes, pointer Pi (i = 1, ..., n-1) points either to a file record with search-key value Ki, or to a bucket of pointers to records with that search-key value. The bucket structure is used if the search key is not a primary key and the file is not sorted in search-key order.

Pointer Pn (the nth pointer in the leaf node) is used to chain the leaf nodes together in linear order (search-key order). This allows efficient sequential processing of the file.

The ranges of values in different leaves do not overlap.

4. Non-leaf nodes form a multilevel index on the leaf nodes. A non-leaf node may hold up to n pointers and must hold at least ⌈n/2⌉ pointers. The number of pointers in a node is called the fan-out of the node.

Consider a node containing m pointers. Pointer Pi (1 < i < m) points to the subtree containing search-key values greater than or equal to Ki-1 and less than Ki. Pointer Pm points to the subtree containing search-key values greater than or equal to Km-1, and pointer P1 points to the subtree containing search-key values less than K1.

5. Figures 11.7 (textbook Fig. 11.8) and textbook Fig. 11.9 show B+-trees for the deposit file with n=3 and n=5.


Figure 11.7: B+-tree for deposit file with n = 3.

Queries on B+-Trees

1. Suppose we want to find all records with a search-key value of k.
o Examine the root node and find the smallest search-key value Ki greater than k.
o Follow pointer Pi to another node.
o If k is less than every search-key value in the node, follow pointer P1.
o If k is greater than or equal to every search-key value in the node, follow the last non-null pointer.
o Continue down through the non-leaf nodes in this way, looking for the smallest search-key value greater than k and following the corresponding pointer.
o Eventually we arrive at a leaf node, where a pointer will point to the desired record or bucket.
2. In processing a query, we traverse a path from the root to a leaf node. If there are K search-key values in the file, this path is no longer than ⌈log⌈n/2⌉(K)⌉ (the logarithm is taken to the base ⌈n/2⌉).

This means that the path is not long, even in large files. For a 4 KB disk block with a search-key size of 12 bytes and a disk pointer of 8 bytes, n is around 200. Even with a conservative n = 100, a look-up of 1 million search-key values requires at most ⌈log50(1,000,000)⌉ = 4 nodes to be accessed. Since the root is usually in the buffer, typically it takes only 3 or fewer disk reads.
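The height bound just quoted is easy to check numerically; the snippet below (plain Python, with the parameter values taken from the paragraph above) reproduces the "4 nodes" figure:

import math

def max_bplus_path_length(num_keys, n):
    # Upper bound on the root-to-leaf path length: ceil(log base ceil(n/2) of K).
    fanout = math.ceil(n / 2)
    return math.ceil(math.log(num_keys, fanout))

print(max_bplus_path_length(1_000_000, 100))   # 4 nodes on the path for n = 100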

Updates on B+-Trees

1. Insertions and Deletions:

Insertion and deletion are more complicated, as they may require splitting or combining nodes to keep the tree balanced. If splitting or combining is not required, insertion works as follows:
o Find the leaf node where the search-key value should appear.
o If the value is present, add the new record to the bucket.
o If the value is not present, insert the value into the leaf node (so that the search keys are still in order), then create a new bucket and insert the new record into it.

If splitting or combining is not required, deletion works as follows:
o Find the record to be deleted, and remove it from its bucket.
o If the bucket is now empty, remove the search-key value from the leaf node.

2. Insertions Causing Splitting:

When an insertion causes a leaf node to become too large, we split that node. In Figure 11.8, assume we wish to insert a record with a bname value of ``Clearview''.
o There is no room for it in the leaf node where it should appear.
o We now have n values (the n-1 search-key values plus the new one we wish to insert).
o We put the first ⌈n/2⌉ values in the existing node, and the remainder into a new node.
o Figure 11.10 shows the result.
o The new node must be inserted into the B+-tree.
o We also need to update the search-key values for the parent (or higher) nodes of the split leaf node (except when the new node is the leftmost one).
o Order must be preserved among the search-key values in each node.
o If the parent was already full, it will have to be split as well.
o When a non-leaf node is split, the children are divided between the two new nodes.
o In the worst case, splits may be required all the way up to the root. (If the root is split, the tree becomes one level deeper.)
o Note: when we start a B+-tree, we begin with a single node that is both the root and a single leaf. When it gets full and another insertion occurs, we split it into two leaf nodes, requiring a new root.

3. Deletions Causing Combining:

Deleting records may cause tree nodes to contain too few pointers. Then we must combine nodes.
o If we wish to delete ``Downtown'' from the B+-tree of Figure 11.11, this occurs: the leaf node becomes empty and must be deleted.
o If we wish to delete ``Perryridge'' from the B+-tree of Figure 11.11, the parent is left with only one pointer, and must be coalesced with a sibling node.
o Sometimes higher-level nodes must also be coalesced.
o If the root becomes empty as a result, the tree becomes one level less deep (Figure 11.13).
o Sometimes the pointers must instead be redistributed, to keep the tree balanced.
o Deleting ``Perryridge'' from Figure 11.11 produces Figure 11.14.

4. To summarize:
o Insertion and deletion are complicated, but require relatively few operations.
o The number of operations required for insertion and deletion is proportional to the logarithm of the number of search keys.
o B+-trees are fast index structures for databases.


B+-Tree File Organization

1. The B+-tree structure is used not only as an index but also as an organizer for the records of a file.
2. In a B+-tree file organization, the leaf nodes of the tree store records instead of storing pointers to records, as shown in Fig. 11.17.
3. Since records are usually larger than pointers, the maximum number of records that can be stored in a leaf node is less than the maximum number of pointers in a nonleaf node.
4. However, the leaf nodes are still required to be at least half full.
5. Insertion into and deletion from a B+-tree file organization are handled in the same way as in a B+-tree index.
6. When a B+-tree is used for file organization, space utilization is particularly important. We can improve the space utilization by involving more sibling nodes in redistribution during splits and merges.
7. In general, if m nodes are involved in a redistribution, each node can be guaranteed to contain at least ⌊(m-1)n/m⌋ entries. However, the cost of updates becomes higher as more siblings are involved in the redistribution.

B-Tree Index Files

1. B-tree indices are similar to B+-tree indices.
o The difference is that a B-tree eliminates the redundant storage of search key values.
o In the B+-tree of Figure 11.11, some search key values appear twice.
o A corresponding B-tree (Figure 11.18) allows search key values to appear only once.
o Thus we can store the index in less space.

Figure 11.8: Leaf and nonleaf node of a B-tree.

2. Advantages:
o Lack of redundant storage (but only a marginal difference in space).
o Some searches are faster (the key may be found in a non-leaf node).
3. Disadvantages:
o Leaf and non-leaf nodes are of different sizes (complicates storage).
o Deletion may occur in a non-leaf node (more complicated).

Generally, the structural simplicity of the B+-tree is preferred.

Static Hashing


1. Index schemes force us to traverse an index structure. Hashing avoids this.

Hash File Organization

1. Hashing involves computing the address of a data item by computing a function on the search key value.

2. A hash function h is a function from the set of all search key values K to the set of all bucket addresses B.

o We choose a number of buckets to correspond to the number of search key values we will have stored in the database.

o To perform a lookup on a search key value Ki, we compute h(Ki), and search the bucket with that address.
o If two search keys Ki and Kj map to the same address, because h(Ki) = h(Kj), then the bucket at that address will contain records with both search key values.

o In this case we will have to check the search key value of every record in the bucket to get the ones we want.

o Insertion and deletion are simple.

Hash Functions

1. A good hash function gives an average-case lookup that is a small constant, independent of the number of search keys.

2. We hope records are distributed uniformly among the buckets.
3. The worst hash function maps all keys to the same bucket.
4. The best hash function maps all keys to distinct addresses.
5. Ideally, the distribution of keys to addresses is uniform and random.
6. Suppose we have 26 buckets, and map names beginning with the ith letter of the alphabet to the ith bucket.
o Problem: this does not give a uniform distribution.
o Many more names will be mapped to ``A'' than to ``X''.
o Typical hash functions perform some operation on the internal binary machine representations of the characters in a key.
o For example, compute the sum, modulo the number of buckets, of the binary representations of the characters of the search key.
o See Figure 11.18, which uses this method for 10 buckets (assuming the ith character in the alphabet is represented by integer i).
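As an illustration of the last bullet, here is a minimal Python sketch (not part of the original notes) of such a hash function; the bucket count of 10 and the sample branch names are assumptions made only for the example.

NUM_BUCKETS = 10  # assumed bucket count for illustration

def h(search_key: str) -> int:
    # Sum the internal (Unicode/ASCII) codes of the characters in the key,
    # then take the result modulo the number of buckets.
    return sum(ord(ch) for ch in search_key) % NUM_BUCKETS

for name in ("Perryridge", "Round Hill", "Downtown", "Redwood"):
    print(name, "->", h(name))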

Handling of bucket overflows

1. Open hashing occurs where records are stored in different buckets. Compute the hash function and search the corresponding bucket to find a record.

2. Closed hashing occurs where all records are stored in one bucket. Hash function computes addresses within that bucket. (Deletions are difficult.) Not used much in database applications.


3. Drawback to our approach: Hash function must be chosen at implementation time.

o The number of buckets is fixed, but the database may grow.
o If the number is too large, we waste space.
o If the number is too small, we get too many ``collisions'', resulting in records of many search key values being in the same bucket.
o Choosing the number to be twice the number of search key values in the file gives a good space/performance tradeoff.

Hash Indices

1. A hash index organizes the search keys with their associated pointers into a hash file structure.

2. We apply a hash function on a search key to identify a bucket, and store the key and its associated pointers in the bucket (or in overflow buckets).

3. Strictly speaking, hash indices are only secondary index structures, since if a file itself is organized using hashing, there is no need for a separate hash index structure on it.

Dynamic Hashing

1. As the database grows over time, we have three options:
o Choose a hash function based on the current file size. Performance degrades as the file grows.
o Choose a hash function based on the anticipated file size. Space is wasted initially.
o Periodically re-organize the hash structure as the file grows. This requires selecting a new hash function, recomputing all addresses, and generating new bucket assignments. It is costly, and shuts down the database.

2. Some hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinking of the database. These are called dynamic hash functions.

o Extendable hashing is one form of dynamic hashing.
o Extendable hashing splits and coalesces buckets as the database size changes.
o This imposes some performance overhead, but space efficiency is maintained.
o As reorganization works on one bucket at a time, the overhead is acceptably low.
3. How does it work?


Figure 11.9: General extendable hash structure.

o We choose a hash function that is uniform and random that generates values over a relatively large range.

o The range is b-bit binary integers (typically b = 32).
o 2^32 is over 4 billion, so we don't generate that many buckets!
o Instead we create buckets on demand, and do not use all b bits of the hash initially.
o At any point we use i bits, where 0 <= i <= b.
o The i bits are used as an offset into a table of bucket addresses.
o The value of i grows and shrinks with the database.
o Figure 11.19 shows an extendable hash structure.
o Note that the i appearing over the bucket address table tells how many bits are required to determine the correct bucket.
o It may be the case that several entries point to the same bucket.
o All such entries will have a common hash prefix, but the length of this prefix may be less than i.
o So we give each bucket j an integer i_j giving the length of its common hash prefix.
o This is shown in Figure 11.9 (textbook 11.19) as i_j.
o The number of bucket-address-table entries pointing to bucket j is then 2^(i - i_j).

4. To find the bucket containing a search key value K_l (see also the sketch at the end of this section):
o Compute h(K_l).
o Take the first i high-order bits of h(K_l).
o Look at the corresponding table entry for this i-bit string.
o Follow the bucket pointer in the table entry.

5. We now look at insertions in an extendable hashing scheme.
o Follow the same procedure as for lookup, ending up in some bucket j.
o If there is room in the bucket, insert the information and insert the record in the file.
o If the bucket is full, we must split the bucket and redistribute the records.
o If the bucket is split, we may need to increase the number of bits we use from the hash.

6. Two cases exist:

1. If i = i_j, then only one entry in the bucket address table points to bucket j.
o Then we need to increase the size of the bucket address table so that we can include pointers to the two buckets that result from splitting bucket j.
o We increment i by one, thus considering more of the hash, and doubling the size of the bucket address table.
o Each entry is replaced by two entries, each containing the original value.
o Now two entries in the bucket address table point to bucket j.
o We allocate a new bucket z, and set the second pointer to point to z.
o Set i_j and i_z to i.
o Rehash all records in bucket j, which are put in either j or z.
o Now insert the new record.
o It is remotely possible, but unlikely, that the new hash will still put all of the records in one bucket.
o If so, split again and increment i again.

2. If i > i_j, then more than one entry in the bucket address table points to bucket j.
o Then we can split bucket j without increasing the size of the bucket address table (why?).
o Note that all entries that point to bucket j correspond to hash prefixes that have the same value on the leftmost i_j bits.
o We allocate a new bucket z, and set i_j and i_z to the original i_j value plus 1.
o Now adjust the entries in the bucket address table that previously pointed to bucket j.
o Leave the first half pointing to bucket j, and make the rest point to bucket z.
o Rehash each record in bucket j as before.
o Reattempt the new insert.

7. Note that in both cases we only need to rehash records in bucket j.
8. Deletion of records is similar. Buckets may have to be coalesced, and the bucket address table may have to be halved.
9. Insertion is illustrated for the example deposit file of Figure 11.20.
o 32-bit hash values on bname are shown in Figure 11.21.
o An initial empty hash structure is shown in Figure 11.22.
o We insert records one by one.


o We (unrealistically) assume that a bucket can only hold 2 records, in order to illustrate both situations described.

o As we insert the Perryridge and Round Hill records, this first bucket becomes full.

o When we insert the next record (Downtown), we must split the bucket.

o Since i = i_j, we need to increase the number of bits we use from the hash.
o We now use 1 bit, allowing us 2^1 = 2 buckets.
o This makes us double the size of the bucket address table to two entries.
o We split the bucket, placing the records whose search key hash begins with 1 in the new bucket, and those with a 0 in the old bucket (Figure 11.23).
o Next we attempt to insert the Redwood record, and find it hashes to 1.
o That bucket is full, and i = i_j.
o So we must split that bucket, increasing the number of bits we must use to 2.
o This necessitates doubling the bucket address table again, to four entries (Figure 11.24).
o We rehash the entries in the old bucket.
o We continue on for the deposit records of Figure 11.20, obtaining the extendable hash structure of Figure 11.25.

10. Advantages:

o Extendable hashing provides performance that does not degrade as the file grows.

o Minimal space overhead - no buckets need be reserved for future use. Bucket address table only contains one pointer for each hash value of current prefix length.

11. Disadvantages:
o An extra level of indirection in the bucket address table.
o Added complexity.

12. Summary: A highly attractive technique, provided we accept added complexity.
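To tie the steps above together, here is a compact, self-contained Python sketch of extendable hashing (not from the original notes). The bucket capacity of 2 mirrors the unrealistic assumption in the example, zlib.crc32 merely stands in for a uniform 32-bit hash, and deletion/coalescing and the pathological case where all keys share one hash prefix are omitted.

import zlib

B = 32                  # number of bits in the hash (an assumption, as above)
BUCKET_CAPACITY = 2     # unrealistically small, to mirror the example above

def h(key: str) -> int:
    # Any uniform 32-bit hash will do; crc32 is used purely for illustration.
    return zlib.crc32(key.encode()) & 0xFFFFFFFF

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth      # i_j in the notes
        self.records = {}                   # search key -> record

class ExtendableHash:
    def __init__(self):
        self.global_depth = 0               # i in the notes
        self.table = [Bucket(0)]            # bucket address table, 2**i entries

    def _index(self, key):
        # Offset into the table = first i high-order bits of h(key).
        return h(key) >> (B - self.global_depth)

    def lookup(self, key):
        return self.table[self._index(key)].records.get(key)

    def insert(self, key, record):
        bucket = self.table[self._index(key)]
        if key in bucket.records or len(bucket.records) < BUCKET_CAPACITY:
            bucket.records[key] = record
            return
        # Bucket full: split it; double the table first if i == i_j.
        if bucket.local_depth == self.global_depth:
            self.table = [b for b in self.table for _ in (0, 1)]
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        # Entries sharing the old prefix are split on the next hash bit.
        for idx in range(len(self.table)):
            if self.table[idx] is bucket and (idx >> (self.global_depth - bucket.local_depth)) & 1:
                self.table[idx] = new_bucket
        # Rehash the old bucket's records into the two buckets, then retry.
        old = bucket.records
        bucket.records = {}
        for k, r in old.items():
            self.table[self._index(k)].records[k] = r
        self.insert(key, record)

eh = ExtendableHash()
for bname in ("Perryridge", "Round Hill", "Downtown", "Redwood", "Brighton"):
    eh.insert(bname, {"bname": bname})
print(eh.global_depth, eh.lookup("Redwood"))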

Comparison of Indexing and Hashing

1. To make a wise choice between the methods seen, the database designer must consider the following issues:

o Is the cost of periodic re-organization of index or hash structure acceptable?

o What is the relative frequency of insertions and deletions?
o Is it desirable to optimize average access time at the expense of increasing worst-case access time?
o What types of queries are users likely to pose?

2. The last issue is critical to the choice between indexing and hashing. If most queries are of the form

select A1, A2, ..., An
from r
where Ai = c

then to process this query the system will perform a lookup on an index or hash structure for attribute Ai with value c.

3. For these sorts of queries a hashing scheme is preferable.
o An index lookup takes time proportional to the log of the number of values in r for Ai.
o A hash structure provides an average lookup time that is a small constant (independent of database size).
4. However, the worst case favors indexing:
o The hash worst case gives time proportional to the number of values in r for Ai.
o The index worst case is still the log of the number of values in r.
5. Index methods are preferable where a range of values is specified in the query, e.g.

select A1, A2, ..., An
from r
where Ai >= v1 and Ai <= v2

This query finds records with Ai values in the range from v1 to v2.

o Using an index structure, we can find the bucket for value v1, and then follow the pointer chain to read the next buckets in alphabetic (or numeric) order until we find v2.
o If we have a hash structure instead of an index, we can find the bucket for v1 easily, but it is not easy to find the ``next bucket''.
o A good hash function assigns values randomly to buckets.
o Also, each bucket may be assigned many search key values, so we cannot chain them together.
o To support range queries using a hash structure, we need a hash function that preserves order.
o For example, if v1 and v2 are search key values and v1 < v2, then h(v1) < h(v2).
o Such a function would ensure that buckets are in key order.


o Order-preserving hash functions that also provide randomness and uniformity are extremely difficult to find.

Index Definition in SQL

1. Some SQL implementations include data definition commands to create and drop indices. The IBM SAA-SQL commands are:

o An index is created by

create index <index-name>
on r (<attribute-list>)

o The attribute list is the list of attributes in relation r that form the search key for the index.
o To create an index on bname for the branch relation:

create index b-index
on branch (bname)

o If the search key is a candidate key, we add the word unique to the definition:

create unique index b-index
on branch (bname)

o If bname is not a candidate key, an error message will appear.
o If the index creation succeeds, any attempt to insert a tuple violating this requirement will fail.
o The unique keyword is redundant if primary keys have been defined with integrity constraints already.

To remove an index, the command is

drop index <index-name>

Multiple-Key Access

1. For some queries, it is advantageous to use multiple indices if they exist.
2. If there are two indices on deposit, one on bname and one on balance, then suppose we have a query like

select balance
from deposit
where bname = ``Perryridge'' and balance = 1000

3. There are 3 possible strategies to process this query:
o Use the index on bname to find all records pertaining to the Perryridge branch. Examine them to see if balance = 1000.
o Use the index on balance to find all records with balance = 1000. Examine them to see if bname = ``Perryridge''.
o Use the index on bname to find pointers to records pertaining to the Perryridge branch. Use the index on balance to find pointers to records with balance = 1000. Take the intersection of these two sets of pointers.
4. The third strategy takes advantage of the existence of multiple indices. This may still not work well if
o there are a large number of Perryridge records, AND
o there are a large number of balance = 1000 records, AND
o only a small number of records pertain to both Perryridge and 1000.
5. To speed up multiple search key queries, special structures can be maintained.

Grid File

1. A grid structure for queries on two search keys is a 2-dimensional grid, or array, indexed by values for the search keys. Figure 11.10 (textbook 11.31) shows part of a grid structure for the deposit file.

Figure 11.10: Grid structure for deposit file.

2. A particular entry in the array contains pointers to all records with the specified search key values.

o No special computations need to be done.
o Only the right records are accessed.
o Can also be used for single search key queries (one column or row).
o Easy to extend to queries on n search keys - construct an n-dimensional array.
o Significant improvement in processing time for multiple-key queries.
o Imposes space overhead.
o Performance overhead on insertion and deletion.

Partitioned Hashing

1. An alternative approach to multiple-key queries:
o To construct a structure for queries on deposit involving bname and cname, we construct a hash structure for the key (cname, bname).
o We split this hash function into two parts, one for each part of the key.
o The first part depends only on the cname value.
o The second part depends only on the bname value.
o Figure 11.32 shows a sample partitioned hash function.
o Note that pairs with the same cname or bname value will have 3 bits the same in the appropriate position.
o To find the balance in all of Williams' accounts at the Perryridge branch, we compute h(Williams, Perryridge) and access the hash structure.
2. The same hash structure can be used to answer a query on one of the search keys:
o Compute the relevant part of the partitioned hash.
o Access the hash structure and scan the buckets for which that part of the hash coincides.
o The text doesn't say so, but the hash structure must have some grid-like form imposed on it to enable searching the structure based on only part of the hash.

3. Partitioned hashing can also be extended to n-key search.

New DB Applications

1. Traditional applications.
1. Four generations of traditional DB systems: file system, hierarchical, CODASYL, and relational. All are designed for business applications: inventory, employee, university, bank, library, airline reservation systems, etc.
2. Common features of `traditional' applications:
1. Uniformity: a large number of similarly structured data items, all of which have the same size.
2. Record orientation: the basic data items consist of fixed-length records.
3. Small data items: each record is short.
4. Atomic fields: fields within a record are short and of fixed length. There is no structure within fields. The 1st normal form holds.

5. Short transactions: within fractions of a second. There is no human interaction with a transaction during its execution.

6. Static conceptual schemes: The database scheme is changed infrequently. Only simple changes are allowed, e.g., in relational systems: create relation, remove relation, add/remove attributes to/from a relation scheme.

2. New applications
1. Engineering databases, CAx: computer-aided design (CAD), computer-aided manufacturing (CAM), computer-aided engineering (CAE), and CIM (computer-integrated manufacturing).

Tasks: A CAD database stores data pertaining to an engineering design, including the components of the items being designed, the inter-relationships of components, and old versions of designs.

2. Computer-aided software engineering (CASE): A CASE database stores data required to assist software developers, including source code, dependencies among software modules, definitions and uses of variables, and the development history of the software system.

3. Multimedia databases: A multimedia database contains spatial data, audio/video data, and the like. DBs of this sort arise from geophysical data, voice mail systems and graphics applications.

4. Office Information Systems (OIS): Office automation includes workstation-based tools for document creation and retrieval, tools for maintaining appointment calendars, and so on. An OIS DB must allow queries pertaining to schedules, documents, and contents of documents.

5. Hypertext databases: Hypertext is text enriched with links that point to other documents, e.g., WWW. Hypertext documents may also be structured in specific ways that help index them. Hypertext database must support the ability to retrieve documents based on links, and to query documents based on their structure.

3. Expected Features for New Applications
1. Complex objects: A complex object is an item that is viewed as a single object in the real world, but that contains other objects (with an arbitrarily complex internal structure). Often objects are stored hierarchically, representing the containment relationship. This leads to object-oriented DBs and nested relational DBs.
2. Behavioral data: Distinct objects may need to respond in different ways to the same command. For example, the deletion of certain tuples may require deleting other tuples, as in the case of weak entities. In CAD and CASE applications the behavior of different objects in response to a given command may differ widely. This behavioral information can be captured by storing executable code with objects in the database. This capability is provided by the methods of OODBs and by the rule base of KB systems.


3. Meta knowledge: General rules about the application rather than specific tuples (i.e., data about data) form an important part of expert databases.

4. Long-duration transactions: CAD and CASE applications involve human interaction with the data: ``what-if'' modifications that the user may wish to undo, and concurrent designer efforts that may lead to conflicts among transactions. Important concepts: nested transactions and correct, nonserializable executions.

The Object-Oriented Data Model

1. A data model is a logical organization of real-world objects (entities), constraints on them, and the relationships among objects. A DB language is a concrete syntax for a data model. A DB system implements a data model.

2. A core object-oriented data model consists of the following basic object-oriented concepts:

(1) object and object identifier: Any real world entity is uniformly modeled as an object (associated with a unique id: used to pinpoint an object to retrieve).

(2) attributes and methods: every object has a state (the set of values for the attributes of the object) and a behavior (the set of methods - program code - which operate on the state of the object). The state and behavior encapsulated in an object are accessed or invoked from outside the object only through explicit message passing.

[ An attribute is an instance variable, whose domain may be any class: user-defined or primitive. A class composition hierarchy (aggregation relationship) is orthogonal to the concept of a class hierarchy. The link in a class composition hierarchy may form cycles. ]

(3) class: a means of grouping all the objects which share the same set of attributes and methods. An object must belong to only one class as an instance of that class (instance-of relationship). A class is similar to an abstract data type. A class may also be primitive (no attributes), e.g., integer, string, Boolean.

(4) Class hierarchy and inheritance: derive a new class (subclass) from an existing class (superclass). The subclass inherits all the attributes and methods of the existing class and may have additional attributes and methods. single inheritance (class hierarchy) vs. multiple inheritance (class lattice).

Object Structure

1. The object-oriented paradigm is based on encapsulating code and data into a single unit. Conceptually, all interactions between an object and the rest of the system are via messages. Thus, the interface between an object and the rest of the system is defined by a set of allowed messages.


2. In general, an object has associated with it:
o A set of variables that contain the data for the object. The value of each variable is itself an object.
o A set of messages to which the object responds.
o A set of methods, each of which is a body of code that implements a message; a method returns a value as the response to the message.
3. Motivation for using messages and methods:

All employee objects respond to the annual-salary message, but with different computations for managers, tellers, etc. By encapsulating within the employee object itself the information about how to compute the annual salary, all employee objects present the same interface.

Since the only external interface presented by an object is the set of messages to which it responds, it is possible to (i) modify the definition of methods and variables without affecting the rest of the system, and (ii) replace a variable with the method that computes a value, e.g., age from birth_date.

The ability to modify the definition of an object without affecting the rest of the system is considered to be one of the major advantages of the OO programming paradigm.

4. Methods of an object may be classified as either read-only or update. Message can also be classified as read-only or update. Derived attributes of an entity in the ER model can be expressed as read-only messages.

Object Classes

1. Usually, there are many similar objects in a DB. By ``similar'', we mean that they respond to the same messages, use the same methods, and have variables of the same name and type. We group similar objects to form a class. Each such object is called an instance of its class. E.g., in a bank DB, customers, accounts and loans are classes.

2. The definition of the class employee, written in pseudo-code. The definition shows the variables and the messages to which the objects of the class respond, but not the methods that handle the messages.

class employee {
    /* Variables */
    string name;
    string address;
    date   start-date;
    int    salary;
    /* Messages */
    int    annual-salary();
    string get-name();
    string get-address();
    int    set-address(string new-address);
    int    employment-length();
};
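A minimal Python rendering of the same class (not part of the original notes), showing the methods that would handle these messages; the 12-month salary computation and the date handling are illustrative assumptions.

from datetime import date

class Employee:
    def __init__(self, name, address, start_date, salary):
        # Variables (the object's state)
        self.name = name
        self.address = address
        self.start_date = start_date
        self.salary = salary            # monthly salary, an assumption

    # Methods handling the messages from the pseudo-code definition
    def annual_salary(self):
        return 12 * self.salary

    def get_name(self):
        return self.name

    def get_address(self):
        return self.address

    def set_address(self, new_address):
        self.address = new_address
        return self.address

    def employment_length(self):
        return (date.today() - self.start_date).days

e = Employee("Jones", "Main St", date(2010, 1, 1), 2000)
print(e.annual_salary(), e.employment_length())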


3. Class: (i) captures the instance-of relationship, (ii) is the basis on which a query may be formulated, (iii) enhances the integrity of OO systems by introducing type checking, and (iv) reduces replication of names and integrity-related specifications among objects in the same class.

4. The concept of classes is similar to the concept of abstract data types. There are several additional aspects to the class concept beyond those of ADTs. To represent these properties, we treat each class as itself being an object.

Metaclass: the class of a class. Most OODB systems do not support the strict notion of metaclass. In ORION, CLASS is the root of the class hierarchy (the metaclass of all other classes). A class object includes

o a set-valued variable whose value is the set of all objects that are instances of the class,

o implementation of a method for the message new, which creates a new instance of the class.

Inheritance

1. An object-oriented database schema typically requires a large number of classes. Often, however, several classes are similar. For example, bank employees are similar to customers.

2. In order to allow the direct representation of similarities among classes, we need to place classes in a specialization hierarchy. E.g., Fig. 8.1 is a specialization hierarchy for the ER model.

Figure 8.1: Specialization hierarchy for the banking example

The concept of a class hierarchy is similar to that of specialization in the ER model. The corresponding class hierarchy is shown in Fig. 8.2.


Figure 8.2: Class hierarchy corresponding to the banking example

The class hierarchy can be defined in pseudo-code in Fig. 8.3, in which the variables associated with each class are as follows. For brevity, we do not present the methods associated with these classes.

Figure 8.3: Definition of class hierarchy in pseudo-code

3. The keyword isa is used to indicate that a class is a specialization of another class. The specializations of a class are called subclasses. E.g., employee is a subclass of person; teller is a subclass of employee. Conversely, employee is a superclass of teller.

4. Class hierarchy and inheritance of properties from more general classes. E.g., an object representing an officer contains all the variables of classes officer, employee, and person. Methods are inherited in a manner identical to inheritance of variables.

5. An important benefit of inheritance in OO systems is the notion of substitutability: any method of a class A can equally well be invoked with an object belonging to any subclass B of A. This characteristic leads to code reuse: methods and functions in class A (such as get-name() in class person) do not have to be rewritten for objects of class B (see the sketch at the end of this section).

6. Two plausible ways of associating objects with nonleaf classes:
o associate with the employee class all employee objects, including those that are instances of officer, teller, and secretary;
o associate with the employee class only those employee objects that are instances of neither officer, nor teller, nor secretary.

Typically, the latter choice is made in OO systems. It is possible to determine the set of all employee objects in this case by taking the union of those objects associated with all classes in the subtree rooted at employee.

7. Most OO systems allow specialization to be partial, i.e., they allow objects that belong to a class such as employee that do not belong to any of that class's subclasses.
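The sketch referred to above: a minimal Python illustration (not from the original notes) of part of the person/employee/teller hierarchy and of substitutability; the variables and method bodies are illustrative assumptions.

class Person:
    def __init__(self, name, address):
        self.name = name
        self.address = address

    def get_name(self):                 # defined once, inherited by all subclasses
        return self.name

class Employee(Person):                 # employee isa person
    def __init__(self, name, address, salary):
        super().__init__(name, address)
        self.salary = salary

    def annual_salary(self):
        return 12 * self.salary

class Teller(Employee):                 # teller isa employee
    def __init__(self, name, address, salary, hours_per_week):
        super().__init__(name, address, salary)
        self.hours_per_week = hours_per_week

# Substitutability: code written for Person works unchanged for a Teller.
def greet(p: Person) -> str:
    return "Hello, " + p.get_name()

print(greet(Teller("Jones", "Main St", 1500, 35)))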

Multiple Inheritance

1. In most cases, a tree-structured organization of classes is adequate to describe applications. In such cases, all superclasses of a class are ancestors or descendants of one another in the hierarchy. However, there are situations that cannot be represented well in a tree-structured class hierarchy.

2. Example. We could create subclasses: part-time-teller, full-time-teller, etc., as shown in Fig. 8.4. But there are problems: (1) redundancy leads to potential inconsistency on updates; and (2) the hierarchy cannot represent full- or part-time employees who are neither secretaries nor tellers.

Figure 8.4: Class hierarchy for full- and part-time employees.

3. Multiple inheritance: the ability of a class to inherit variables and methods from multiple superclasses.

4. The class/subclass relationship is represented by a rooted directed acyclic graph (DAG) in which a class may have more than one superclass.


Figure 8.5: Class DAG for the banking example.

5. Handling name conflicts: When multiple inheritance is used, there is potential ambiguity if the same variable or method can be inherited from more than one superclass.

6. Example. In our banking example, we may define a variable pay for each of full-time, part-time, teller and secretary as follows:
o full-time: pay is an integer from 0 to 100,000 containing the annual salary.
o part-time: pay is an integer from 0 to 20 containing an hourly rate of pay.
o teller: pay is an integer from 0 to 20,000 containing the annual salary.
o secretary: pay is an integer from 0 to 25,000 containing the annual salary.
7. A part-time-secretary could inherit the definition of pay from either part-time or secretary. We have the following options:
o Include both variables, renaming them to part-time-pay and secretary-pay.
o Choose one or the other based on the order of creation.
o Force the user to make a choice at the time of class definition.
o Treat the situation as an error.

No single solution has been accepted as best, and different systems make different choices.

8. Not all cases of multiple inheritance lead to ambiguity. If, instead of defining pay, we retain the definition of variable salary in class employee, and define it nowhere else, then all the subclasses inherit salary from employee (no ambiguity).

9. We can use multiple inheritance to model the concept of roles. For example, for subclasses student, teacher, and footballPlayer, an object can belong to several categories at once, and each of these categories is called a role. We can use multiple inheritance to create subclasses such as student-teacher, student-footballPlayer, and so on to model the possibility of an object simultaneously having multiple roles.
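As a hedged illustration of the name-conflict options above, here is a small Python sketch (not from the original notes). Python happens to resolve the conflict through its method resolution order, which corresponds to the ``choose one or the other based on the order'' option; the pay values are illustrative.

class PartTime:
    def __init__(self):
        self.pay = 15          # hourly rate (0..20), an illustrative value

class Secretary:
    def __init__(self):
        self.pay = 25_000      # annual salary (0..25,000), an illustrative value

class PartTimeSecretary(PartTime, Secretary):
    def __init__(self):
        # super() follows the method resolution order (MRO): the superclass
        # listed first wins, so the part-time meaning of pay is inherited.
        super().__init__()

print(PartTimeSecretary().pay)                        # -> 15
print([c.__name__ for c in PartTimeSecretary.__mro__])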


Object Identity

1. Object identity: An object retains its identity even if some or all of the values of variables or definitions of methods change over time.

This concept of object identity is necessary in applications but does not apply to tuples of a relational database.

2. Object identity is a stronger notion of identity than typically found in programming languages or in data models not based on object orientation.

3. Several forms of identity:
o value: A data value is used for identity (e.g., the primary key of a tuple in a relational database).
o name: A user-supplied name is used for identity (e.g., a file name in a file system).
o built-in: A notion of identity is built into the data model or programming language, and no user-supplied identifier is required (e.g., in OO systems).

4. Object identity is typically implemented via a unique, system-generated OID. The value of the OID is not visible to the external user, but is used internally by the system to identify each object uniquely and to create and manage inter-object references.

5. There are many situations where having the system generate identifiers automatically is a benefit, since it frees humans from performing that task. However, this ability should be used with care. System-generated identifiers are usually specific to the system, and have to be translated if data are moved to a different database system. System-generated identifiers may be redundant if the entities being modeled already have unique identifiers external to the system, e.g., SIN#.
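A minimal sketch (not from the notes) of a system-generated OID, using Python's uuid module as a stand-in for a DBMS-internal identifier generator; the class and attribute names are illustrative.

import uuid

class DBObject:
    def __init__(self):
        # A system-generated, globally unique OID; it never changes, even if
        # the object's other variables are later updated.
        self._oid = uuid.uuid4()

    @property
    def oid(self):
        return self._oid

class Account(DBObject):
    def __init__(self, balance):
        super().__init__()
        self.balance = balance

a = Account(100)
a.balance = 500          # the state changes, the identity does not
print(a.oid)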

Object Containment

1. Objects that contain other objects are called complex or composite objects. There can be multiple levels of containment, forming a containment hierarchy among objects.

2. Example: A bicycle design database:

Figure 8.6: Containment hierarchy for bicycle-design database.

Fig. 8.6 shows the containment relationship in a schematic way by listing class names. Thus the links between classes must be interpreted as is-part-of, rather than the is-a interpretation of links in an inheritance hierarchy.


3. Containment allows data to be viewed at different granularities by different users, e.g., a wheel by the wheel designer but the whole bicycle by a salesperson.

The containment hierarchy is used to find all objects contained in a bicycle object.

4. In certain applications, an object may be contained in several objects. In such cases, the containment relationship is represented by a DAG rather than by a hierarchy.

Object-Oriented Languages

The expression of object-orientation can be done in one of two ways.

1. The concepts of object-orientation can be used purely as a design tool, and are encoded into, e.g., a relational database. E.g., ER modeling.

2. The concepts of object-orientation are incorporated into a language that is used to manipulate the database. There are several possible languages into which the concepts can be integrated.

1. Extend a DML such as SQL by adding complex types and object-orientation. Systems that provide object-oriented extensions to relational systems are called object-relational systems.

2. Take an existing object-oriented programming language and extend it to deal with databases. Such languages are called persistent programming languages.

Persistent Programming Languages

1. Persistent data: data that continue to exist even after the program that created it has terminated.

2. A persistent programming language is a programming language extended with constructs to handle persistent data. It is distinguished from embedded SQL in at least two ways:

1. In a persistent programming language, the query language is fully integrated with the host language and both share the same type system. Any format changes required in databases are carried out transparently.

Compare this with embedded SQL, where (1) the host language and DML have different type systems, so code conversion operates outside the OO type system and hence has a higher chance of undetected errors; and (2) format conversion takes a substantial amount of code.

2. Using Embedded SQL, a programmer is responsible for writing explicit code to fetch data into memory or store data back to the database.

In a persistent programming language, a programmer can manipulate persistent data without having to write such code explicitly.


Drawbacks: (1) Powerful but easy to make programming errors that damage the database; (2) harder to do automatic high-level optimization; and (3) do not support declarative querying well.

Persistence of Objects

Several approaches have been proposed to make the objects persistent.

1. persistence by class. Declare class to be persistent: all objects of the class are then persistent objects. Simple, not flexible since it is often useful to have both transient and persistent objects in a single class. In many OODB systems, declaring a class to be persistent is interpreted as ``persistable'' -- objects in the class potentially can be made persistent.

2. persistence by creation. Introduce new syntax to create persistent objects.
3. persistence by marking. Mark an object as persistent after it is created (and before the program terminates).
4. persistence by reference. One or more objects are explicitly declared as (root) persistent objects. All other objects are persistent if and only if they are referred to, directly or indirectly, from a root persistent object. It is easy to make an entire data structure persistent by merely declaring its root as persistent, but it is expensive for a database system to follow the chains to detect which objects are persistent.

Object Identity and Pointers

1. The association of an object with a physical location in storage (as in C++) may change over time.

2. There are several degrees of permanence of identity:
o intraprocedure: Identity persists only during the execution of a single procedure, e.g., local variables within procedures.
o intraprogram: Identity persists only during the execution of a single program or query, e.g., global variables in programming languages, and main-memory or virtual-memory pointers.

o interprogram: Identity persists from one program execution to another, e.g., pointers to file system data on disk but may change if the way data is stored in the file system is changed.

o persistent: Identity persists not only among program executions but also among structural reorganizations of the data. This is the persistent form of identity required for object-oriented systems.

3. In persistent extension of C++, object identifiers are implemented as ``persistent pointers'' which can be viewed as a pointer to an object in the database.

Storage and Access of Persistent Objects

1. How are objects stored in a database?


Code (that implements methods) should be stored in the database as part of the schema, along with type definitions, but many implementations store them outside of the database, to avoid having to integrate system software such as compilers with the database system.

Data: stored individually for each object.

2. How do we find the objects?
1. Give names to objects as we give names to files: this works only for small sets of objects.
2. Expose object identifiers or persistent pointers to the objects.
3. Store collections of objects and allow programs to iterate over the collections to find required objects. The collections can be modeled as objects of a collection type. A special case of a collection is a class extent, which is the collection of all objects belonging to the class.

Most OODB systems support all three ways of accessing persistent objects. All objects have object identifiers. Names are typically given only to class extents and other collection objects, and perhaps to other selected objects, but most objects are not given names. Class extents are usually maintained for all classes that can have persistent objects, but in many implementations they contain only the persistent objects of the class.

OBJECT MODELLING

Objects are the physical and conceptual things we find in the universe around us. Hardware, software, documents, human beings, and even concepts are all examples of objects. For purposes of modeling his or her company, a chief executive officer could view employees, buildings, divisions, documents, and benefits packages as objects. An automotive engineer would see tires, doors, engines, top speed, and the current fuel level as objects. Atoms, molecules, volumes, and temperatures would all be objects a chemist might consider in creating an object-oriented simulation of a chemical reaction. Finally, a software engineer would consider stacks, queues, windows, and check boxes as objects.

Objects are thought of as having state. The state of an object is the condition of the object, or a set of circumstances describing the object. It is not uncommon to hear people talk about the "state information" associated with a particular object. For example, the state of a bank account object would include the current balance, the state of a clock object would be the current time, the state of an electric light bulb would be "on" or "off." For complex objects like a human being or an automobile, a complete description of the state might be very complex. Fortunately, when we use objects to model real world or imagined situations, we typically restrict the possible states of the objects to only those that are relevant to our models.


We also think of the state of an object as something that is internal to an object. For example, if we place a message in a mailbox, the (internal) state of the mailbox object is changed, whereas the (internal) state of the message object remains unchanged.

Sometimes people think of objects as being strictly static. That is, the state of an object will not change unless something outside of the object requests the object to change its state. Indeed, many objects are passive (static). A list of names does not spontaneously add new names to itself, nor would we expect it to spontaneously delete names from itself.

However, it is possible for some objects to change their own state. If an object is capable of spontaneously changing its own state, we refer to it as an "object with life." (Objects with life are sometimes also called "active objects" or "actors.") Clocks and timers are common examples of objects with life. If we were modeling a business process, we would recognize that salespeople and customers were also objects with life.

Classes, Metaclasses, Parameterized Classes, and Exemplars

There are two broad categories of objects: classes and instances. Users of object-oriented technology usually think of classes as containing the information necessary to create instances, i.e., the structure and capabilities of an instance is determined by its corresponding class. There are three commonly used (and different) views on the definition for "class":

A class is a pattern, template, or blueprint for a category of structurally identical items. The items created using the class are called instances. This is often referred to as the "class as a `cookie cutter'" view. As you might guess, the instances are the "cookies."

A class is a thing that consists of both a pattern and a mechanism for creating items based on that pattern. This is the "class as an `instance factory'" view; instances are the individual items that are "manufactured" (created) using the class's creation mechanism.

A class is the set of all items created using a specific pattern. Said another way, the class is the set of all instances of that pattern.

We should note that it is possible for an instance of a class to also be a class. A metaclass is a class whose instances themselves are classes. This means when we use the instance creation mechanism in a metaclass, the instance created will itself be a class. The instance creation mechanism of this class can, in turn, be used to create instances -- although these instances may or may not themselves be classes.

A concept very similar to the metaclass is the parameterized class. A parameterized class is a template for a class wherein specific items have been identified as being required to create non-parameterized classes based on the template. In effect, a parameterized class can be viewed as a "fill in the blanks" version of a class. One cannot directly use the instance creation mechanism of a parameterized class. First, we must supply the required parameters, resulting in the creation of a non-parameterized class. Once we have a non-parameterized class, we can use its creation mechanisms to create instances.
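In a language such as Python, a parameterized class can be approximated with a generic class; the following sketch (not from the original article) treats the type parameter as the "blank" to be filled in. The Stack class and its methods are illustrative assumptions.

from typing import Generic, TypeVar, List

T = TypeVar("T")          # the "blank" to be filled in

class Stack(Generic[T]):
    # A parameterized class: Stack by itself is a template; supplying the
    # parameter (e.g., Stack[int]) yields a usable, non-parameterized class.
    def __init__(self) -> None:
        self._items: List[T] = []

    def push(self, item: T) -> None:
        self._items.append(item)

    def pop(self) -> T:
        return self._items.pop()

s: Stack[int] = Stack()   # "filling in the blank" with int
s.push(42)
print(s.pop())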

In this article, we will use the term "class" to mean metaclass, parameterized class, or a class that is neither a metaclass nor a parameterized class. We will make a distinction only when it is necessary to do so. Further, we will occasionally refer to "non-class instances." A non-class instance is an instance of a class, but is itself not a class. An instance of a metaclass, for example, would not be a non-class instance.

In this article, we will sometimes refer to "instantiation." Instantiation has two common meanings:

as a verb, instantiation is the process of creating an instance of a class, and as a noun, an instantiation is an instance of a class.

Some people restrict the use of the term "object" to instances of classes. For these people, classes are not objects. However, when these people are confronted with the concepts of metaclasses and parameterized classes, they have a difficulty attempting to resolve the "problems" these concepts introduce. For example, is a class that is an instance of a metaclass an object -- even though it is itself a class? In this article, we will use the term "object" to refer to both classes and their instances. We will only distinguish between the two when needed.

Black Boxes and Interfaces

Objects are "black boxes." Specifically, the underlying implementations of objects are hidden from those that use the object. In object-oriented systems, it is only the producer (creator, designer, or builder) of an object that knows the details about the internal construction of that object. The consumers (users) of an object are denied knowledge of the inner workings of the object, and must deal with an object via one of its three distinct interfaces:

the "public" interface. This is the interface that is open (visible) to everybody.
the "inheritance" interface. This is the interface that is accessible only by direct specializations of the object. (We will discuss inheritance and specialization later in this chapter.) In class-based object-oriented systems, only classes can provide an inheritance interface.

the "parameter" interface. In the case of parameterized classes, the parameter interface defines the parameters that must be supplied to create an instance of the parameterized class.

Another way of saying that an item is in the public interface of an object is to say that the object "exports" that item. Similarly, when an object requires information from outside of itself (e.g., as with the parameters in a parameterized class), we can say that the object needs to "import" that information.


Aggregation

It is, of course, possible for objects to be composed of other objects. Aggregation is either:

the process of creating a new object from two or more other objects, or
an object that is composed of two or more other objects.

For example, a date object could be fashioned from a month object, a day object, and a year object. A list of names object, for example, can be thought of as containing many name objects.
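A minimal Python sketch (not from the original article) of the date example: a Date object aggregated from month, day, and year objects; the string formatting is an illustrative choice.

class Month:
    def __init__(self, value):
        self.value = value

class Day:
    def __init__(self, value):
        self.value = value

class Year:
    def __init__(self, value):
        self.value = value

class Date:
    # An aggregate object fashioned from a month, a day, and a year object.
    def __init__(self, month: Month, day: Day, year: Year):
        self.month = month
        self.day = day
        self.year = year

    def __str__(self):
        return f"{self.year.value}-{self.month.value:02d}-{self.day.value:02d}"

print(Date(Month(11), Day(2), Year(2014)))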

Specialization and Inheritance

Aggregation is not the only way in which two objects can be related. One object can be a specialization of another object. Specialization is either:

the process of defining a new object based on a (typically) more narrow definition of an existing object, or

an object that is directly related to, and more narrowly defined than, another object.

Specialization is usually associated with classes. It is usually only in the so-called "classless" object-oriented systems that we think of specialization for objects other than classes.

Depending on their technical background, there are a number of different ways in which people express specialization. For example, those who are familiar with an object-oriented programming language called Smalltalk refer to specializations as "subclasses" and to the corresponding generalizations of these specializations as "superclasses." Those with a background in the C++ programming language use the term "derived class" for specialization and "base class" for corresponding generalizations.

It is common to say that everything that is true for a generalization is also true for its corresponding specialization. We can, for example, define "checking accounts" and "savings accounts" as specializations of "bank accounts." Another way of saying this is that a checking account is a kind of bank account, and a savings account is a kind of bank account. Still another way of expressing this idea is to say that everything that was true for the bank account is also true for the savings account and the checking account.

In an object-oriented context, we speak of specializations as "inheriting" characteristics from their corresponding generalizations. Inheritance can be defined as the process whereby one object acquires (gets, receives) characteristics from one or more other objects. Some object-oriented systems permit only single inheritance, a situation in which a specialization may only acquire characteristics from a single generalization. Many object-oriented systems, however, allow for multiple inheritance, a situation in which a specialization may acquire characteristics from two or more corresponding generalizations.

Our previous discussion of the bank account, checking account, and savings account was an example of single inheritance. A telescope and a television set are both specializations of "device that enables one to see things far away." A television set is also a kind of "electronic device." You might say that a television set acquires characteristics from two different generalizations, "device that enables one to see things far away" and "electronic device." Therefore, a television set is a product of multiple inheritance.

Abstract Classes

We usually think of classes as being complete definitions. However, there are situations where incomplete definitions are useful, and classes that represent these incomplete definitions are equally useful. For example, in everyday conversation, we might talk about such items as bank accounts, insurance policies, and houses. In object-oriented thinking, we often isolate useful, but incomplete, concepts such as these into their own special classes.

Abstract classes are classes that embody coherent and cohesive, but incomplete, concepts, and in turn, make these characteristics available to their specializations via inheritance. People sometimes use the terms "partial type" and "abstract superclass" as synonyms for abstract class. While we would never create instances of abstract classes, we most certainly would make their individual characteristics available to more specialized classes via inheritance.

For example, consider the concept of an automobile. On one hand, most people know what an automobile is. On the other hand, "automobile" is not a complete definition for any vehicle. It would be quite accurate to describe "automobile" as the set of characteristics that make a thing an automobile, in other words, the "essence of automobile-ness."
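A minimal Python sketch (not from the original article) of an abstract class using the standard abc module; the bank-account example and the fee values are illustrative assumptions.

from abc import ABC, abstractmethod

class BankAccount(ABC):
    # An abstract class: a coherent but incomplete concept. We never create
    # instances of it directly; its characteristics are made available to
    # specializations via inheritance.
    def __init__(self, balance=0):
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount

    @abstractmethod
    def monthly_fee(self):
        # Incomplete: each specialization must say how its fee is computed.
        ...

class CheckingAccount(BankAccount):
    def monthly_fee(self):
        return 5

class SavingsAccount(BankAccount):
    def monthly_fee(self):
        return 0

print(CheckingAccount(100).monthly_fee())   # BankAccount() itself would raise TypeError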

Operations

The public interface of an object typically contains three different categories of items:

operations (sometimes referred to as "method selectors," "method interfaces," "messages," or "methods"),

constants, and exceptions.

An operation in the public interface of an object advertises a functional capability of that object. For example, "deposit" would be an operation in the public interface of a bank account object, "what is current temperature" would be an operation in the public interface of a temperature sensor object, and "increment" would be an operation in the public interface of a counter object.

The actual algorithm for accomplishing an operation is referred to as a method. Unlike operations, methods are not in the public interface for an object. Rather, methods are hidden on the inside of an object. So, while users of bank account objects would know that they could make a deposit into a bank account, they would be unaware of the details as to how that deposit actually got credited to the bank account.
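A small Python sketch (not from the original article) separating the public operation from the hidden method that accomplishes it; the ledger-based implementation is purely an illustrative assumption.

class BankAccount:
    def __init__(self):
        self._ledger = []              # hidden implementation detail

    # "deposit" is an operation: part of the public interface.
    def deposit(self, amount):
        self._record("credit", amount)

    # The method that actually does the work is hidden inside the object;
    # callers never see how the deposit gets credited to the account.
    def _record(self, kind, amount):
        self._ledger.append((kind, amount))

    @property
    def balance(self):
        return sum(a for k, a in self._ledger if k == "credit")

acct = BankAccount()
acct.deposit(100)
print(acct.balance)    # -> 100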


MODULE 5: OPERATIONS AND MANAGEMENT
Client / Server and Databases – Data Warehousing – Query Processing – Concurrency Management – Heterogeneous and Homogenous Systems – Distributed Databases – Controls – Atomicity, Recovery – Security, Back-up and Recovery.

The client/server model is a computing model that acts as a distributed application which partitions tasks or workloads between the providers of a resource or service, called servers, and service requesters, called clients.[1] Often clients and servers communicate over a computer network on separate hardware, but both client and server may reside in the same system. A server machine is a host that is running one or more server programs which share their resources with clients. A client does not share any of its resources, but requests a server's content or service function. Clients therefore initiate communication sessions with servers which await incoming requests.

Schematic client-server interaction.

The client/server characteristic describes the relationship of cooperating programs in an application. The server component provides a function or service to one or many clients, which initiate requests for such services. A notable example of this is the way OpenGL treats the video card of a computer as a server, with the actual application making rendering requests to it. This model is further solidified with the OpenGL Shading Language, with the user writing small programs that live in video memory, and are requested from the main program through the graphics driver.

Functions such as email exchange, web access and database access are built on the client/server model. Users accessing banking services from their computer use a web browser client to send a request to a web server at a bank. That web server runs a program which may in turn, forward the request to its own database client program, which sends a request to the bank's database server (which runs on another computer) to retrieve the account information. The balance and transaction records are returned to the bank database client, which in turn serves it back to the user's web browser client, displaying the results to the user. The client–server model has become one of the central ideas of network computing. Many business applications being written today use the client–server model, as do the Internet's main application protocols, such as HTTP, SMTP, Telnet, and DNS.

The interaction between client and server is often described using sequence diagrams. The Unified Modeling Language has support for sequence diagrams.


Specific types of clients include web browsers, email clients, and online chat clients.

Specific types of servers include web servers, ftp servers, application servers, database servers, name servers, mail servers, file servers, print servers, and terminal servers. Most web services are also types of servers.

Comparison to peer-to-peer architecture

A client/server network involves multiple clients connecting to a single, central server. The file server on a client/server network must have robust hardware configurations to handle large volumes of client service requests. Hence servers usually are high speed computers with large hard disk capacity.

By contrast, peer-to-peer networks involve two or more computers pooling individual resources such as disk drives, CD-ROMs and printers. These shared resources are available to every computer in the network, while each of them communicates in a session. Each computer acts as both the client and the server, which means all the computers on the network are equals; that is where the term peer-to-peer comes from. The advantage of peer-to-peer networking is the simpler control concept, requiring no additional coordination entity and not delaying transfers by routing via server entities. However, session collisions may be more frequent than with routing via server nodes.

In a peer-to-peer network, software applications can be installed on a single computer and shared by every computer in the network. They are also cheaper to set up because most desktop operating systems have the software required for the network installed by default. On the other hand, the client/server model works with any size or physical layout of LAN and doesn't tend to slow down under heavy use.[3]

Peer-to-peer networks are typically less secure than client/server networks because security is handled by the individual computers, not controlled and supervised on the network as a whole. The resources of the computers in the network can become congested as they have to support not only the workstation user, but also the requests from network users. It may be difficult to provide systemwide services when the client operating system typically used in this type of network is incapable of hosting the service.

Client/server networks, with their additional capacities, have a higher initial setup cost than peer-to-peer networks. In the long term, however, administering a client/server network in which applications are largely server-hosted saves administrative effort compared with maintaining application settings on each individual client. In addition, concentrating functions on high-performance servers means that the clients can have lower performance specifications.

It is possible to set up a server on a modern desktop computer, but organizations are usually advised to invest in enterprise-wide server facilities, with a standardised choice of hardware and software and a systematic, remotely operable administration strategy. It is easier to configure and manage the server hardware and software centrally than to meet the distributed administration requirements of a large number of individual computers.

In centralized database systems, the database system, application programs, and user interface are all executed on a single system, and dumb terminals are connected to it. Only the processing power of that single system is utilized; the dumb terminals are used only to display the information. As personal computers became faster, more powerful, and cheaper, database systems started to exploit the processing power available at the user's side, which led to the development of client/server architecture. In client/server architecture, the processing power of the computer system at the user's end is utilized by processing the user interface on that system. A client is a computer system that sends requests to a server connected to the network, and a server is a computer system that receives the request, processes it, and returns the requested information to the client. Client and server are usually present at different sites. The end users (remote database users) work on client computer systems, while the database system runs on the server. Servers can be of several types, for example, file servers, printer servers, web servers, database servers, etc. The client machines have user interfaces that help users utilize the servers, and they also provide the local processing power to run local applications on the client side.

There are two approaches to implementing client/server architecture. In the first approach, the user interface and application programs are placed on the client side and the database system on the server side. This architecture is called two-tier architecture. The application programs that reside at the client side invoke the DBMS at the server side. Application program interface standards such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are used for interaction between the client and the server.
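
The sketch below is one hedged illustration of a two-tier client in Python: the application program on the client side opens an ODBC connection and ships SQL to the DBMS on the server side. It assumes the third-party pyodbc package is installed and that a data source name "BankDSN", the credentials, and an accounts table already exist; all of these names are illustrative, not taken from the notes.

import pyodbc

def fetch_balance(account_no: str) -> float:
    # The connection string names an ODBC data source configured on the client.
    conn = pyodbc.connect("DSN=BankDSN;UID=clerk;PWD=secret")
    try:
        cur = conn.cursor()
        # The SQL statement is sent to the database server; only the result
        # rows travel back over the network to the client program.
        cur.execute("SELECT balance FROM accounts WHERE account_no = ?", account_no)
        row = cur.fetchone()
        return float(row.balance) if row else 0.0
    finally:
        conn.close()

print(fetch_balance("A-100"))        # requires the assumed DSN and table to exist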

The second approach, three-tier architecture, is primarily used for web-based applications. It adds an intermediate layer, known as the application server (or web server), between the client and the database server. The client communicates with the application server, which in turn communicates with the database server. The application server stores the business rules (procedures and constraints) used for accessing data from the database server. It checks the client's credentials before forwarding a request to the database server; hence, it improves database security.

When a client requests information, the application server accepts the request, processes it, and sends the corresponding database commands to the database server. The database server sends the result back to the application server, which converts it into a GUI format and presents it to the client.
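
A minimal sketch of this middle tier follows, under the assumption that an in-memory SQLite database stands in for the separate database server and that a token-based credential check represents a business rule; the table, token, and response format are all invented for illustration.

import sqlite3

db = sqlite3.connect(":memory:")                     # stand-in for the database server
db.execute("CREATE TABLE accounts (account_no TEXT PRIMARY KEY, balance REAL)")
db.execute("INSERT INTO accounts VALUES ('A-100', 250.0)")
VALID_TOKENS = {"token-42"}                          # stand-in credential store

def handle_balance_request(token: str, account_no: str) -> dict:
    """Application-server logic: check credentials, query the data tier, format the reply."""
    if token not in VALID_TOKENS:                    # business rule: verify the client first
        return {"status": "denied"}
    row = db.execute("SELECT balance FROM accounts WHERE account_no = ?",
                     (account_no,)).fetchone()       # command forwarded to the data tier
    if row is None:
        return {"status": "not found"}
    return {"status": "ok", "balance": f"{row[0]:.2f}"}   # shaped for the client's UI

print(handle_balance_request("token-42", "A-100"))   # {'status': 'ok', 'balance': '250.00'}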

Fundamentals of Distributed Databases

In recent years, the distributed database system has been emerging as an important area of information processing, and its popularity is increasing rapidly. A distributed database is a database that is under the control of a central database management system (DBMS) in which not all storage devices are attached to a common CPU. It may be stored on multiple computers located in the same physical location, or it may be dispersed over a network of interconnected computers; collections of data can thus be distributed across multiple physical locations. In short, a distributed database is a logically interrelated collection of shared data, together with a description of this data, physically distributed over a computer network. A distributed database must ensure the following:

The distribution is transparent – users must be able to interact with the system as if it were a single logical system. This applies, among other things, to system performance and to the methods of access.

The transactions are transparent – each transaction must maintain database integrity across multiple databases. Transactions may also be divided into subtransactions; each subtransaction affects one database system.
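
As a rough illustration of the two requirements above, the sketch below splits one logical transfer into two subtransactions, one per site, and either commits both or rolls both back. Two in-memory SQLite databases stand in for the DBMSs at two sites; the schema, account numbers, and amounts are invented, and a real distributed DBMS would coordinate the commits with a protocol such as two-phase commit.

import sqlite3

site_a = sqlite3.connect(":memory:")                 # DBMS at site A
site_b = sqlite3.connect(":memory:")                 # DBMS at site B
for site, acct, bal in [(site_a, "A-100", 500.0), (site_b, "B-200", 100.0)]:
    site.execute("CREATE TABLE accounts (account_no TEXT PRIMARY KEY, balance REAL)")
    site.execute("INSERT INTO accounts VALUES (?, ?)", (acct, bal))
    site.commit()

def transfer(amount: float) -> bool:
    """One global transaction: debit at site A, credit at site B, all or nothing."""
    try:
        site_a.execute("UPDATE accounts SET balance = balance - ? WHERE account_no = 'A-100'", (amount,))
        site_b.execute("UPDATE accounts SET balance = balance + ? WHERE account_no = 'B-200'", (amount,))
        site_a.commit()                              # a real DDBMS would coordinate these
        site_b.commit()                              # per-site commits (e.g. two-phase commit)
        return True
    except sqlite3.Error:
        site_a.rollback()                            # undo both subtransactions on failure
        site_b.rollback()
        return False

print(transfer(50.0))                                # True; balances become 450.0 and 150.0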

Features of a Distributed DBMS: A Distributed DBMS may have a number of local applications, but it has at least one global application. Thus, a distributed DBMS has the following features:

i. A distributed DBMS is a collection of logically related shared data.

ii. The data in a distributed DBMS is split into a number of fragments or partitions.

iii. Fragments may be replicated in a distributed system.

iv. Fragments/replicas are allocated to different sites (a short sketch of fragmentation, replication, and allocation follows this list).

v. In a distributed system, the sites are linked by a communications network.

vi. The data at each site is under the control of a DBMS.

vii. The DBMS at each site is autonomous in its own right, that is, it can handle local applications independently.

viii. Each DBMS in a distributed system participates in at least one global application.
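
The toy sketch below illustrates points (ii)–(iv): a relation is split into horizontal fragments, one fragment is replicated, and fragments/replicas are allocated to different sites. The branch names, site names, and allocation policy are invented for illustration.

customers = [
    {"id": 1, "branch": "Delhi",   "name": "Asha"},
    {"id": 2, "branch": "Chennai", "name": "Ravi"},
    {"id": 3, "branch": "Delhi",   "name": "Meena"},
]

def fragment_by_branch(rows):
    """Horizontal fragmentation: each fragment holds the rows of one branch."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row["branch"], []).append(row)
    return fragments

fragments = fragment_by_branch(customers)

# Allocation: each fragment is stored at its own branch site, and the Delhi
# fragment is also replicated at headquarters to speed up head-office queries.
allocation = {
    "site_delhi":   [fragments["Delhi"]],
    "site_chennai": [fragments["Chennai"]],
    "site_hq":      [fragments["Delhi"]],            # replica of the Delhi fragment
}
print({site: sum(len(f) for f in frags) for site, frags in allocation.items()})
# {'site_delhi': 2, 'site_chennai': 1, 'site_hq': 2}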

Advantages and Disadvantages of Distributed DBMS

A distributed DBMS has several advantages over traditional centralized DBMSs. These are listed below.

Sharing of information The major advantage of a distributed database system is the provision for sharing information. Users at one site in a distributed system may be able to access data residing at other sites. For example, consider an organization that has a number of branches throughout the country, where each branch stores its own data locally. A user in one branch can still access data from another branch; thus, information sharing is possible in a distributed system.


Components of a Distributed DBMS

A Distributed DBMS controls the storage and efficient retrieval of logically interrelated data that are physically distributed among several sites. Therefore, a distributed DBMS includes the following components.

a. Computer workstations (sites or nodes) – A distributed DBMS consists of a number of computer workstations that form the network system. The distributed database system must be independent of the computer system hardware.

b. Network components (both hardware and software) – Each workstation in a distributed system contains a number of network hardware and software components. These components allow each site to interact and exchange data with each other site. Network system independence is a desirable property of the distributed system.

c. Communication media – In a distributed system, any type of communication (data transfer, information exchange) among nodes is carried out through communication media. This is a very important component of a distributed DBMS. It is desirable that a distributed DBMS be communication media independent, that is, it must be able to support several types of communication media.

d. Transaction processor (TP) – A TP is a software component that resides in each computer connected with the distributed system and is responsible for receiving and processing both local and remote applications’ data requests. This component is also known as the application processor (AP) or the transaction manager (TM).

e. Data processor (DP) – A DP is also a software component that resides in each computer connected with the distributed system and stores and retrieves data located at that site. The DP is also known as the data manager (DM). In a distributed DBMS, a DP may be a centralized DBMS.
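
A toy sketch of points (d) and (e) follows: the transaction processor at the user's site looks up which site stores the requested item and forwards the operation to that site's data processor. The catalogue, site names, and data values are invented; real systems perform this routing inside the distributed DBMS.

# Each site's data processor is represented here by a simple dictionary.
data_processors = {
    "site1": {"A-100": 250.0},
    "site2": {"B-200": 980.5},
}
catalogue = {"A-100": "site1", "B-200": "site2"}     # which site stores each item

class TransactionProcessor:
    """Receives a request at the user's site and routes it to the right DP."""
    def __init__(self, home_site: str):
        self.home_site = home_site

    def read(self, item: str) -> float:
        site = catalogue[item]                       # may be the local or a remote site
        return data_processors[site][item]           # request served by that site's DP

tp = TransactionProcessor("site1")
print(tp.read("B-200"))                              # a remote request, answered by site2's DP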

Objectives of Distributed Concurrency Control

Concurrency control is the activity of coordinating concurrent accesses to a database in a multi-user system. Concurrency control allows users to access a database in a multi-programmed fashion while preserving the consistency of the data. The main technical difficulty in achieving this goal is the necessity to prevent database updates performed by one user from interfering with database retrievals and updates performed by other users. The concurrency control problem is exacerbated in a distributed DBMS (DDBMS), because in a distributed system users may access data stored at many different sites, and a concurrency control mechanism at one site cannot be aware of transactions at other sites instantaneously. The replication of data items in a distributed database adds extra complexity to the concurrency control mechanism. In this context, it is necessary to discuss the objectives of the concurrency control mechanism in a distributed database environment. As in centralized DBMSs, the primary goal of the concurrency control mechanism in a DDBMS is to ensure that the consistency of the data items is preserved and that each atomic action will be completed in a finite time if the system has not failed. In addition, a good concurrency control mechanism for a DDBMS has the following objectives (a simplified, single-site illustration of concurrency control follows the list):

It must be resilient to site and communication failures.

It should permit parallel execution of transactions to achieve maximum concurrency, thus satisfying performance requirements.

Its computational methods and storage mechanisms should be modest, so as to minimize overhead.

It should perform satisfactorily in a network environment, taking into account that communication between sites involves significant delay.

It must impose few constraints on the structure of atomic actions of transactions.
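
The simplified illustration promised above: the sketch below shows, for a single site, what a concurrency control mechanism must prevent. Two concurrent transactions update the same data item, and a lock around the read–modify–write step stops the updates from interfering (no lost update). Extending this coordination across sites, with replicas and communication delays, is the distributed problem described above; the account name and amounts are invented.

import threading

balances = {"A-100": 0}
lock = threading.Lock()                              # lock protecting data item A-100

def deposit_many(times: int, amount: int) -> None:
    for _ in range(times):
        with lock:                                   # acquire before read, release after write
            balances["A-100"] += amount              # the read-modify-write is now atomic

t1 = threading.Thread(target=deposit_many, args=(100_000, 1))
t2 = threading.Thread(target=deposit_many, args=(100_000, 1))
t1.start(); t2.start()
t1.join(); t2.join()
print(balances["A-100"])                             # always 200000 when the lock is used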

Atomicity

The phrase "all or nothing" succinctly describes the first ACID property, atomicity. When an update occurs to a database, either all of the update or none of it becomes available to anyone beyond the user or application performing the update. This update to the database is called a transaction, and it either commits or aborts. This means that a partial fragment of the update can never be left in the database, should a problem occur with either the hardware or the software involved. Features to consider for atomicity (a minimal sketch of the all-or-nothing behaviour follows this list):

a transaction is a unit of operation - either all the transaction's actions are completed or none are

atomicity is maintained in the presence of deadlocks

atomicity is maintained in the presence of database software failures

atomicity is maintained in the presence of application software failures

atomicity is maintained in the presence of CPU failures

atomicity is maintained in the presence of disk failures

atomicity can be turned off at the system level

atomicity can be turned off at the session level
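
The minimal sketch promised above uses Python's built-in sqlite3 module to show the all-or-nothing behaviour: either both updates of a transfer are committed, or neither survives. The schema, the CHECK constraint, and the deliberately failing transfer are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("savings", 100.0), ("current", 20.0)])
conn.commit()

def transfer(src: str, dst: str, amount: float) -> None:
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
        conn.commit()                                # both updates become visible together
    except sqlite3.Error:
        conn.rollback()                              # neither update survives
        print("transfer aborted, database unchanged")

transfer("savings", "current", 500.0)                # violates the CHECK constraint, so it aborts
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('current', 20.0), ('savings', 100.0)]  -- no partial update remains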

Data security is the protection of the database from unauthorized users. Only authorized persons are allowed to access the database. Most users are allowed to access only a part of the database, i.e., the data that is related to them or to their department. Usually, only the DBA or the head of a department can access all the data in the database. Some users may be permitted only to retrieve data, whereas others are allowed to retrieve as well as to update data. Database access is controlled by the DBA, who creates the accounts of users and grants them rights to access the database. Typically, users or groups of users are given usernames protected by passwords.

Most DBMSs provide a security sub-system, which the DBA uses to create user accounts and to specify account restrictions. The user enters his or her account name (or user name) and password to access the data in the database. For example, if you have an account on "hotmail.com" (a popular website), you have to give your correct username and password to access your e-mail account. Similarly, when you insert your ATM card into the automated teller machine (ATM) at a bank, the machine reads the ID number stored on the card and then asks you to enter your PIN code (or password). In this way you can access your account.
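
The sketch below is a hedged illustration of how a DBA-side program might create an account and grant it limited rights through an ODBC connection. The data source name, the admin credentials, and the table names are assumptions, and the exact CREATE USER / GRANT syntax varies between DBMS products; the statements shown loosely follow MySQL's dialect.

import pyodbc

# Connect as the DBA; "BankDSN" and the credentials are assumed to exist.
admin = pyodbc.connect("DSN=BankDSN;UID=dba;PWD=admin-secret")
cur = admin.cursor()
cur.execute("CREATE USER 'clerk'@'%' IDENTIFIED BY 'clerk-password'")
cur.execute("GRANT SELECT ON bankdb.accounts TO 'clerk'@'%'")             # retrieve only
cur.execute("GRANT SELECT, UPDATE ON bankdb.branch_info TO 'clerk'@'%'")  # retrieve and update
admin.commit()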

Homogeneous and Heterogeneous Distributed DBMSs

A distributed DBMS may be classified as homogeneous or heterogeneous. In an ideal distributed database system, the sites would share a common global schema (although some relations may be stored only at some local sites), all sites would run the same database management software, and the sites would be aware of the existence of the other sites. In a distributed system, if all sites use the same DBMS product, it is called a homogeneous distributed database system. However, in reality, a distributed database often has to be constructed by linking multiple already-existing database systems together, each with its own schema and possibly running different database management software. Such systems are called heterogeneous distributed database systems. In a heterogeneous distributed database system, sites may run different DBMS products that need not be based on the same underlying data model, and thus the system may be composed of relational, network, hierarchical, and object-oriented DBMSs. A homogeneous distributed DBMS provides several advantages such as simplicity, ease of design, and incremental growth. It is much easier to design and manage a homogeneous distributed DBMS than a heterogeneous one. In a homogeneous distributed DBMS, adding a new site to the distributed system is much easier, thereby providing incremental growth. These systems also improve performance by exploiting the parallel processing capability of multiple sites.

Functions of Distributed DBMS: A distributed DBMS manages the storage and processing of logically related data on interconnected computers, wherein both data and processing functions are distributed among several sites. Thus, a distributed DBMS has at least all the functionality of a centralized DBMS. In addition, it must provide the following functions to be classified as distributed.

Application interfaces A distributed DBMS provides application interfaces to interact with end-users or application programs and with remote databases within the distributed system. This feature is most important and is provided by all systems that are distributed in nature.