SCHOOL OF INFORMATION SCIENCES AND TECHNOLOGY BTech (Hons) COMPUTER SCIENCE ICS 124: DATABASE DESIGN CONCEPTS INTRODUCTION

Upload: munatsi

Post on 16-Sep-2015

INTRODUCTION

What is a database?

A database is a collection of related data. Data are known facts that can be recorded and that have implicit meaning. A database has the following implicit properties:

- A database represents some aspect of the real world, sometimes called the miniworld or the universe of discourse (UoD). Changes to the miniworld are reflected in the database.
- A database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot correctly be referred to as a database.
- A database is designed, built, and populated with data for a specific purpose. It has an intended group of users and some preconceived applications in which these users are interested.

A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. The DBMS is hence a general-purpose software system that facilitates the processes of defining, constructing, manipulating, and sharing databases among various users and applications.

- Defining a database involves specifying the data types, structures, and constraints for the data to be stored in the database.
- Constructing the database is the process of storing the data itself on some storage medium that is controlled by the DBMS.
- Manipulating a database includes such functions as querying the database to retrieve specific data, updating the database to reflect changes in the miniworld, and generating reports from the data.
- Sharing a database allows multiple users and programs to access the database concurrently.

The DBMS also protects the database. Protection includes both system protection against hardware or software malfunction (or crashes) and security protection against unauthorized or malicious access. A database system is the database and the DBMS software together.
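The defining, constructing, and manipulating steps described above can be sketched in a few lines. This is only an illustration, assuming Python's built-in sqlite3 module as a stand-in DBMS; the table name `student`, its columns, and the sample rows are hypothetical and are not part of the notes.

```python
import sqlite3

# An in-memory SQLite database stands in for the DBMS in this sketch.
conn = sqlite3.connect(":memory:")

# Defining: specify the data types, structure, and constraints.
conn.execute(
    """
    CREATE TABLE student (
        rollno INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        age    INTEGER CHECK (age > 0)
    )
    """
)

# Constructing: store the data itself under the control of the DBMS.
conn.executemany(
    "INSERT INTO student (rollno, name, age) VALUES (?, ?, ?)",
    [(1, "Rahat", 35), (2, "Asha", 21)],  # hypothetical sample rows
)

# Manipulating: update to reflect a change in the miniworld, then query.
conn.execute("UPDATE student SET age = 36 WHERE rollno = 1")
rows = conn.execute(
    "SELECT rollno, name, age FROM student ORDER BY rollno"
).fetchall()
print(rows)
```

The application never states where or how the rows are stored; it only names the structure and the constraints, and the DBMS handles the rest.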

File systems

File processing systems were an early attempt to computerize the manual filing system that we are all familiar with. A file system is a method for storing and organizing computer files and the data they contain to make it easy to find and access them. File systems may use a storage device such as a hard disk or CD-ROM and involve maintaining the physical location of the files.

In our own home, we probably have some sort of filing system, which contains receipts, guarantees, invoices, bank statements, and the like. When we need to look something up, we go to the filing system and search through it starting from the first entry until we find what we want. Alternatively, we may have an indexing system that helps to locate what we want more quickly. For example, we may have divisions in the filing system or separate folders for different types of item that are in some way logically related.

The manual filing system works well when the number of items to be stored is small. It even works quite adequately when there are large numbers of items and we have only to store and retrieve them. However, the manual filing system breaks down when we have to cross-reference or process the information in the files. For example, a typical real estate agent's office might have a separate file for each property for sale or rent, each potential buyer and renter, and each member of staff. Clearly the manual system is inadequate for work that requires cross-referencing these files. The file-based system was developed in response to the needs of industry for more efficient data access. In early processing systems, an organization's information was stored as groups of records in separate files.

In the traditional approach, information was stored in flat files maintained by the file system under the operating system's control. Here, flat files are files containing records that have no structured relationship among them.
The file handling taught in C/C++ is an example of a file processing system. Application programs written in languages such as C and C++ go through the file system to access these flat files.

Characteristics of the File Processing System

Some important characteristics of a file processing system are:

- It is a group of files storing the data of an organization.
- Each file is independent of the others.
- Each file is called a flat file.
- Each file contains and processes information for one specific function, such as accounting or inventory.
- Files are designed by using programs written in programming languages such as COBOL, C, and C++.
- The physical implementation and access procedures are written into the application programs; therefore, physical changes result in intensive rework on the part of the programmer.
- As systems became more complex, file processing systems offered little flexibility, presented many limitations, and were difficult to maintain.

Limitations of the File Processing System/File-Based Approach

The following problems are associated with the file-based approach:

1. Separated and isolated data: To make a decision, a user might need data from two separate files. First, the files had to be evaluated by analysts and programmers to determine the specific data required from each file and the relationships between the data; only then could applications be written in a programming language to process and extract the needed data. Imagine the work involved if data from several files was needed.

2. Duplication of data: Often the same information is stored in more than one file. Uncontrolled duplication of data is undesirable for several reasons:
- Duplication is wasteful. It costs time and money to enter the data more than once.
- It takes up additional storage space, again with associated costs.
- Duplication can lead to loss of data integrity; in other words, the data is no longer consistent.
For example, consider the duplication of data between the Payroll and Personnel departments. If a member of staff moves to a new house and the change of address is communicated only to Personnel and not to Payroll, the person's pay slip will be sent to the wrong address. A more serious problem occurs if an employee is promoted with an associated increase in salary. Again, the change is notified to Personnel but does not filter through to Payroll. Now the employee is receiving the wrong salary. When this error is detected, it will take time and effort to resolve. Both examples illustrate inconsistencies that may result from the duplication of data. As there is no automatic way for Personnel to update the data in the Payroll files, it is not difficult to foresee such inconsistencies arising. Even if Payroll is notified of the changes, it is possible that the data will be entered incorrectly.

3. Data dependence: In file processing systems, files and records were described by specific physical formats that were coded into the application program by programmers. If the format of a certain record was changed, the code in every program that used that format had to be updated. Furthermore, instructions for data storage and access were written into the application's code. Therefore, changes in storage structure or access methods could greatly affect the processing or results of an application.

In other words, in the file-based approach application programs are data dependent. This means that when the physical representation of the data (how the data is physically represented on disk) or the access technique (how it is physically accessed) changes, the application programs are also affected and need modification.
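A minimal sketch of this dependence, assuming a hypothetical student record `1 Rahat 35 Thapar` whose fields are separated by blank spaces. The function name `parse_student` and the field names (`rollno`, `name`, `age`, `institute`) are illustrative assumptions; the point is only that the delimiter is written into the program itself.

```python
def parse_student(line):
    """Parse one student record; the blank delimiter is hard-coded here."""
    rollno, name, age, institute = line.split(" ")
    return {
        "rollno": int(rollno),
        "name": name,
        "age": int(age),
        "institute": institute,
    }


old_format = "1 Rahat 35 Thapar"       # fields separated by blank spaces
new_format = "1; Rahat; 35; Thapar"    # same data, semicolon delimiter

record = parse_student(old_format)

# The same program fails on the new physical format, because the
# knowledge of the delimiter lives inside the application code.
try:
    parse_student(new_format)
    broke = False
except ValueError:
    broke = True

print(record, broke)
```

Every program that reads the file carries its own copy of this format knowledge, so a change of delimiter means finding and editing each of them.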
In other words, application programs depend on how the data is physically stored and accessed. For example, if the physical format of the master/transaction file is changed by modifying the delimiter of the field or record, the application programs that depend on it must be modified.

Let us consider a student file, where information about students is stored in a text file and each field is separated by a blank space:

1 Rahat 35 Thapar

Now suppose the field delimiter changes from a blank space to a semicolon:

1; Rahat; 35; Thapar

The application programs using this file must then be modified, because they must now tokenize the fields on a semicolon, whereas earlier the delimiter was a blank space.

4. Difficulty in representing data from the user's view: To create useful applications for the user, data from various files must often be combined. In file processing it was difficult to determine relationships between isolated data in order to meet user requirements.

5. Data inflexibility: Program-data interdependency and data isolation limited the flexibility of file processing systems in answering users' ad hoc information requests.

6. Incompatible file formats: As the structure of files is embedded in the application programs, the structures depend on the application programming language. For example, the structure of a file generated by a COBOL program may be different from the structure of a file generated by a C program. The direct incompatibility of such files makes them difficult to process jointly.

7. Data security: The security of data is low in a file-based system because the data maintained in the flat files is easily accessible. For example, consider a banking system. The Customer Transaction file has details about the total available balance of all customers. A customer wants information about his account balance. In a file system it is difficult to give the customer access to only his data in the file.
Thus, enforcing security constraints for the entire file or for certain data items is difficult.

8. Transactional problems: The file-based approach does not satisfy transaction properties such as Atomicity, Consistency, Isolation, and Durability, commonly known as the ACID properties. For example, suppose that in a banking system a transaction transfers Rs. 1000 from account A to account B, with the initial values of A and B being Rs. 5000 and Rs. 10000 respectively. If a system crash occurs after the withdrawal of Rs. 1000 from account A but before the deposit of the amount in account B, the system is left in an inconsistent state. Transactions should therefore not execute partially but wholly. This concept is known as atomicity of a transaction (either 0% or 100% of the transaction is carried out). It is difficult to achieve this property in a file-based system.

9. Concurrency problems: When multiple users access the same piece of data during the same interval of time, this is called concurrency. When two or more users read the data simultaneously there is no problem, but when they update a file simultaneously, problems may result. For example, consider a scenario in which transaction T1 transfers an amount of Rs. 1000 from account A to account B (the initial value of A is 5000 and of B is 8000). Meanwhile, another transaction T2, which displays the sum of accounts A and B, is also executed. If both transactions run in parallel, the result may be inconsistent: if T2 reads A after T1 has debited it (4000) but reads B before T1 has credited it (8000), it shows Rs. 12,000 as the sum of accounts A and B instead of Rs. 13,000. The problem occurs because the concurrently running transaction T2 reads A and B at an intermediate point and computes their sum, which gives an inconsistent value.

10. Poor data modeling of the real world:
The file-based system is not able to represent complex data and interfile relationships, which results in poor data modeling properties.

The Database Approach

In the database approach, a single repository of data is maintained that is defined once and then accessed by various users. The following are the main characteristics of the database approach.

Self-describing nature of a database system
A fundamental characteristic of the database approach is that the database system contains not only the database itself but also a complete definition or description of the database structure and constraints. This definition is stored in the DBMS catalog, which contains information such as the structure of each file, the type and storage format of each data item, and various constraints on the data. The information stored in the catalog is called meta-data, and it describes the structure of the primary database.

Insulation between programs and data, and data abstraction
The structure of data files is stored in the DBMS catalog separately from the access programs. We call this property program-data independence. The characteristic that allows program-data independence and program-operation independence is called data abstraction. A DBMS provides users with a conceptual representation of data that does not include many of the details of how the data is stored or how the operations are implemented. Informally, a data model is a type of data abstraction that is used to provide this conceptual representation. The data model uses logical concepts, such as objects, their properties, and their interrelationships, that may be easier for most users to understand than computer storage concepts. Hence, the data model hides storage and implementation details that are not of interest to most database users.

Support of multiple views of the data
A database typically has many users, each of whom may require a different perspective or view of the database.
A view may be a subset of the database or it may contain virtual data that is derived from the database files but is not explicitly stored. Some users may not need to be aware of whether the data they refer to is stored or derived. A multiuser DBMS whose users have a variety of distinct applications must provide facilities for defining multiple views.

Sharing of data and multiuser transaction processing
A multiuser DBMS, as its name implies, must allow multiple users to access the database at the same time. This is essential if data for multiple applications is to be integrated and maintained in a single database. The DBMS must include concurrency control software to ensure that several users trying to update the same data do so in a controlled manner so that the result of the updates is correct.

Roles in the database environment

Database Administrators
In any organization where many persons use the same resources, there is a need for a chief administrator to oversee and manage these resources. In a database environment, the primary resource is the database itself, and the secondary resource is the DBMS and related software. Administering these resources is the responsibility of the database administrator (DBA). The DBA is responsible for authorizing access to the database, for coordinating and monitoring its use, and for acquiring software and hardware resources as needed. The DBA is accountable for problems such as breach of security or poor system response time. In large organizations, the DBA is assisted by a staff that helps carry out these functions.

Database Designers
Database designers are responsible for identifying the data to be stored in the database and for choosing appropriate structures to represent and store this data. These tasks are mostly undertaken before the database is actually implemented and populated with data.
It is the responsibility of database designers to communicate with all prospective database users in order to understand their requirements, and to come up with a design that meets these requirements. In many cases, the designers are on the staff of the DBA and may be assigned other staff responsibilities after the database design is completed. Database designers typically interact with each potential group of users and develop views of the database that meet the data and processing requirements of these groups. Each view is then analyzed and integrated with the views of other user groups. The final database design must be capable of supporting the requirements of all user groups.

End Users
End users are the people whose jobs require access to the database for querying, updating, and generating reports; the database primarily exists for their use. There are several categories of end users:

- Casual end users occasionally access the database, but they may need different information each time. They use a sophisticated database query language to specify their requests and are typically middle- or high-level managers or other occasional browsers.
- Naive or parametric end users make up a sizable portion of database end users. Their main job function revolves around constantly querying and updating the database, using standard types of queries and updates (called canned transactions) that have been carefully programmed and tested. The tasks that such users perform are varied: bank tellers check account balances and post withdrawals and deposits; reservation clerks for airlines, hotels, and car rental companies check availability for a given request and make reservations; clerks at receiving stations for courier mail enter package identifications via bar codes and descriptive information through buttons to update a central database of received and in-transit packages.
- Sophisticated end users include engineers, scientists, business analysts, and others who thoroughly familiarize themselves with the facilities of the DBMS so as to implement their applications to meet their complex requirements.
- Stand-alone users maintain personal databases by using ready-made program packages that provide easy-to-use menu-based or graphics-based interfaces. An example is the user of a tax package that stores a variety of personal financial data for tax purposes.

A typical DBMS provides multiple facilities to access a database. Naive end users need to learn very little about the facilities provided by the DBMS; they have to understand only the user interfaces of the standard transactions designed and implemented for their use. Casual users learn only a few facilities that they may use repeatedly. Sophisticated users try to learn most of the DBMS facilities in order to achieve their complex requirements. Stand-alone users typically become very proficient in using a specific software package.

System Analysts and Application Programmers (Software Engineers)
System analysts determine the requirements of end users, especially naive and parametric end users, and develop specifications for canned transactions that meet these requirements. Application programmers implement these specifications as programs; then they test, debug, document, and maintain these canned transactions. Such analysts and programmers, commonly referred to as software engineers, should be familiar with the full range of capabilities provided by the DBMS to accomplish their tasks.

In addition to those who design, use, and administer a database, others are associated with the design, development, and operation of the DBMS software and system environment. These persons are typically not interested in the database itself. We call them the "workers behind the scene," and they include the following categories.
- DBMS system designers and implementers are persons who design and implement the DBMS modules and interfaces as a software package. A DBMS is a very complex software system that consists of many components, or modules, including modules for implementing the catalog, processing query language, processing the interface, accessing and buffering data, controlling concurrency, and handling data recovery and security. The DBMS must interface with other system software, such as the operating system and compilers for various programming languages.
- Tool developers include persons who design and implement tools, the software packages that facilitate database system design and use and that help improve performance. Tools are optional packages that are often purchased separately. They include packages for database design, performance monitoring, natural language or graphical interfaces, prototyping, simulation, and test data generation. In many cases, independent software vendors develop and market these tools.
- Operators and maintenance personnel are the system administration personnel who are responsible for the actual running and maintenance of the hardware and software environment for the database system.

Although these categories of workers behind the scene are instrumental in making the database system available to end users, they typically do not use the database for their own purposes.

Advantages and disadvantages of using databases

1. Controlling redundancy: In a file system, each application has its own private files, which cannot be shared between multiple applications. This can often lead to considerable redundancy in the stored data, which wastes storage space. With a centralized database most of this can be avoided. Not all redundancy can or should be eliminated: sometimes there are sound business and technical reasons for maintaining multiple copies of the same data. In a database system, however, this redundancy can be controlled.

For example, in a college there may be a number of applications, such as General Office, Library, Account Office, and Hostel, each of which maintains its information in its own private files. In such a file system, some common student data, like Rollno, Name, Class, Phone_No, and Address, has to be recorded in each application. This causes redundancy, which wastes storage space and is difficult to maintain. With a centralized database, data can be shared by a number of applications and the whole college can maintain its computerized data in a single database. In such a database, Rollno, Name, Class, Father_Name, Address, Phone_No, and Date_of_birth, which are stored repeatedly by each application in the file system, need not be stored repeatedly, because every other application can access this information by joining relations on the basis of a common column, i.e. Rollno.
Suppose a user of the Library system needs the Name and Address of a particular student. By joining the Library and General Office relations on the column Rollno, he or she can easily retrieve this information.

Thus, a centralized DBMS reduces the redundancy of data to a great extent but cannot eliminate it, because Rollno is still repeated in all the relations.

2. Integrity can be enforced: Integrity of data means that the data in the database is always accurate, such that incorrect information cannot be stored in the database. In order to maintain the integrity of data, integrity constraints are enforced on the database. A DBMS should provide capabilities for defining and enforcing these constraints.

For example, consider the college database and suppose that the college has only BTech, MTech, MSc, BCA, BBA, and BCOM classes. If a user enters the class MCA, this incorrect information must not be stored in the database, and the user must be told that it is an invalid data entry. To enforce this, an integrity constraint must be applied to the class attribute of the student entity. In a file system this constraint must be enforced in every application separately (because all applications have a class field).

In a DBMS, this integrity constraint is applied only once, on the class field of the General Office table (because the class field appears only once in the whole database), and all other applications get the class information about the student from the General Office table, so the integrity constraint applies to the whole database. We can conclude that integrity constraints can be enforced more easily in a centralized DBMS than in a file system.

3. Inconsistency can be avoided: When the same data is duplicated and changes made at one site are not propagated to the other site, inconsistency arises and the two entries regarding the same data will not agree.
At such times the data is said to be inconsistent. So, if the redundancy is removed, the chance of having inconsistent data is also removed.

Let us again consider the college system and suppose that the General_Office file indicates that Roll_Number 5 lives in Amritsar but the Library file indicates that Roll_Number 5 lives in Jalandhar. This is a state in which the two entries for the same object do not agree with each other (that is, one is updated and the other is not). At such a time the database is said to be inconsistent.

An inconsistent database is capable of supplying incorrect or conflicting information, so there should be no inconsistency in a database. Inconsistency can be avoided far more easily in a centralized system than in a file system.

Consider again the college system and suppose that Roll Number 5 moves from Amritsar to Jalandhar. The address information of Roll Number 5 must then be updated wherever roll number and address occur in the system. In a file system, the information must be updated separately in each application; if we make the update in only three places and forget the fourth application, the system as a whole shows inconsistent results about Roll Number 5.

In a DBMS, roll number and address occur together only once, in the General_Office table. So a single update is needed, and every other application retrieves the address information from General_Office, which is already updated. All applications therefore get the current information from a single update operation, which is propagated to the whole database and all other applications automatically. This property is called propagation of update.

We can say that redundancy of data greatly affects the consistency of data. If there is less redundancy, it is easier to keep the data consistent.
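The single-update behaviour described above can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table layouts, the `book_title` column, the helper `library_address`, and the sample rows are assumptions made for the example, with only the names General_Office, Library, and Rollno taken from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The address is stored once, in the General Office table.
    CREATE TABLE general_office (
        rollno  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        address TEXT NOT NULL
    );
    -- The Library table repeats only the common column rollno.
    CREATE TABLE library (
        rollno     INTEGER REFERENCES general_office (rollno),
        book_title TEXT NOT NULL
    );
    INSERT INTO general_office VALUES (5, 'Student Five', 'Amritsar');
    INSERT INTO library VALUES (5, 'Database Concepts');
    """
)


def library_address(rollno):
    """Address as seen by the Library application, obtained by a join."""
    row = conn.execute(
        """
        SELECT g.address
        FROM library AS l
        JOIN general_office AS g ON g.rollno = l.rollno
        WHERE l.rollno = ?
        """,
        (rollno,),
    ).fetchone()
    return row[0]


before = library_address(5)

# One update, made in one place.
conn.execute("UPDATE general_office SET address = 'Jalandhar' WHERE rollno = 5")

after = library_address(5)
print(before, after)
```

Because the Library application holds no copy of the address, there is nothing for it to forget to update.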
Thus, a DBMS can avoid inconsistency to a great extent.

4. Data can be shared: As explained earlier, in a centralized DBMS the data about Name, Class, Father_Name, etc. held in General_Office is shared by multiple applications, unlike in a file system, so new applications can be developed to operate against the same stored data. The applications may be developed without having to create any new stored files.

5. Standards can be enforced: Since a DBMS is a central system, standards can be enforced easily, whether at company, department, national, or international level. Standardized data is very helpful during migration or interchange of data. A file system is an independent system, so standards cannot easily be enforced across multiple independent applications.

6. Restricting unauthorized access: When multiple users share a database, it is likely that some users will not be authorized to access all information in the database. For example, account office data is often considered confidential, and hence only authorized persons are allowed to access such data. In addition, some users may be permitted only to retrieve data, whereas others are allowed both to retrieve and to update. Hence, the type of access operation (retrieval or update) must also be controlled. Typically, users or user groups are given account numbers protected by passwords, which they can use to gain access to the database. A DBMS should provide a security and authorization subsystem, which the DBA uses to create accounts and to specify account restrictions. The DBMS should then enforce these restrictions automatically.

7. Solving enterprise requirements rather than individual requirements: Since many types of users with varying levels of technical knowledge use a database, a DBMS should provide a variety of user interfaces. The overall requirements of the enterprise are more important than the individual user requirements.
So, the DBA can structure the database system to provide an overall service that is "best for the enterprise". For example, a representation can be chosen for the data in storage that gives fast access for the most important application at the cost of poorer performance in some other application. The file system, by contrast, favors individual requirements over enterprise requirements.

8. Providing backup and recovery: A DBMS must provide facilities for recovering from hardware or software failures. The backup and recovery subsystem of the DBMS is responsible for recovery. For example, if the computer system fails in the middle of a complex update program, the recovery subsystem is responsible for making sure that the database is restored to the state it was in before the program started executing.

9. Cost of developing and maintaining the system is lower: It is much easier to respond to unanticipated requests when data is centralized in a database than when it is stored in a conventional file system. Although the initial cost of setting up a database can be large, the cost of developing and maintaining application programs can be far lower than for a similar service using conventional systems. The productivity of programmers can be higher when using the non-procedural languages that have been developed with DBMSs than when using procedural languages.

10. Data model can be developed: The centralized system is able to represent complex data and interfile relationships, which results in better data modeling properties. The data modeling properties of the relational model are based on entities and their relationships, which are discussed in detail in chapter 4 of the book.

11. Concurrency control: DBMSs provide mechanisms for concurrent access to data by multiple users.

Disadvantages of DBMS

The disadvantages of the database approach are summarized as follows:

1. Complexity: The provision of the functionality that is expected of a good DBMS makes the DBMS an extremely complex piece of software.
Database designers, developers, database administrators, and end users must understand this functionality to take full advantage of it. Failure to understand the system can lead to bad design decisions, which can have serious consequences for an organization.

2. Size: The complexity and breadth of functionality make the DBMS an extremely large piece of software, occupying many megabytes of disk space and requiring substantial amounts of memory to run efficiently.

3. Performance: Typically, a file-based system is written for a specific application, such as invoicing. As a result, performance is generally very good. The DBMS, however, is written to be more general, to cater for many applications rather than just one. The effect is that some applications may not run as fast as they used to.

4. Higher impact of a failure: The centralization of resources increases the vulnerability of the system. Since all users and applications rely on the availability of the DBMS, the failure of any component can bring operations to a halt.

5. Cost of DBMS: The cost of a DBMS varies significantly, depending on the environment and functionality provided. There is also the recurrent annual maintenance cost.

6. Additional hardware costs: The disk storage requirements for the DBMS and the database may necessitate the purchase of additional storage space. Furthermore, to achieve the required performance it may be necessary to purchase a larger machine, perhaps even a machine dedicated to running the DBMS. The procurement of additional hardware results in further expenditure.

7. Cost of conversion: In some situations, the cost of the DBMS and extra hardware may be insignificant compared with the cost of converting existing applications to run on the new DBMS and hardware. This cost also includes the cost of training staff to use these new systems and possibly the employment of specialist staff to help with the conversion and running of the system.
This cost is one of the main reasons why some organizations feel tied to their current systems and cannot switch to modern database technology.

Database Architecture
DBMSs do not all conform to the same architecture. The three-level architecture forms the basis of modern database architectures. This is in agreement with the ANSI/SPARC study group on Database Management Systems (ANSI/SPARC is the American National Standards Institute/Standards Planning and Requirements Committee). The architecture for DBMSs is divided into three general levels: external, conceptual and internal.
Three level database architecture

Figure 1: Three level architecture
1. The external level: concerned with the way individual users see the data.
2. The conceptual level: can be regarded as a community user view, a formal description of the data of interest to the organization, independent of any storage considerations.
3. The internal level: concerned with the way in which the data is actually stored.
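As a rough illustration of the three levels, the sketch below uses Python's built-in sqlite3 module. The mapping of levels onto SQL objects is a simplifying assumption of this example, not something the notes prescribe: a table stands in for the conceptual level, a view for one external level, and an index for an internal (storage) choice. The staff table, its columns and the sample row are hypothetical.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual level (assumed stand-in): the community view of staff data.
    CREATE TABLE staff (emp_no INTEGER PRIMARY KEY, name TEXT, salary REAL);
    -- Internal level (assumed stand-in): a storage choice users never see.
    CREATE INDEX staff_name_idx ON staff (name);
    -- External level (assumed stand-in): one user group's view, hiding salary.
    CREATE VIEW staff_directory AS SELECT emp_no, name FROM staff;
""")
conn.execute("INSERT INTO staff VALUES (10, 'Gordon', 2.00)")

directory_rows = conn.execute("SELECT * FROM staff_directory").fetchall()

# Changing the storage structure (dropping the index) leaves the view intact.
conn.execute("DROP INDEX staff_name_idx")
rows_after = conn.execute("SELECT * FROM staff_directory").fetchall()

print(directory_rows, rows_after)
```

The user of staff_directory sees the same rows before and after the index is dropped, which is the kind of insulation from storage changes that the three-level separation is meant to give.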

Figure 2 : How the three level architecture works

External View
A user is anyone who needs to access some portion of the data. Users may range from application programmers to casual users with ad hoc queries. Each user has a language at his/her disposal. The application programmer may use a high level language (e.g. COBOL), while the casual user will probably use a query language. Regardless of the language used, it will include a data sublanguage (DSL), which is that subset of the language concerned with storage and retrieval of information in the database; it may or may not be apparent to the user. A DSL is a combination of two languages:
- a data definition language (DDL), which provides for the definition or description of database objects
- a data manipulation language (DML), which supports the manipulation or processing of database objects.
Each user sees the data in terms of an external view. This is defined by an external schema, consisting basically of descriptions of each of the various types of external record in that external view, and also a definition of the mapping between the external schema and the underlying conceptual schema.

Conceptual View
An abstract representation of the entire information content of the database. It is in general a view of the data as it actually is, that is, it is a `model' of the `real world'. It consists of multiple occurrences of multiple types of conceptual record, defined in the conceptual schema. To achieve data independence, the definitions of conceptual records must involve information content only:
- storage structure is ignored
- access strategy is ignored
In addition to definitions, the conceptual schema contains authorization and validation procedures.

Internal View
The internal view is a low-level representation of the entire database consisting of multiple occurrences of multiple types of internal (stored) records.
It is however at one remove from the physical level, since it does not deal in terms of physical records or blocks, nor with any device-specific constraints such as cylinder or track sizes. Details of the mapping to physical storage are highly implementation specific and are not expressed in the three-level architecture. The internal view is described by the internal schema, which defines:
- the various types of stored record
- what indices exist
- how stored fields are represented
- what physical sequence the stored records are in
In effect the internal schema is the storage structure definition.

Mappings
The conceptual/internal mapping:
- defines the correspondence between the conceptual view and the internal view
- specifies the mapping from conceptual records to their stored counterparts
An external/conceptual mapping:
- defines the correspondence between a particular external view and the conceptual view
A change to the storage structure definition means that the conceptual/internal mapping must be changed accordingly, so that the conceptual schema may remain invariant, achieving physical data independence. A change to the conceptual definition means that the external/conceptual mapping must be changed accordingly, so that the external schema may remain invariant, achieving logical data independence.

Database languages
Once the design of a database is completed and a DBMS is chosen to implement the database, the first order of business is to specify conceptual and internal schemas for the database and any mappings between the two. In many DBMSs where no strict separation of levels is maintained, one language, called the data definition language (DDL), is used by the DBA and by database designers to define both schemas. The DBMS will have a DDL compiler whose function is to process DDL statements in order to identify descriptions of the schema constructs and to store the schema description in the DBMS catalog.
In DBMSs where a clear separation is maintained between the conceptual and internal levels, the DDL is used to specify the conceptual schema only. Another language, the storage definition language (SDL), is used to specify the internal schema. The mappings between the two schemas may be specified in either one of these languages. For a true three-schema architecture, we would need a third language, the view definition language (VDL), to specify user views and their mappings to the conceptual schema, but in most DBMSs the DDL is used to define both conceptual and external schemas.
Once the database schemas are compiled and the database is populated with data, users must have some means to manipulate the database. Typical manipulations include retrieval, insertion, deletion, and modification of the data. The DBMS provides a set of operations or a language called the data manipulation language (DML) for these purposes.
In current DBMSs, the preceding types of languages are usually not considered distinct languages; rather, a comprehensive integrated language is used that includes constructs for conceptual schema definition, view definition and data manipulation. Storage definition is typically kept separate, since it is used for defining physical storage structures to fine tune the performance of the database system, which is usually done by the DBA staff. A typical example of a comprehensive database language is the SQL relational database language, which represents a combination of DDL, VDL, and DML, as well as statements for constraint specification, schema evolution, and other features. The SDL was a component in early versions of SQL but has been removed from the language to keep it at the conceptual and external levels only.

Categories of Data Models
Many data models have been proposed, which we can categorize according to the types of concepts they use to describe the database structure.
High-level or conceptual data models provide concepts that are close to the way many users perceive data, whereas low-level or physical data models provide concepts that describe the details of how data is stored in the computer. Concepts provided by low-level data models are generally meant for computer specialists, not for typical end users. Between these two extremes is a class of representational (or implementation) data models, which provide concepts that may be understood by end users but that are not too far removed from the way data is organized within the computer. Representational data models hide some details of data storage but can be implemented on a computer system in a direct way.
Conceptual data models use concepts such as entities, attributes, and relationships. An entity represents a real-world object or concept, such as an employee or a project, that is described in the database. An attribute represents some property of interest that further describes an entity, such as the employee's name or salary. A relationship represents an association among two or more entities, for example, a works-on relationship between an employee and a project.
Representational or implementation data models are the models used most frequently in traditional commercial DBMSs. These include the widely used relational data model, as well as the so-called legacy data models (the network and hierarchical models) that have been widely used in the past. Representational data models represent data by using record structures and hence are sometimes called record-based data models.
We can regard object data models as a new family of higher-level implementation data models that are closer to conceptual data models. Object data models are also frequently utilized as high-level conceptual models, particularly in the software engineering domain.
Physical data models describe how data is stored as files in the computer by representing information such as record formats, record orderings, and access paths. An access path is a structure that makes the search for particular database records efficient.

Conceptual modelling
The Conceptual Design phase takes the high-level data model and converts it into a conceptual schema, which is specific to a particular DBMS class (e.g. relational). For a relational system, such as Oracle, an appropriate conceptual schema would be a set of relations. Finally, in the Physical Design phase the conceptual schema is converted into database internal structures. This is specific to a particular DBMS product.

Basics
Entity Relationship (ER) modelling:
- is a design tool
- is a graphical representation of the database system
- provides a high-level conceptual data model
- supports the user's perception of the data
- is DBMS and hardware independent
- has many variants
- is composed of entities, attributes, and relationships

Entities
- An entity is any object in the system that we want to model and store information about.
- Individual objects are called entities.
- Groups of the same type of objects are called entity types or entity sets.
- Entities are represented by rectangles (either with round or square corners).

Figure: Entities
There are two types of entities: weak and strong entity types.

Attribute
All the data relating to an entity is held in its attributes. An attribute is a property of an entity. Each attribute can have any value from its domain. Each entity within an entity type:
- may have any number of attributes
- can have different attribute values from those of any other entity
- has the same number of attributes.
Attributes can be:
- simple or composite
- single-valued or multi-valued
Attributes can be shown on ER models. They appear inside ovals and are attached to their entity. Note that entity types can have a large number of attributes. If all are shown then the diagrams would be confusing. Only show an attribute if it adds information to the ER diagram, or clarifies a point.

Figure : Attributes

Keys
- A key is a data item that allows us to uniquely identify individual occurrences of an entity type.
- A candidate key is an attribute or set of attributes that uniquely identifies individual occurrences of an entity type.
- An entity type may have one or more possible candidate keys; the one which is selected is known as the primary key.
- A composite key is a candidate key that consists of two or more attributes.
- The name of each primary key attribute is underlined.

Relationships
- A relationship type is a meaningful association between entity types.
- A relationship is an association of entities where the association includes one entity from each participating entity type.
- Relationship types are represented on the ER diagram by a series of lines.
- As always, there are many notations in use today.
- In the original Chen notation, the relationship is placed inside a diamond, e.g. managers manage employees:

Figure : Chen's notation for relationships
For this module, we will use an alternative notation, where the relationship is a label on the line. The meaning is identical.

Figure : Relationships used in this document

Degree of a Relationship
The number of participating entities in a relationship is known as the degree of the relationship. If there are two entity types involved it is a binary relationship type.

Figure : Binary Relationships
If there are three entity types involved it is a ternary relationship type.

Figure : Ternary relationship
It is possible to have an n-ary relationship of any degree (e.g. quaternary or unary). A unary relationship is also known as a recursive relationship.

Figure : Recursive relationship
It is a relationship where the same entity participates more than once in different roles. In the example above we are saying that employees are managed by employees. If we wanted more information about who manages whom, we could introduce a second entity type called manager.

Degree of a Relationship
It is also possible to have entities associated through two or more distinct relationships.

Figure : Multiple relationships
In the representation we use it is not possible to have attributes as part of a relationship. To support this, other entity types need to be developed.

Replacing ternary relationships
When ternary relationships occur in an ER model they should always be removed before finishing the model. Sometimes the relationships can be replaced by a series of binary relationships that link pairs of the original ternary relationship.

Figure : A ternary relationship example
This can result in the loss of some information - it is no longer clear which sales assistant sold a customer a particular product. Try replacing the ternary relationship with an entity type and a set of binary relationships. Relationships are usually verbs, so name the new entity type by the relationship verb rewritten as a noun. The relationship sells can become the entity type sale.

Figure : Replacing a ternary relationship
So a sales assistant can be linked to a specific customer and both of them to the sale of a particular product. This process also works for higher order relationships.

Cardinality
Relationships are rarely one-to-one. For example, a manager usually manages more than one employee. This is described by the cardinality of the relationship, for which there are four possible categories:
- One to one (1:1) relationship
- One to many (1:m) relationship
- Many to one (m:1) relationship
- Many to many (m:n) relationship
On an ER diagram, if the end of a relationship is straight, it represents 1, while a "crow's foot" end represents many.
A one to one relationship - a man can only marry one woman, and a woman can only marry one man, so it is a one to one (1:1) relationship

Figure : One to One relationship example
A one to many relationship - one manager manages many employees, but each employee only has one manager, so it is a one to many (1:m) relationship

Figure : One to Many relationship example
A many to one relationship - many students study one course. They do not study more than one course, so it is a many to one (m:1) relationship

Figure : Many to One relationship example
A many to many relationship - one lecturer teaches many students and a student is taught by many lecturers, so it is a many to many (m:n) relationship

Figure : Many to Many relationship example
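The four cardinality categories can be checked against sample data. The following is a minimal Python sketch, assuming a binary relationship is given as a list of (left, right) pairs of entity identifiers; the function name classify_cardinality and the sample identifiers are hypothetical. Sample data can only show the cardinality actually observed; the intended cardinality still has to come from the requirements.

```python
from collections import defaultdict


def classify_cardinality(pairs):
    """Classify a binary relationship from observed (left, right) pairs."""
    rights_per_left = defaultdict(set)
    lefts_per_right = defaultdict(set)
    for left, right in pairs:
        rights_per_left[left].add(right)
        lefts_per_right[right].add(left)

    # Does any left entity link to more than one right entity, and vice versa?
    left_has_many = any(len(r) > 1 for r in rights_per_left.values())
    right_has_many = any(len(l) > 1 for l in lefts_per_right.values())

    if left_has_many and right_has_many:
        return "m:n"
    if left_has_many:
        return "1:m"
    if right_has_many:
        return "m:1"
    return "1:1"


# Hypothetical samples mirroring the examples in the text.
manages = [("mgr1", "emp1"), ("mgr1", "emp2"), ("mgr2", "emp3")]
studies = [("stu1", "course1"), ("stu2", "course1")]
teaches = [("lect1", "stu1"), ("lect1", "stu2"), ("lect2", "stu1")]

print(classify_cardinality(manages))
print(classify_cardinality(studies))
print(classify_cardinality(teaches))
```

With these samples, manager-employee comes out as 1:m, student-course as m:1 and lecturer-student as m:n, matching the examples above.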

Optionality
A relationship can be optional or mandatory.
- If the relationship is mandatory, an entity at one end of the relationship must be related to an entity at the other end.
- The optionality can be different at each end of the relationship.
- For example, a student must be on a course. This is mandatory, so the relationship `student studies course' is mandatory.
- But a course can exist before any students have enrolled. Thus the relationship `course is_studied_by student' is optional.
- To show optionality, put a circle or `O' at the `optional end' of the relationship.
- As the optional relationship is `course is_studied_by student', and the optional part of this is the student, the `O' goes at the student end of the relationship connection.

Figure : Optionality example
It is important to know the optionality because you must ensure that whenever you create a new entity it has the required mandatory links.

Entity Sets
Sometimes it is useful to try out various examples of entities from an ER model. One reason for this is to confirm the correct cardinality and optionality of a relationship. We use an `entity set diagram' to show entity examples graphically. Consider the example of `course is_studied_by student'.

Figure : Entity set example

Confirming Correctness

Figure : Entity set confirming errors
- Use the diagram to show all possible relationship scenarios.
- Go back to the requirements specification and check to see if they are allowed.
- If not, then put a cross through the forbidden relationships.
- This allows you to show the cardinality and optionality of the relationship.

Deriving the relationship parameters
To check we have the correct parameters (sometimes also known as the degree) of a relationship, ask two questions:
1. One course is studied by how many students? Answer = `zero or more'. This gives us the degree at the `student' end. The answer `zero or more' needs to be split into two parts. The `more' part means that the cardinality is `many'. The `zero' part means that the relationship is `optional'. If the answer was `one or more', then the relationship would be `mandatory'.
2. One student studies how many courses? Answer = `one'. This gives us the degree at the `course' end of the relationship. The answer `one' means that the cardinality of this relationship is 1, and it is `mandatory'. If the answer had been `zero or one', then the cardinality of the relationship would have been 1, and it would be `optional'.

Redundant relationships
Some ER diagrams end up with a relationship loop.
- Check to see if it is possible to break the loop without losing information.
- Given three entities A, B, C, where there are relationships A-B, B-C, and C-A, check if it is possible to navigate between A and C via B. If it is possible, then A-C was a redundant relationship.
- Always check carefully for ways to simplify your ER diagram. It makes it easier to read the remaining information.

Redundant relationships example
Consider entities `customer' (customer details), `address' (the address of a customer) and `distance' (distance from the company to the customer address).

Figure : Redundant relationship

Splitting n:m Relationships
A many to many relationship in an ER model is not necessarily incorrect. It can be replaced using an intermediate entity. This should only be done where:
- the m:n relationship hides an entity
- the resulting ER diagram is easier to understand.

Splitting n:m Relationships - Example
Consider the case of a car hire company. Customers hire cars: one customer hires many cars and a car is hired by many customers.

Figure : Many to Many example
The many to many relationship can be broken down to reveal a `hire' entity, which contains an attribute `date of hire'.

Figure : Splitting the Many to Many example

Constructing an ER model
Before beginning to draw the ER model, read the requirements specification carefully. Document any assumptions you need to make.
1. Identify entities - list all potential entity types. These are the objects of interest in the system. It is better to put too many entities in at this stage and then discard them later if necessary.
2. Remove duplicate entities - check whether they are really separate entity types or just two names for the same thing. Also do not include the system as an entity type, e.g. if modelling a library, the entity types might be books, borrowers, etc. The library is the system, and thus should not be an entity type.
3. List the attributes of each entity (all properties to describe the entity which are relevant to the application). Ensure that the entity types are really needed. Are any of them just attributes of another entity type? If so, keep them as attributes and cross them off the entity list. Do not have attributes of one entity as attributes of another entity!
4. Mark the primary keys. Which attributes uniquely identify instances of that entity type? This may not be possible for some weak entities.
5. Define the relationships. Examine each entity type to see its relationship to the others.
6. Describe the cardinality and optionality of the relationships. Examine the constraints between participating entities.
7. Remove redundant relationships. Examine the ER model for redundant relationships.
ER modelling is an iterative process, so draw several versions, refining each one until you are happy with it. Note that there is no one right answer to the problem, but some solutions are better than others!

Entity Relationship Modelling - 2

Country Bus Company
A Country Bus Company owns a number of buses. Each bus is allocated to a particular route, although some routes may have several buses. Each route passes through a number of towns.
One or more drivers are allocated to each stage of a route, which corresponds to a journey through some or all of the towns on a route. Some of the towns have a garage where buses are kept. Each bus is identified by its registration number and can carry a different number of passengers, since the vehicles vary in size and can be single or double-decked. Each route is identified by a route number and information is available on the average number of passengers carried per day for each route. Drivers have an employee number, name, address, and sometimes a telephone number.

Entities
- Bus - The company owns buses and will hold information about them.
- Route - Buses travel on routes, and these will need to be described.
- Town - Buses pass through towns, and we need to know about them.
- Driver - The company employs drivers; personnel will hold their data.
- Stage - Routes are made up of stages.
- Garage - A garage houses buses, and we need to know where they are.

Relationships
- A bus is allocated to a route and a route may have several buses. bus-route (m:1) is serviced by
- A route comprises one or more stages. route-stage (1:m) comprises
- One or more drivers are allocated to each stage. driver-stage (m:1) is allocated
- A stage passes through some or all of the towns on a route. stage-town (m:n) passes-through
- A route passes through some or all of the towns. route-town (m:n) passes-through
- Some of the towns have a garage. garage-town (1:1) is situated
- A garage keeps buses and each bus has one `home' garage. garage-bus (1:m) is garaged

Draw E-R Diagram

Figure : Bus Company

Attributes
Bus (reg-no, make, size, deck, no-pass)
Route (route-no, avg-pass)
Driver (emp-no, name, address, tel-no)
Town (name)
Stage (stage-no)
Garage (name, address)

Problems with ER Models
There are several problems that may arise when designing a conceptual data model. These are known as connection traps. There are two main types of connection traps:
1. fan traps
2. chasm traps

Fan traps
A fan trap occurs when a model represents a relationship between entity types, but the pathway between certain entity occurrences is ambiguous. It occurs when 1:m relationships fan out from a single entity.

Figure : Fan Trap
A single site contains many departments and employs many staff. However, which staff work in a particular department? The fan trap is resolved by restructuring the original ER model to represent the correct association.

Figure : Resolved Fan Trap

Chasm traps
A chasm trap occurs when a model suggests the existence of a relationship between entity types, but the pathway does not exist between certain entity occurrences. It occurs where there is a relationship with partial participation, which forms part of the pathway between entities that are related.

Figure : Chasm Trap
- A single branch is allocated many staff who oversee the management of properties for rent. Not all staff oversee property and not all property is managed by a member of staff.
- What properties are available at a branch?
- The partial participation of Staff and Property in the oversees relationship means that some properties cannot be associated with a branch office through a member of staff.
- We need to add the missing relationship, which is called `has', between the Branch and the Property entities.
- You therefore need to be careful when you remove relationships which you consider to be redundant.
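The chasm trap can be made concrete with a few lines of Python. This is only a sketch under assumed sample data: the identifiers B1, S1, S2, P1 and P2, and the dictionary and function names, are hypothetical and do not come from the notes.

```python
# Hypothetical occurrences for the Branch-Staff-Property example.
staff_at_branch = {"B1": ["S1", "S2"]}
# Partial participation: S2 oversees nothing, and P2 has no overseeing staff.
properties_overseen = {"S1": ["P1"], "S2": []}


def properties_via_staff(branch):
    """Follow Branch -> Staff -> Property, the only path in the original model."""
    return {
        prop
        for staff in staff_at_branch.get(branch, [])
        for prop in properties_overseen.get(staff, [])
    }


# The resolved model adds a direct Branch `has' Property relationship.
branch_has = {"B1": ["P1", "P2"]}


def properties_direct(branch):
    """Follow the added Branch -> Property relationship."""
    return set(branch_has.get(branch, []))


# Properties that the original pathway cannot reach.
missing = properties_direct("B1") - properties_via_staff("B1")
print(missing)
```

Property P2 belongs to branch B1 but is invisible when navigating through Staff, which is exactly the gap that the added `has' relationship closes.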

Figure : Resolved Chasm Trap

Enhanced ER Models (EER)
The basic concepts of ER modelling are not powerful enough for some complex applications. We require some additional semantic modelling concepts:
- Specialisation
- Generalisation
- Categorisation
- Aggregation
First we need some new entity constructs.
- Superclass - an entity type that includes distinct subclasses that need to be represented in a data model.
- Subclass - an entity type that has a distinct role and is also a member of a superclass.

Figure : Superclass and subclasses
Subclasses need not be mutually exclusive; a member of staff may be a manager and a sales person.
The purpose of introducing superclasses and subclasses is to avoid describing types of staff with possibly different attributes within a single entity. This could waste space, and you might want to make some attributes mandatory for some types of staff when other staff would not need these attributes at all.

Specialisation
This is the process of maximising the differences between members of an entity by identifying their distinguishing characteristics.
Staff(staff_no, name, address, dob)
Manager(bonus)
Secretary(wp_skills)
Sales_personnel(sales_area, car_allowance)

Figure : Specialisation in action
Here we have shown that the manages relationship is only applicable to the Manager subclass, whereas the works_for relationship is applicable to all staff. It is possible to have subclasses of subclasses.

Generalisation
Generalisation is the process of minimising the differences between entities by identifying common features. This is the identification of a generalised superclass from the original subclasses, by identifying the common attributes and relationships. For instance, taking:
car(regno, colour, make, model, numSeats)
motorbike(regno, colour, make, model, hasWindshield)
And forming:
vehicle(regno, colour, make, model, numSeats, hasWindshield)
In this case vehicle has numSeats, which would be NULL if the vehicle was a motorbike, and has hasWindshield, which would be NULL if it was a car.

Mapping ER Models into Relations
What is a relation?
A relation is a table that holds the data we are interested in. It is two-dimensional and has rows and columns. Each entity type in the ER model is mapped into a relation.
- The attributes become the columns.
- The individual entities become the rows.
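To see both ideas at once (a relation as rows and columns, and the NULLs that the generalised vehicle entity introduces), here is a small sketch using Python's built-in sqlite3 module. The column types and the two sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One relation for the generalised vehicle entity type.
    CREATE TABLE vehicle (
        regno         TEXT PRIMARY KEY,
        colour        TEXT,
        make          TEXT,
        model         TEXT,
        numSeats      INTEGER,  -- NULL when the vehicle is a motorbike
        hasWindshield INTEGER   -- 0 or 1; NULL when the vehicle is a car
    );
    -- Hypothetical sample entities: each one becomes a row.
    INSERT INTO vehicle VALUES ('CAR1', 'red', 'MakeA', 'ModelA', 5, NULL);
    INSERT INTO vehicle VALUES ('BIKE1', 'black', 'MakeB', 'ModelB', NULL, 1);
""")

rows = conn.execute(
    "SELECT regno, numSeats, hasWindshield FROM vehicle ORDER BY regno"
).fetchall()
for row in rows:
    print(row)
```

Python reports SQL NULL as None, so the motorbike row has None for numSeats and the car row has None for hasWindshield.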

Figure : a relation
Relations can be represented textually as:
tablename(primary key, attribute 1, attribute 2, ... , foreign key)
If matric_no was the primary key, and there were no foreign keys, then the table above could be represented as:
student(matric_no, name, address, date_of_birth)
When referring to relations or tables, cardinality is considered to be the number of rows in the relation or table, and arity is the number of columns in a table or attributes in a relation.

Foreign keys
A foreign key is an attribute (or group of attributes) that is the primary key of another relation.
- Roughly, each foreign key represents a relationship between two entity types.
- They are added to relations as we go through the mapping process.
- They allow the relations to be linked together.
- A relation can have several foreign keys.
- It will generally have a foreign key from each table that it is related to.
- Foreign keys are usually shown in italics or with a wiggly underline.

Preparing to map the ER model
Before we start the actual mapping process we need to be certain that we have simplified the ER model as much as possible. This is the ideal time to check the model, as it is really the last chance to make changes to the ER model without causing major complications.

Mapping 1:1 relationships
Before tackling a 1:1 relationship, we need to know its optionality. There are three possibilities. The relationship can be:
1. mandatory at both ends
2. mandatory at one end and optional at the other
3. optional at both ends

Mandatory at both ends
If the relationship is mandatory at both ends it is often possible to subsume one entity type into the other.
- The choice of which entity type subsumes the other depends on which is the more important entity type (more attributes, better key, semantic nature of them).
- The result of this amalgamation is that all the attributes of the `swallowed up' entity become attributes of the more important entity.
- The key of the subsumed entity type becomes a normal attribute.
- If there are any attributes in common, the duplicates are removed.
- The primary key of the new combined entity is usually the same as that of the original more important entity type.

When not to combine
There are a few reasons why you might not combine a 1:1 mandatory relationship:
- the two entity types represent different entities in the `real world'.
- the entities participate in very different relationships with other entities.
- efficiency considerations when fast responses are required or different patterns of updating occur to the two different entity types.

If not combined...
If the two entity types are kept separate then the association between them must be represented by a foreign key.
- The primary key of one entity type becomes the foreign key in the other.
- It does not matter which way around it is done, but you should not have a foreign key in each entity.

Example
Two entity types: staff and contract. Each member of staff must have one contract and each contract must have one member of staff associated with it. It is therefore a mandatory relationship at both ends.

Figure : 1:1 mandatory relationship
These two entity types could be amalgamated into one:
Staff(emp_no, name, cont_no, start, end, position, salary)
or kept apart and a foreign key used:
Staff(emp_no, name, contract_no)
Contract(cont_no, start, end, position, salary)
or
Staff(emp_no, name)
Contract(cont_no, start, end, position, salary, emp_no)

Mandatory Optional
The entity type of the optional end may be subsumed into the mandatory end as in the previous example. It is better NOT to subsume the mandatory end into the optional end as this will create null entries.

Figure : 1:1 with 1 optional end
Suppose we add to the specification that each staff member may have at most one contract (thus making the relationship optional at one end). There are two ways to place the foreign key:
- Map the foreign key into Staff - the key is null for staff without a contract.
Staff(emp_no, name, contract_no)
Contract(cont_no, start, end, position, salary)
- Map the foreign key into Contract - emp_no is mandatory thus never null.
Staff(emp_no, name)
Contract(cont_no, start, end, position, salary, emp_no)

Example
Consider this example:
- Staff Gordon, empno 10, contract no 11.
- Staff Andrew, empno 11, no contract.
- Contract 11, from 1st Jan 2001 to 10th Jan 2001, lecturer, on 2.00 a year.

Foreign key in Staff:

Contract Table:
Cont_no  Start         End            Position  Salary
11       1st Jan 2001  10th Jan 2001  Lecturer  2.00

Staff Table:
Empno  Name    Contract No
10     Gordon  11
11     Andrew  NULL

Foreign key in Contract:

Contract Table:
Cont_no  Start         End            Position  Salary  Empno
11       1st Jan 2001  10th Jan 2001  Lecturer  2.00    10

Staff Table:
Empno  Name
10     Gordon
11     Andrew

As you can see, both ways store the same information, but the second way has no NULLs.

Mandatory Optional - Subsume?
The reasons for not subsuming are the same as before, with the following additional reason:
- very few of the entities from the mandatory end are involved in the relationship. This could cause a lot of wasted space with many blank or null entries.

Figure : 1 optional end
If only a few lecturers manage courses and Course is subsumed into Lecturer, then there would be many null entries in the table.
Lecturer(lect_no, l_name, cno, c_name, type, yr_vetted, external)
It would be better to keep them separate.
Lecturer(lect_no, l_name)
Course(cno, c_name, type, yr_vetted, external, lect_no)

Summary...
So for 1:1 optional relationships, take the primary key from the `mandatory end' and add it to the `optional end' as a foreign key. So, given entity types A and B, where A-B is a relationship in which the A end is optional, the result would be:
A (primary key, attribute, ..., foreign key to B)
B (primary key, attribute, ...)
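The Staff/Contract example from earlier, with the foreign key placed in Contract, could be sketched with Python's built-in sqlite3 module as follows. Treat it as one possible rendering under stated assumptions: the NOT NULL and UNIQUE constraints are this example's way of expressing "mandatory" and "at most one", the start, end and salary columns are omitted for brevity, and the second contract (number 12, `Tutor') is a hypothetical row used only to show the constraint firing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE staff (
        emp_no INTEGER PRIMARY KEY,
        name   TEXT NOT NULL
    );
    CREATE TABLE contract (
        cont_no  INTEGER PRIMARY KEY,
        position TEXT,
        -- NOT NULL: every contract has a member of staff (mandatory).
        -- UNIQUE: a member of staff has at most one contract (1:1).
        emp_no   INTEGER NOT NULL UNIQUE REFERENCES staff (emp_no)
    );
    INSERT INTO staff VALUES (10, 'Gordon'), (11, 'Andrew');
    INSERT INTO contract VALUES (11, 'Lecturer', 10);
""")

# No NULL is stored anywhere; it appears only when we ask for staff
# together with their (possibly absent) contract.
rows = conn.execute("""
    SELECT s.name, c.cont_no
    FROM staff AS s LEFT JOIN contract AS c ON c.emp_no = s.emp_no
    ORDER BY s.emp_no
""").fetchall()

# A second contract for the same member of staff violates UNIQUE.
try:
    conn.execute("INSERT INTO contract VALUES (12, 'Tutor', 10)")
    second_contract_rejected = False
except sqlite3.IntegrityError:
    second_contract_rejected = True

print(rows, second_contract_rejected)
```

Placing the key in Contract keeps the stored tables free of NULLs, as the tables in the example show, while the join still reports that Andrew has no contract.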

Optional at both ends...
Such examples cannot be amalgamated as you could not select a primary key. Instead, one foreign key is used as before.

Figure : 2 optional end
- Each staff member may lease up to one car.
- Each car may be leased by at most one member of staff.
- If these were combined together...
Staff_car(emp_no, name, reg_no, year, make, type, colour)
what would be the primary key?
- If emp_no is used then all the cars which are not being leased will not have a key.
- Similarly, if the reg_no is used, all the staff not leasing a car will not have a key.
- A compound key will not work either.

Mapping 1:m relationships
To map 1:m relationships, the primary key on the `one side' of the relationship is added to the `many side' as a foreign key. For example, the 1:m relationship `course-student':

Figure : Mapping 1:m relationships
Assuming that the entity types have the following attributes:
Course(course_no, c_name)
Student(matric_no, st_name, dob)
Then after mapping, the following relations are produced:
Course(course_no, c_name)
Student(matric_no, st_name, dob, course_no)
If an entity type participates in several 1:m relationships, then you apply the rule to each relationship, and add foreign keys as appropriate.

Mapping n:m relationships
If you have some m:n relationships in your ER model then these are mapped in the following manner.
- A new relation is produced which contains the primary keys from both sides of the relationship.
- These primary keys form a composite primary key.

Figure : Mapping n:m relationships
Thus
Student(matric_no, st_name, dob)
Module(module_no, m_name, level, credits)
becomes
Student(matric_no, st_name, dob)
Module(module_no, m_name, level, credits)
Studies(matric_no, module_no)
This is equivalent to:

Figure : After Mapping a n:m relationship
Student(matric_no,st_name,dob)
Module(module_no,m_name,level,credits)
Study()
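The m:n mapping rule can be tried out with Python's built-in sqlite3 module. The sketch below assumes the Student, Module and Studies relations given in the text; the column types and sample rows are hypothetical, and the level and credits attributes of Module are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE student (
        matric_no INTEGER PRIMARY KEY,
        st_name   TEXT,
        dob       TEXT
    );
    CREATE TABLE module (
        module_no TEXT PRIMARY KEY,
        m_name    TEXT
    );
    -- New relation for the m:n relationship: both primary keys,
    -- together forming a composite primary key.
    CREATE TABLE studies (
        matric_no INTEGER REFERENCES student (matric_no),
        module_no TEXT REFERENCES module (module_no),
        PRIMARY KEY (matric_no, module_no)
    );
    INSERT INTO student VALUES (1, 'Student A', '2000-01-01'),
                               (2, 'Student B', '2000-02-02');
    INSERT INTO module VALUES ('M1', 'Module One'), ('M2', 'Module Two');
    INSERT INTO studies VALUES (1, 'M1'), (1, 'M2'), (2, 'M1');
""")

# One student studies many modules...
modules_for_1 = [r[0] for r in conn.execute(
    "SELECT module_no FROM studies WHERE matric_no = 1 ORDER BY module_no")]
# ...and one module is studied by many students.
students_on_m1 = [r[0] for r in conn.execute(
    "SELECT matric_no FROM studies WHERE module_no = 'M1' ORDER BY matric_no")]

# The composite key stops the same pairing being recorded twice.
try:
    conn.execute("INSERT INTO studies VALUES (1, 'M1')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(modules_for_1, students_on_m1, duplicate_rejected)
```

Each row of studies records one student-module pairing, so the many-to-many relationship becomes two one-to-many relationships into the new relation.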
