

LAB FOR LU12 – PHYSICAL DATABASE DESIGN AND DATA CONVERSION

In this lab you will learn the basics of physical database design and learn how to use SQL to perform data conversion techniques.

LEARNING OBJECTIVES

Describe the steps in the physical design phase

Explain the various data storage formats for attributes

Explain the appropriate conditions to de-normalize

Describe sequential, indexed and direct (hashed) file organizations

Explain when to use indexing as a method for improving performance

Explain the issues associated with data migration

Describe various techniques for migrating data in SQL implementations

LAB GOALS

Our lab goals are to:

1. Learn how to import data into SQL Server.

2. Use SQL to cleanse, transform and load our source data into the target tables.

3. Use SQL to perform data verification.

This lab consists of two parts. In part one, we will walk you through a data migration scenario. In part two, you will attempt a data migration on your own.

WHAT YOU WILL NEED TO BEGIN

1. Microsoft SQL Server Management Studio (for executing SQL queries).

2. Microsoft Access 2003 or higher for importing data.

3. Your account credentials on the ISTSTUDENTS server.

4. The lab zip file containing the data to import and the target SQL schema.

PART 1: DATA MIGRATION WALK-THRU

SCENARIO

You have been hired by fictitious book-selling giant Fudgenbooks to improve the design of their existing database. Fudgenbooks currently keeps inventory in a spreadsheet, which is far from ideal. You normalized their existing database and created the SQL for the new data model. The last step is to migrate the data into the new data model.

Step one: Import Source Data into SQL Server

Fudgenbooks keeps their existing database in a spreadsheet. Your first task will be to import that spreadsheet as-is into your SQL Server database. One of the things Microsoft Access does very well is connect to other data sources. This makes Access a very versatile tool for assisting in data migration.

Create a new project in Microsoft Access. Access Data projects are different from Access Databases in that the back-end data is in SQL Server. This creates a two-tier client/server model and a distributed database with location transparency.


1. Open up Microsoft Access from the Start menu. From the Office menu, select New, then click on [Blank Database].

2. Click on the folder icon near the filename to browse for the destination folder.

3. Select your folder, then click on the [Save as type] drop-down and select the [Microsoft Access Projects *.adp] option. Then type a filename (no need to indicate the extension). Notice the name and the .adp extension. Click on [Create].

4. You will be asked whether you want to connect to an existing SQL Server database. Select Yes.


5. Complete the connection details using the sample below: make sure the database listed is your own, be sure to check the option shown in the sample, and test your connection before clicking OK.

1. From the menu, select External Data → Excel. In the import dialog, change the file type to Excel and select the fudgenbooks.xls file (the file is available in the LMS).

When you’re ready, select Import.

2. Follow along with the Import Spreadsheet Wizard. Check First row contains column headings, then click Next. Choose to store the data in a new table and click Next. Name the table fudgenbooks_import and click Finish.


3. If you are successful, you should have a table in your database called fudgenbooks_import, which looks like this (yes, it is not a lot of data, but it is enough!)

At this point, you no longer need Microsoft Access. It has done its job by helping us get the spreadsheet data into SQL Server.

Step two: Create target database schema

In this next step you will use SQL Server Management Studio to create the target database schema (tables, keys and constraints). It should be noted that this is not a typical step in the data migration process, since usually the target schema already exists.

1. Open SQL Server Management Studio and connect to your SQL server instance.

2. From the menu, choose File → Open File. Open the file from the zip archive called create_fudgenbooks_schema.sql. After the file opens, execute the SQL script to create the 4 fb_* tables. (A sketch of what this script likely defines appears after this list.)

3. Create a database diagram by adding all the tables that begin with fb_. This should produce a diagram like the following:

4. Save the diagram as Fudgenbooks.
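For reference, here is a minimal sketch of what create_fudgenbooks_schema.sql likely defines, inferred from the columns used in the INSERT and verification queries later in this lab. The exact names, data types, lengths, and constraints in your file may differ, so treat this as an illustration rather than the authoritative script:

-- Hedged sketch of the target schema, not the actual course file.
-- Column names are inferred from later queries; types and lengths are guesses.
CREATE TABLE fb_publishers (
    pub_id   int          NOT NULL PRIMARY KEY,
    pub_name varchar(100) NOT NULL,
    pub_url  varchar(255) NULL
);

CREATE TABLE fb_bookstore (
    isbn   int           NOT NULL PRIMARY KEY,
    title  varchar(100)  NOT NULL,
    price  decimal(8, 2) NULL,
    pages  int           NULL,
    pub_id int           NOT NULL REFERENCES fb_publishers (pub_id)
);

CREATE TABLE fb_authors (
    author varchar(100) NOT NULL,
    isbn   int          NOT NULL REFERENCES fb_bookstore (isbn),
    PRIMARY KEY (author, isbn)
);

CREATE TABLE fb_subjects (
    subj varchar(50) NOT NULL,
    isbn int         NOT NULL REFERENCES fb_bookstore (isbn),
    PRIMARY KEY (subj, isbn)
);

Notice how the foreign keys force the load order used in step four: fb_publishers must be populated before fb_bookstore, and fb_bookstore before fb_authors and fb_subjects.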

Step three: Data cleansing

The next step in the migration process is data cleansing. In this step you “clean up” the source data so that things will be nice and neat when we do the data transformations. What do we need to clean up? Take a look at the following:

To clean the data up, we run SQL UPDATE statements on the source data. Open a new query window in SQL Server Management Studio and type the following:

UPDATE fudgenbooks_import SET Subj1 = 'MySQL' WHERE lower(Subj1) = 'mysql';

UPDATE fudgenbooks_import SET Subj2 = 'MySQL' WHERE lower(Subj2) = 'mysql';

UPDATE fudgenbooks_import SET Subj1 = 'MSSQL' WHERE lower(Subj1) = 'mssql';

UPDATE fudgenbooks_import SET Subj2 = 'MSSQL' WHERE lower(Subj2) = 'mssql';


The Subj1 and Subj2 data here represents the same subject, but it is represented differently (‘mysql’ vs. ‘MySQL’). This needs to be corrected before we transform the data.


The resultant source data changes to look like this:
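If you want to confirm the cleanup with a query rather than by eye, a quick profiling query helps. Note that most SQL Server collations are case-insensitive, so the COLLATE clause in this sketch forces a binary (case-sensitive) grouping; you would run it once per subject column:

-- Count each distinct Subj1 value case-sensitively; before cleansing,
-- variants like 'mysql' and 'MySQL' appear as separate rows
SELECT Subj1 COLLATE Latin1_General_BIN AS subject_value, COUNT(*) AS occurrences
FROM fudgenbooks_import
GROUP BY Subj1 COLLATE Latin1_General_BIN
ORDER BY subject_value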

Now that the data is clean, we’re ready to transform!

Step four: Transform the source data and load into the target tables

This next step is the critical one. Here’s where we move data from the source to the destination tables. Looking at our target schema, we must start our import with the tables that have no foreign keys. Therefore we first migrate data into the fb_publishers table, then fb_bookstore, fb_authors, and fb_subjects, in that order.

Regardless of the table we’re migrating, the approach is the same. First, create the SQL SELECT statement(s) which output the data as you need it to appear in the target table. Keep in mind these SQL techniques as you attempt to accomplish the task:

The DISTINCT keyword outputs unique rows of data. This is important for building lookup tables, and tables on the “one” side of the 1-M relationship.

The UNION keyword allows you to combine like SQL statements into one output. This is useful for addressing multi-valued attributes and repeating groups of data.

To combine columns of data, use the + operator, e.g. [Last Name] + ', ' + [First Name] as [Full Name]

To split one column into two, use the CHARINDEX function, e.g. left([Full Name], charindex(',', [Full Name]) - 1) as [Last Name] (note that CHARINDEX takes the string to find first, then the string to search)

Use CAST and CONVERT to change the data type of columns (a worked example of these string techniques follows this list)
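Here is a worked example of the splitting technique. It is a hypothetical sketch: the table some_source_table and its [Full Name] column (stored as ‘Last, First’) are stand-ins for illustration, not part of the Fudgenbooks data:

-- Split a hypothetical 'Last, First' column into two columns;
-- assumes every value contains exactly one comma
SELECT
    left([Full Name], charindex(',', [Full Name]) - 1) AS [Last Name],
    ltrim(right([Full Name], len([Full Name]) - charindex(',', [Full Name]))) AS [First Name]
FROM some_source_table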

First we load up the fb_publishers table:

-- 1st step: this statement populates the publishers table
insert into fb_publishers
select distinct Pub_id, [Publisher Name], [Publisher URL]
from fudgenbooks_import

Then the table data should look like this.

Next we load the fb_bookstore table:

-- 2nd: this statement populates the fb_bookstore table
insert into fb_bookstore
select distinct convert(int, [ISBN]), [Title], [Price], [Pages], [Pub_id]
from fudgenbooks_import

And the table data looks like:


Finally we move on to the fb_authors and fb_subjects tables. These require union queries since columnar data needs to be moved into rows:

-- 3rd: this statement populates the fb_authors table
insert into fb_authors
select Author1, convert(int, ISBN) from fudgenbooks_import
where Author1 is not null
union
select Author2, convert(int, ISBN) from fudgenbooks_import
where Author2 is not null
union
select Author3, convert(int, ISBN) from fudgenbooks_import
where Author3 is not null

And the table data looks like:

-- Last: this statement populates the fb_subjects table
insert into fb_subjects
select Subj1, convert(int, ISBN) from fudgenbooks_import
where Subj1 is not null
union
select Subj2, convert(int, ISBN) from fudgenbooks_import
where Subj2 is not null

Whew! That seems like a lot of work. We did, however, automate it to the point where it will work each and every time regardless of the actual data. One thing I might change in hindsight is to alter the ISBN column in the fudgenbooks_import table during the cleansing phase to avoid having to use convert(int, ISBN). (The convert was necessary because Access imported ISBN as a float, and converting a float straight to character data keeps the decimal, so an ISBN could have come over looking like 121398742.0!)
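Under that hindsight approach, the cleansing phase would include one extra statement, mirroring the hint used in part 2 of this lab (this assumes, as with the sample data here, that every ISBN value fits in an int):

-- Change ISBN from float to int once during cleansing, so the
-- INSERT statements above would not need convert(int, ISBN)
ALTER TABLE fudgenbooks_import ALTER COLUMN ISBN int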

Step five: Data verification

The final step is to perform data verification on the target data to make sure it does indeed have the same data as the source. The easiest (but farthest from complete) means of accomplishing this is to check row counts.

The following query compares the total row counts of the original data source and the corresponding table in the target data source.

SELECT
  (select count(*) from fb_bookstore) as fb_bookstore_count,
  (select count(*) from fudgenbooks_import) as fudgenbooks_count

Row counts work as long as you’re also checking to make sure the SQL commands are not generating errors. Data that errors out during transformation / loading will definitely affect row counts.
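You can extend the same idea to cover every target table in one query (a sketch; the expected counts for fb_publishers, fb_authors, and fb_subjects depend on how many distinct publishers, authors, and subjects the source rows contain):

-- Row counts for all four target tables alongside the source
SELECT
  (select count(*) from fb_publishers) as fb_publishers_count,
  (select count(*) from fb_bookstore) as fb_bookstore_count,
  (select count(*) from fb_authors) as fb_authors_count,
  (select count(*) from fb_subjects) as fb_subjects_count,
  (select count(*) from fudgenbooks_import) as fudgenbooks_count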

While not true data verification, it is a good idea to spot check some sample data for correctness. For example, this query joins all the tables back together to give you a view that closely resembles the original data.

SELECT b.isbn, title, price, pages, p.pub_id, pub_name, pub_url, author, subj
FROM fb_bookstore b
JOIN fb_publishers p ON b.pub_id = p.pub_id
JOIN fb_authors a ON b.isbn = a.isbn
JOIN fb_subjects s ON b.isbn = s.isbn
ORDER BY title
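For a set-based spot check on a single column, EXCEPT works well. For example, assuming the source column is named [Title] as in the earlier INSERT, an empty result from this sketch means every source title made it into the target:

-- Titles present in the source but missing from the target (should return no rows)
SELECT [Title] FROM fudgenbooks_import
EXCEPT
SELECT title FROM fb_bookstore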

QUESTIONS

1. Data cleansing and data transformation are similar tasks, but they do differ in some significant ways. How do they differ?

2. What are the challenges associated with automating the transformation and load phases of data migration?

3. In this example why is it not possible to re-create the original source data from the target data? Hint: what normal form was the source in when we started?


4. Now that you’ve experienced a migration, do you think explicit transactions might bring anything useful to the process? If so, what?

5. Many DBMSs like SQL Server have their own built-in import and export facilities which can be used for the extraction phase of migration. What might the advantages of using a 3rd-party application such as Microsoft Access be?

PART 2: PERFORM YOUR OWN DATA MIGRATION

In this part, you will perform your own data migration by applying the principles and techniques used in the walk-thru portion of the lab in part one.

OVERVIEW

Migrate data from the spreadsheet named fudgeconsulting.xls in the lab zip file into the SQL schema for Fudge Consulting (create_fudgeconsulting_schema.sql). Once you’ve performed the data migration, verify the data by “eyeballing” the original table against a SELECT statement that joins the target data model back together.

Here are some hints to make the process a little easier:

1. Use Microsoft Access to import the data as-is into SQL Server, just like we did in part 1 of the lab.

2. Execute the SQL file to create the target schema and then create a database diagram so that you have a picture of what the internal model looks like, and can determine which tables to migrate first.

3. Microsoft Access imports the [Proj Num] and [Emp No] columns as float. During the data cleansing step, alter the fudgeconsulting_import table and change the columns’ data type to int. This will make the transformation / load step much easier. For example:

ALTER TABLE fudgeconsulting_import ALTER COLUMN [Proj Num] int

You’ll have to do this a second time for the [Emp No] column.

4. Remember to use DISTINCT to only return the unique rows. Also, when you need to split the last name and first name out of [Emp Name], use expressions. Hint: this gives you the last name (the matching first-name expression is sketched after this list): right([Emp Name], len([Emp Name]) - charindex(' ', [Emp Name]))

5. You don’t need to worry about [Total billed]; it has been implemented as a calculated column with a function.


6. When you verify the migration, run two SQL queries and compare them: one selecting everything from fudgeconsulting_import, and one joining the fc_* tables back together (as in part 1, step five).
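For hint 4, the matching first-name expression would look like the sketch below. This assumes [Emp Name] is stored as ‘First Last’ with a single space between the parts; a value without a space would make charindex return 0 and the expression error out:

-- Split [Emp Name] into first and last name (assumes 'First Last' format)
SELECT DISTINCT
    left([Emp Name], charindex(' ', [Emp Name]) - 1) AS [First Name],
    right([Emp Name], len([Emp Name]) - charindex(' ', [Emp Name])) AS [Last Name]
FROM fudgeconsulting_import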

QUESTIONS

1. Write down the SQL statement you used to transform data into the fc_rates table.

2. Write down the SQL statement you used to transform data into the fc_projects table.

3. Write down the SQL statement you used to transform data into the fc_employees table.

4. Write down the SQL statement you used to transform data into the fc_consulting table.

5. Write down the SQL statement you used to verify the data is the same as the source. (The output should look like the bottom grid in #6 above.)
