dwdm lab

INDEX

S.No. Title Signature

1 Study of: Requirement Estimation, Conceptual Design, Logical Design in Data Warehousing.

2 Study of: Statistical Package for Social Sciences (SPSS) tool.

3 Study of: WEKA tool for Data Mining purposes.

4 Prepare a database for any user defined problem and apply various commands (Creation, Selection, Insertion, Deletion, Joining, etc.) on tables using SQL.

5 Let us assume we have completed a survey of 12 people who have completed a weight reduction program. Each person is sequentially assigned an ID.

Enter the following details for 12 persons:
1. ID Number (id)
2. Sex (sex)
3. Height, inches (height)
4. Weight before the program (before)
5. Weight after the program (after)
6. 8 questions about extroversion (e1 through e8)

Then do the following analysis using:
1. Pearson Correlation
2. Independent Sample T-Test
3. Paired Sample T-Test

1. Study of Requirement Estimation, Conceptual Design, Logical Design in Data Warehousing.

Requirement Estimation:

The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple people are talking about the project scope in the same meeting.

The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting and analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements.

Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and most importantly, a concrete project plan indicating the finishing date of the data warehousing project.

Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBAs will attest, this can happen very quickly after the project goes live.

Conceptual Design:

A conceptual data model identifies the highest-level relationships between the different entities. Features of a conceptual data model include:

Includes the important entities and the relationships among them.

No attribute is specified.

No primary key is specified.

The figure below is an example of a conceptual data model.

From the figure above, we can see that the only information shown via the conceptual data model is the entities that describe the data and the relationships between those entities. No other information is shown through the conceptual data model.

Logical Design:

Starting from the conceptual design, it is necessary to determine the logical schema of the data. We use the ROLAP (Relational On-Line Analytical Processing) model to represent multidimensional data. ROLAP uses the relational data model, which means that data is stored in relations. Given the DFM (Dimensional Fact Model) representation of multidimensional data, two schemas are used:

Star Schema

Snowflake Schema

Star Schema:

Each dimension is represented by a relation such that:

The primary key of the relation is the primary key of the dimension.

The attributes of the relation describe all aggregation levels of the dimension.

A fact is represented by a relation such that:

The primary key of the relation is the set of primary keys imported from all the dimension tables.

The attributes of the relation are the measures of the fact.

Advantage and Disadvantage:

Advantage: few joins are needed during query execution.

Disadvantage: dimension tables are denormalized, and denormalization introduces redundancy.
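As a rough illustration of the star schema rules above, the sketch below builds one fact table and two denormalized dimension tables in an in-memory SQLite database through Python's sqlite3 module. The table and column names (dim_product, dim_date, fact_sales and so on) and the rows are hypothetical, and SQLite only stands in for whatever relational server is actually used.

```python
import sqlite3

# Hypothetical star schema: two dimension tables and one fact table.
star_db = sqlite3.connect(":memory:")
star_db.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product    TEXT,
    category   TEXT          -- higher aggregation level stored in the same row
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    month   TEXT,
    year    INTEGER
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,      -- the measure
    PRIMARY KEY (product_id, date_id)
);
INSERT INTO dim_product VALUES
    (1, 'Pen', 'Stationery'), (2, 'Ink', 'Stationery'), (3, 'Mug', 'Kitchen');
INSERT INTO dim_date VALUES (1, '2020-01', 2020), (2, '2020-02', 2020);
INSERT INTO fact_sales VALUES (1, 1, 10), (2, 1, 5), (3, 2, 7), (1, 2, 3);
""")

# One join is enough to aggregate by category, because the category
# is stored directly in the product dimension.
star_totals = star_db.execute("""
    SELECT p.category, SUM(f.quantity)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(star_totals)
```

Note that the text 'Stationery' is repeated in two product rows; this is the redundancy that denormalization introduces.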

Snowflake Schema:

Each (primary) dimension is represented by a relation such that:

The primary key of the relation is the primary key of the dimension.

The attributes of the relation depend directly on the primary key.

A set of foreign keys is used to access information at different levels of aggregation. Such information is part of the secondary dimensions and is stored in dedicated relations.

A fact is represented by a relation such that:

The primary key of the relation is the set of primary keys imported from all and only the primary dimension tables.

The attributes of the relation are the measures of the fact.

Advantage and Disadvantage:

Advantage: denormalization is reduced, so less memory space is required.

Disadvantage: a lot of joins can be required if queries involve attributes in secondary dimension tables.
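The snowflake layout can be sketched in the same way. Here the category moves out of the product dimension into a secondary dimension table, so the same aggregation needs one more join. As before, every table name, column name and row is hypothetical, and SQLite via Python's sqlite3 module is only a stand-in.

```python
import sqlite3

# Hypothetical snowflake schema: the category level is normalized out
# of the primary dimension into its own (secondary) dimension table.
snow_db = sqlite3.connect(":memory:")
snow_db.executescript("""
CREATE TABLE dim_category (               -- secondary dimension
    category_id INTEGER PRIMARY KEY,
    category    TEXT
);
CREATE TABLE dim_product (                -- primary dimension
    product_id  INTEGER PRIMARY KEY,
    product     TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER PRIMARY KEY REFERENCES dim_product(product_id),
    quantity   INTEGER                    -- the measure
);
INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Kitchen');
INSERT INTO dim_product VALUES (1, 'Pen', 1), (2, 'Ink', 1), (3, 'Mug', 2);
INSERT INTO fact_sales VALUES (1, 13), (2, 5), (3, 7);
""")

# Aggregating by category now takes two joins instead of one.
snowflake_totals = snow_db.execute("""
    SELECT c.category, SUM(f.quantity)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
print(snowflake_totals)
```

Each category name is stored once, at the cost of the extra join.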

2. Study of Statistical Package for Social Sciences (SPSS) tool

SPSS is a computer program used for survey authoring and deployment (IBM SPSS Data Collection), data mining (IBM SPSS Modeler), text analytics, statistical analysis, and collaboration and deployment (batch and automated scoring services).

Statistics included in the base software:

Descriptive statistics: Cross tabulation, Frequencies, Descriptives, Explore, Descriptive Ratio Statistics

Bivariate statistics: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), Nonparametric tests

Prediction for numerical outcomes: Linear regression

Prediction for identifying groups: Factor analysis, cluster analysis (two-step, K-means, hierarchical), Discriminant

The many features of SPSS are accessible via pull-down menus or can be programmed with a proprietary 4GL command syntax language. Command syntax programming has the benefits of reproducibility, simplifying repetitive tasks, and handling complex data manipulations and analyses. Additionally, some complex applications can only be programmed in syntax and are not accessible through the menu structure.

The pull-down menu interface also generates command syntax; this can be displayed in the output, although the default settings have to be changed to make the syntax visible to the user. The syntax can also be pasted into a syntax file using the "Paste" button present in each menu. Programs can be run interactively or unattended, using the supplied Production Job Facility.

Additionally, a "macro" language can be used to write command language subroutines, and a Python programmability extension can access the information in the data dictionary and data and dynamically build command syntax programs. The Python programmability extension, introduced in SPSS 14, replaced the less functional SAX Basic "scripts" for most purposes, although SAX Basic remains available. In addition, the Python extension allows SPSS to run any of the statistics in the free software package R. From version 14 onwards, SPSS can be driven externally by a Python or a VB.NET program using supplied "plug-ins".

SPSS places constraints on internal file structure, data types, data processing and matching files, which together considerably simplify programming. SPSS datasets have a 2-dimensional table structure where the rows typically represent cases (such as individuals or households) and the columns represent measurements (such as age, sex or household income). Only 2 data types are defined: numeric and text (or "string"). All data processing occurs sequentially case-by-case through the file. Files can be matched one-to-one and one-to-many, but not many-to-many.

The graphical user interface has two views which can be toggled by clicking on one of the two tabs in the bottom left of the SPSS window. The 'Data View' shows a spreadsheet view of the cases (rows) and variables (columns). Unlike spreadsheets, the data cells can only contain numbers or text and formulas cannot be stored in these cells. The 'Variable View' displays the metadata dictionary where each row represents a variable and shows the variable name, variable label, value label(s), print width, measurement type and a variety of other characteristics. Cells in both views can be manually edited, defining the file structure and allowing data entry without using command syntax. This may be sufficient for small datasets. Larger datasets such as statistical surveys are more often created in data entry software, or entered during computer-assisted personal interviewing, by scanning and using optical character recognition and optical mark recognition software, or by direct capture from online questionnaires. These datasets are then read into SPSS.

SPSS can read and write data from ASCII text files (including hierarchical files), other statistics packages, spreadsheets and databases. SPSS can read and write to external relational database tables via ODBC and SQL.

Statistical output is to a proprietary file format (*.spv file, supporting pivot tables) for which, in addition to the in-package viewer, a stand-alone reader can be downloaded. The proprietary output can be exported to text or Microsoft Word, PDF, Excel, and other formats. Alternatively, output can be captured as data (using the OMS command), as text, tab-delimited text, PDF, XLS, HTML, XML, SPSS dataset or a variety of graphic image formats (JPEG, PNG, BMP and EMF).

Working Procedure

You always begin by defining a set of variables, and then you enter data for the variables to create a number of cases. For example, if you are doing an analysis of automobiles, each car in your study would be a case. The variables that define the cases could be things such as the year of manufacture, horsepower, and cubic inches of displacement. Each car in the study is defined as a single case, and each case is defined as a set of values assigned to the collection of variables. Every case has a value for each variable.

Variables have types. That is, each variable is defined as containing a specific kind of number. For example, a scale variable is a numeric measurement, such as weight or miles per gallon. A categorical variable contains values that define a category; for example, a variable named gender could be a categorical variable defined to contain only values 1 for female and 2 for male. Things that make sense for one type of variable don't necessarily make sense for another. For example, it makes sense to calculate the average miles per gallon, but not the average gender.
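The difference between the two variable types can be pictured with a small Python sketch. This is not SPSS syntax; the three cases and their values are hypothetical, and only the 1 = female, 2 = male coding comes from the paragraph above.

```python
from collections import Counter
from statistics import mean

# Hypothetical cases: each dict is one case, each key is one variable.
cases = [
    {"mpg": 30.0, "gender": 1},
    {"mpg": 24.0, "gender": 2},
    {"mpg": 27.0, "gender": 2},
]
gender_labels = {1: "female", 2: "male"}  # value labels for the category codes

# A mean is meaningful for a scale variable ...
average_mpg = mean(case["mpg"] for case in cases)

# ... while a categorical variable is summarized by counting its categories.
gender_counts = Counter(gender_labels[case["gender"]] for case in cases)

print(average_mpg, dict(gender_counts))
```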

After your data is entered into SPSS — your cases are all defined by values stored in the variables — you can run an analysis. You have already finished the hard part. Running an analysis on the data is much easier than entering the data. To run an analysis, you select the one you want to run from the menu, select appropriate variables, and click the OK button. SPSS reads through all your cases, performs the analysis, and presents you with the output.

You can instruct SPSS to draw graphs and charts the same way you instruct it to do an analysis. You select the desired graph from the menu, assign variables to it, and click OK.

When preparing SPSS to run an analysis or draw a graph, the OK button is unavailable until you have made all the choices necessary to produce output. Not only does SPSS require that you select a sufficient number of variables to produce output, it also requires that you choose the right kinds of variables. If a categorical variable is required for a certain slot, SPSS will not allow you to choose any other kind. Whether the output makes sense is up to you and your data, but SPSS makes certain that the choices you make can be used to produce some kind of result.

All output from SPSS goes to the same place — a dialog box named SPSS Viewer. It opens to display the results of whatever you've done. After you have output, if you perform some action that produces more output, the new output is displayed in the same dialog box. And almost anything you do produces output.

Several limitations, however, exist with use of SPSS, including:

1. It is expensive (the full version is over $1000, with annual license fees).

2. Student versions are not fully functional (i.e., there are restrictions on the number of cases and variables and on the advanced statistics available).

Fortunately, PSPP, a free replacement, is available at little or no cost and has unlimited cases and variables. Some advanced statistics (notably GLM), however, are not yet supported. Being able to use and teach SPSS/PSPP is almost a requirement in many university social science departments.

3. Study of WEKA tool for Data Mining Algorithm

"WEKA" stands for the Waikato Environment for Knowledge Analysis, which was developed at the University of Waikato in New Zealand. WEKA is extensible and has become a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost every platform. WEKA supports many different standard data mining tasks such as data pre-processing, classification, clustering, regression, visualization and feature selection.

The basic premise of the application is to provide a program that can be trained to perform machine learning tasks and derive useful information in the form of trends and patterns. It is user friendly, with a graphical interface that allows for quick set up and operation. WEKA operates on the assumption that the user data is available as a flat file or relation; this means that each data object is described by a fixed number of attributes that usually are of a specific type, normally alpha-numeric or numeric values. The WEKA application gives novice users a tool to identify hidden information in database and file systems, with simple to use options and visual interfaces.

There are three major implemented schemes in WEKA:

(1) Implemented schemes for classification.
(2) Implemented schemes for numeric prediction.
(3) Implemented "meta-schemes".

Besides actual learning schemes, WEKA also contains a large variety of tools that can be used for pre-processing datasets, so that one can focus on the algorithm without worrying about details such as reading the data from files, implementing filtering algorithms and providing code to evaluate the results.

Starting with the WEKA

The following figure is an example of the initial opening screen on a computer with Windows XP.

There are four options available on this initial screen.

♦ Simple CLI- Provides users without a graphic interface option the ability to execute commands from a terminal window.
♦ Explorer- The graphical interface used to conduct experimentation on raw data.
♦ Experimenter- This option allows users to conduct different experimental variations on data sets and perform statistical manipulation.
♦ Knowledge Flow- Basically the same functionality as Explorer with drag and drop functionality. The advantage of this option is that it supports incremental learning from previous results.

http://www.cs.waikato.ac.nz/ml/weka/

The following figure shows the opening screen with the available options.

There are six tabs:

1. Preprocess- used to choose the data file to be used by the application
2. Classify- used to test and train different learning schemes on the preprocessed data file under experimentation
3. Cluster- used to apply different tools that identify clusters within the data file
4. Associate- used to apply different rules to the data file that identify association within the data
5. Select attributes- used to apply different rules to reveal changes based on selected attributes inclusion or exclusion from the experiment
6. Visualize- used to see what the various manipulation produced on the data set in a 2D format, in scatter plot and bar graph output

Preprocessing

There are rules for the type of data that WEKA will accept. There are three options for presenting data into the program.

♦ Open File- allows the user to select files residing on the local machine or recorded medium.
♦ Open URL- provides a mechanism to locate a file or data source from a different location specified by the user.
♦ Open Database- allows the user to retrieve files or data from a database source provided by the user.

Once the initial data has been selected and loaded the user can select options for refining the experimental data. The options in the preprocess window include selection of optional filters to apply and the user can select or remove different attributes of the data set as necessary to identify specific information. The ability to pick from the available attributes allows users to separate different parts of the data set for clarity in the experimentation. The user can modify the attribute selection and change the relationship among the different attributes by deselecting different choices from the original data set. There are many different filtering options available within the preprocessing window and the user can select the different options based on need and type of data present.
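WEKA's expectation that data arrives as a flat file with a fixed set of typed attributes can be pictured with ARFF, the text format WEKA normally reads. ARFF is not named in this write-up, so its use here is an assumption; the helper `to_arff`, the relation name and the rows below are all hypothetical.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    `attributes` is a list of (name, kind) pairs, where kind is either
    the string 'numeric' or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            kind = "{" + ",".join(kind) + "}"   # nominal attribute
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Every row has the same fixed number of attributes, in the declared order.
arff_text = to_arff(
    "weather",
    [("outlook", ["sunny", "rainy"]),
     ("temperature", "numeric"),
     ("play", ["yes", "no"])],
    [("sunny", 85, "no"), ("rainy", 70, "yes")],
)
print(arff_text)
```

Saving the printed text with an .arff extension should give a file that the Open File option can load, provided the attribute declarations match the data rows.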

Classify

The user has the option of applying many different algorithms to the data set that would in theory produce a representation of the information used to make observation easier. It is difficult to identify which of the options would provide the best output for the experiment. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices.

The four test modes that can be used on the data set are:

1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split

There is the option of applying any or all of the modes to produce results that can be compared by the user. Additionally inside the test options toolbox there is a dropdown menu so the user can select various items to apply that depending on the choice can provide output options such as saving the results to file or specifying the random seed value to be applied for the classification.
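To make the difference between the percentage split and cross-validation modes concrete, here is a minimal Python sketch of how each one partitions a data set. It is an illustration only, not WEKA's implementation; the function names, the seed handling and the 66 % figure are assumptions chosen for the example.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle, then use the first train_percent % for training, the rest for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


def cross_validation_folds(instances, k, seed=1):
    """Yield k (train, test) pairs; every instance is tested exactly once."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test_part = folds[i]
        train_part = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train_part, test_part


data = list(range(12))                      # stand-in for 12 instances
train, test = percentage_split(data, 66)    # one split: 8 train, 4 test
folds = list(cross_validation_folds(data, 3))
print(len(train), len(test), [len(t) for _, t in folds])
```

The seed plays the same role as the random seed value mentioned above: changing it changes which instances land in which part.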

Cluster

The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the classifier tab: use training set, supplied test set and percentage split. The fourth option is classes to clusters evaluation, which measures how well the clusters found agree with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets.
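The idea behind classes to clusters evaluation can be sketched as follows. This is a simplified illustration, assuming each cluster is simply labelled with its most frequent class; WEKA's own mapping of classes to clusters may differ, and the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def classes_to_clusters_error(cluster_ids, class_labels):
    """Label each cluster with its majority class and return the
    fraction of instances whose class differs from that label."""
    by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster].append(label)
    wrong = 0
    for labels in by_cluster.values():
        majority_count = Counter(labels).most_common(1)[0][1]
        wrong += len(labels) - majority_count
    return wrong / len(class_labels)


clusters = [0, 0, 0, 1, 1, 1]             # cluster assigned to each instance
classes = ["a", "a", "b", "b", "b", "b"]  # pre-assigned class of each instance
error = classes_to_clusters_error(clusters, classes)
print(error)
```

A lower value means the clusters line up more closely with the pre-assigned classes.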

Associate

The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses start to yield the results.

Visualization

The last tab in the window is the visualization tab. Within the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the fruit of their efforts in a two-dimensional representation of the information.

The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If necessary there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to view its contents for analysis. A grid pattern of the plots allows the user to select the attribute positioning to their liking and for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another, providing flexibility.

The scatter plot matrix gives the user a visual representation of the manipulated data sets for selection and analysis. The choices are the attributes across the top and the same from top to bottom, giving the user easy access to pick the area of interest. Clicking on a plot brings up a separate window of the selected scatter plot. The user can then look at a visualization of the data of the attributes selected and select areas of the scatter plot with a selection window or by clicking on the points within the plot to identify the point’s specific information. Figure 10 shows the scatter plot for two attributes and the points derived from the data set.

There are a few options to view the plot that could be helpful to the user. It is formatted similar to an X/Y graph, yet it can show any of the attribute classes that appear on the main scatter plot matrix. This is handy when the scale of the attribute is unable to be ascertained in one axis over the other. Within the plot the points can be adjusted by utilizing a feature called jitter. This option moves the individual points so that in the event of close data points users can reveal hidden multiple occurrences within the initial plot.
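The effect of jitter can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a small uniform random offset is added to each coordinate; it is not WEKA's own code, and the function name, offset size and points are hypothetical.

```python
import random


def jitter(points, amount, seed=0):
    """Return a copy of the points with a small random offset added to
    each coordinate, so that identical points no longer overlap."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount)) for x, y in points]


# Three instances share the same coordinates and would be drawn as one dot.
points = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0), (3.0, 4.0)]
jittered = jitter(points, amount=0.1)

print(len(set(points)), "distinct positions before jitter")
print(len(set(jittered)), "distinct positions after jitter")
```

The offsets are for display only; the underlying data values are not changed.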

Points on the plot:

♦ Polyline-- can be used to segment different values for additional visualization clarity on the plot. This is useful when there are many data points represented on the graph.
♦ Rectangle-- this tool is helpful to select instances within the graph for copying or clarification.
♦ Polygon-- users can connect points to segregate information and isolate points for reference.

4. Prepare a database for any user defined problem and apply various commands (Creation, Selection, Insertion, Deletion, Joining, etc.) on tables using SQL.

The following database was made for the Minor Project UVS using MySQL. The database UVS contains three tables. The statements used to work with it are as follows:

1. SHOW DATABASES statement: Used for showing all the databases which are already present in the system. The output screenshot below shows all the databases, including our database UVS.

Syntax:

mysql> SHOW DATABASES;

Now, to use our database UVS and create tables, we will use the USE statement as below:

Syntax:

mysql> USE database_name;

mysql> USE uvs;

2. SHOW TABLES statement: This statement is used to show all the tables that are currently present in the database.

Syntax:

mysql> SHOW TABLES;

This statement shows all the tables present in the currently selected database.

In our case, the tables are:

group_table
question_table
vote_table

3. DESCRIBE statement: This statement is used to see the description of a specific table. The description contains the columns Field, Type, Null, Key, Default and Extra.

Syntax:

mysql> DESCRIBE table_name;

mysql> DESCRIBE group_table;

4. INSERT statement: This statement is used to enter data or values into the table.

Syntax:

mysql> INSERT INTO table_name VALUES ('value1', 'value2', 'value3');

mysql> INSERT INTO group_table VALUES ('5', 'gen', 'Questions from General Group');

5. SELECT statement: This statement is used to select data from the table.

Syntax:

mysql> SELECT column_name FROM table_name;

and

mysql> SELECT * FROM table_name;

SELECT, when used with a WHERE clause, returns only specific rows from the table.

Syntax:

mysql> SELECT column_name FROM table_name WHERE column_name = value;

6. UPDATE statement: The UPDATE statement is used to update existing records in a table.

Syntax:

mysql> UPDATE table_name SET column1=value1, column2=value2, ... WHERE some_column=some_value;

7. DELETE statement:

This statement is used to delete rows from a table.

Syntax:

mysql> DELETE FROM table_name WHERE some_column= some_value;
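The statements above, together with table creation and joining, can be tried end to end with the following Python sketch. It uses the sqlite3 module with an in-memory database as a stand-in for MySQL, so no server is needed. The table names come from the list above, but every column name and every row except the 'gen' group is an assumption made for illustration and is not taken from the UVS project.

```python
import sqlite3

db = sqlite3.connect(":memory:")
cur = db.cursor()

# Creation (column names are assumed, not taken from the UVS project)
cur.execute("""CREATE TABLE group_table (
                   group_id INTEGER PRIMARY KEY,
                   group_code TEXT,
                   description TEXT)""")
cur.execute("""CREATE TABLE question_table (
                   question_id INTEGER PRIMARY KEY,
                   group_id INTEGER,
                   question_text TEXT)""")

# Insertion
cur.execute("INSERT INTO group_table VALUES (5, 'gen', 'Questions from General Group')")
cur.execute("INSERT INTO group_table VALUES (6, 'sci', 'Questions from Science Group')")
cur.executemany("INSERT INTO question_table VALUES (?, ?, ?)",
                [(1, 5, "Sample question A"), (2, 6, "Sample question B")])

# Selection with a WHERE clause
selected = cur.execute(
    "SELECT description FROM group_table WHERE group_code = 'gen'").fetchall()

# Update
cur.execute("UPDATE group_table SET description = 'General questions' WHERE group_id = 5")

# Joining: each question together with the description of its group
joined = cur.execute("""
    SELECT q.question_text, g.description
    FROM question_table AS q
    JOIN group_table AS g ON g.group_id = q.group_id
    ORDER BY q.question_id""").fetchall()

# Deletion
cur.execute("DELETE FROM question_table WHERE question_id = 2")
remaining = cur.execute("SELECT COUNT(*) FROM question_table").fetchone()[0]

print(selected, joined, remaining)
```

The SQL text inside the strings should also run at the mysql> prompt with little or no change, although data types and some details differ between SQLite and MySQL.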

5. Let us assume we have completed a survey of 12 people who have completed a weight reduction program. Each person is sequentially assigned an ID.

Enter the following details for 12 persons:

1. ID Number (id)
2. Sex (sex)
3. Height, inches (height)
4. Weight before the program (before)
5. Weight after the program (after)
6. 8 questions about extroversion (e1 through e8)

Then do the following analysis using:

1. Pearson Correlation
2. Independent Sample T-Test
3. Paired Sample T-Test

Let's assume we've completed a survey of twelve people who have completed a weight reduction program. Each person is sequentially assigned an ID number. We've asked them their height, original weight, sex, weight after the weight reduction program, and eight questions from an extroversion questionnaire. First, we'll need to open SPSS from the desktop. It should look something like Figure 1 below. It's always best to plan your data set before you just randomly plug in variables. We're going to enter the data for each person as follows:

Table 1

ID number (id)
sex (sex)
height, inches (height)
weight before the program (before)
weight after the program (after)
eight questions about extroversion (e1 through e8)

Since SPSS is rather specific about what you name your variables (variable names are limited to certain alphanumeric characters and a length of eight characters), we're going to use the names in parentheses as our variable names.

Figure 1:

(Note: The numbers are going to coincide with the numbers in the figures so you'll know exactly where you should be as you follow this guide.)

1) Double-click on the "var" at the top of the column. A dialog box will appear like in Figure 2.

2) Change the default text in the field that the arrow is pointing to ("VAR00001") to "id", the first of our variable names.

Figure 2:

3) Click on the "Type" button. This brings up the box in Figure 3.

4) Notice that the type is "Numeric."

5) Change the "Width" to 3 and the "Decimal Places" to 0. Click on Continue for this box. Then click on the OK for the first box. You've now defined the first variable.

Figure 3:

6) (See Figure 1) Click into the first white box under the "id" column. Type "1." This is the ID number of the first subject. Proceed down the column entering ID numbers from "1" to "12."

7) Double click on the top of the next column to the right of id to name it "sex." Under "Type" set the width to 1 and "Decimal Places" to 0. After you're done with that, click on the "Labels" button (see Figure 2); this will bring up a dialog box like the one in Figure 4.

8) Set the "Variable Label" to "Sex."

Figure 4:

9) Under "Value" type a 1, and under "Value Label" type "Male." Click the "Add" button. Now make "Value" 2 and "Value Label" "Female." Click the "Add" button again. Click Continue. Click OK. What we just did here is use numbers to represent the values for sex.

You'll need to put in twelve values under the sex column. Use the following data: 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2.

Now, using what you've learned so far, create the remaining variables with the data given below:

3. Variable Name: height

o Type: Numeric, Width = 2, Decimal Places = 0

o Enter these data: 76, 59, 67, 65, 63, 72, 70, 68, 69, 74, 68 and 63.

4. Variable Name: before

o Type: Numeric, Width = 3, Decimal Places = 0

o Labels: Variable Label = "Weight before"

5. Variable Name: after

o Labels: Variable Label = "Weight after"

6. Variable Name: e1

Your file should look like Figure 5. You can save your file under the pull down menu File: Save.

Figure 5:

Pearson Correlation:

This part of the beginner's guide to SPSS will walk you through performing a simple correlation test on the data file created in the first part. The Pearson's Product Moment Correlation Coefficient tells us how well two sets of continuous data correlate to each other. The value can fall between -1.00 (perfect negative correlation) and +1.00 (perfect positive correlation), with 0.00 meaning no linear correlation. A p value tells us if the Pearson's is significant or not. Generally p values under 0.05 are considered significant.

Figure 1:

1) Select the option under Statistics: Correlate: Bivariate. This will bring up a menu like the one in Figure 2.

Figure 2:

2) Highlight "Weight before" by clicking on it once.

3) Transfer the variable to the box on the right by clicking once on the arrow button. Repeat this procedure for the "Height (inches)" variable. Now click on OK.

SPSS is going to generate some correlation data now. It should appear as in Figure 3.

Figure 3:

Now let's interpret these results.

4) In general you'll notice that both height and weight are set up in a matrix so that their columns intersect. Where height and height intersect, obviously there is going to be a perfect correlation (1.00). Of course this is unimportant (as well as painfully obvious), so there is no p value given.

5) Now for the data that actually makes sense, note the values that occur in the intersections of the height and weight columns/rows. The Pearson's value is .876, and the significance is .000 (which doesn't mean that it is zero, only that it is lower than .001). The asterisks (**) indicate significant values.

From this we can say that there is a significant positive correlation between height and weight.
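For readers who want to see what the coefficient measures, the calculation can be sketched in plain Python. The function below is a minimal illustration of the Pearson formula, not what SPSS runs internally, and the x and y lists are toy numbers rather than the survey data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy values for illustration only (not the survey data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_reversed = pearson_r(x, [10, 8, 6, 4, 2])   # a perfectly decreasing line

print(round(r, 4), round(r_reversed, 4))
```

The second call shows why the coefficient is not limited to positive values: a perfectly decreasing relationship gives -1.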

Independent Sample T-Test:

The independent t-test is used to test for a difference between two independent groups (like males and females) on the means of a continuous variable.

1) Select Statistics: Compare Means: Independent Samples T-Test (Figure 1). A menu like that in Figure 2 should be displayed.

Figure 1:

Figure 2:

2) Select continuous variables that you want to test from the list.

3) Click on the arrow that will send them to the "Test Variable(s)" box.

2) Select the categorical variable from which you are going to extract the groups for comparison and send it to the "Grouping Variable" box by pressing the appropriate arrow.

4) Click on the "Define Groups" button. You are confronted with a small dialog box asking you for two groups. In this case, I'm using 1 and 2 (males and females). Click Continue when you're done. Then click OK when you're ready to get the output.

Figure 3:

5) These are descriptive statistics concerning your variables.

6) This first part is important. You see, there is a possibility for two t-tests to occur here. You have to know which one to use. When comparing groups like this, their variances must be relatively similar for the first t-test to be used. Levene's test checks for this. If the significance for Levene's test is 0.05 or below, then the "Equal Variances Not Assumed" test (the one on the bottom) is used. Otherwise you'll use the "Equal Variances Assumed" test (the one on the top). In this case the significance is 0.287, so we'll be using the "Equal Variances Assumed" one.

7) Here's your t statistic.

8) These are the degrees of freedom (df).

9) Here's your significance (two-tailed).
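The t statistic and degrees of freedom on the "Equal Variances Assumed" row can be reproduced by hand. The Python sketch below is a minimal illustration using the sex and height values entered earlier in this exercise. Height is chosen here only because its data are listed in full, so the result need not match the figure referred to above. The p value is not computed; it would be read from a t table or from the SPSS output.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Student's t statistic with pooled variance (equal variances
    assumed) and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2


# Values entered earlier: sex (1 = male, 2 = female) and height in inches
sex = [1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2]
height = [76, 59, 67, 65, 63, 72, 70, 68, 69, 74, 68, 63]

males = [h for s, h in zip(sex, height) if s == 1]
females = [h for s, h in zip(sex, height) if s == 2]

t_stat, df = independent_t(males, females)
print(round(t_stat, 3), df)
```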

Paired Sample T-Test:

1) Select Statistics: Compare Means: Paired Samples T-Test (Figure 1). A menu like that in Figure 2 should be displayed.

Figure 1:

2) Highlight the two variables upon which you want to run your analysis. When you have the two highlighted, send them over to the right column with the arrow button. You can then define more variable pairs if you wish, but if that's all you want, then just click on OK.

Figure 2:

3) This table is some relevant descriptive data concerning your variables.

Figure 3:

4) This is the T statistic.

5) This is the p-value (significance) of the T statistic.
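The paired test works on the difference between the two measurements of each person. A minimal Python sketch of the t statistic follows; it is an illustration of the formula, not SPSS's own routine. The before and after weights below are hypothetical, since the survey's weight values are not listed in this write-up.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Paired-samples t statistic and degrees of freedom, computed
    from the per-person differences first - second."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical weights for three people (illustration only)
before = [200, 180, 160]
after = [195, 170, 145]

t_paired, df_paired = paired_t(before, after)
print(round(t_paired, 4), df_paired)
```

With all twelve cases entered, the same function would be called with the full before and after columns.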
