Post on 27-Aug-2014
INDEX
S.No.  Title  Signature
1. Study of: Requirement Estimation, Conceptual Design, Logical Design in Data Warehousing.
2. Study of: Statistical Package for Social Sciences (SPSS) tool.
3. Study of: WEKA tool for Data Mining purposes.
4. Prepare a database for any user-defined problem and apply various commands (Creation, Selection, Insertion, Deletion, Joining, etc.) on tables using SQL.

Let us assume we have completed a survey of 12 people who have completed a weight reduction program. Each person is sequentially assigned an ID. Enter the following details for the 12 persons:
1. ID Number (id)
2. Sex (sex)
3. Height, inches (height)
4. Weight before the program (before)
5. Weight after the program (after)
6. 8 questions about extroversion (e1 through e8)
Then do the following analysis using:
1. Pearson Correlation
2. Independent Sample T-Test
3. Paired Sample T-Test
1. Study of Requirement Estimation, Conceptual Design, Logical Design in Data Warehousing.

Requirement Estimation: The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple people talk about the project scope in the same meeting.

The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting and analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements. Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and most importantly, a concrete project plan indicating the finishing date of the data warehousing project.

Based on the information gathered above, a disaster recovery plan needs to be developed so that the data warehousing system can recover from accidents that disable the system. Without an effective backup and restore strategy, the system will only last until the first major disaster, and, as many data warehousing DBAs will attest, this can happen very quickly after the project goes live.

Conceptual Design: A conceptual data model identifies the highest-level relationships between the different entities. Features of a conceptual data model include:
Includes the important entities and the relationships among them.
No attribute is specified.
No primary key is specified.
The figure below is an example of a conceptual data model.
From the figure above, we can see that the only information shown via the conceptual data model is the entities that describe the data and the relationships between those entities. No other information is shown through the conceptual data model.

Logical Design: Starting from the conceptual design, it is necessary to determine the logical schema of the data. We use the ROLAP (Relational On-Line Analytical Processing) model to represent multidimensional data. ROLAP uses the relational data model, which means that data is stored in relations. Given the DFM (Dimensional Fact Model) representation of multidimensional data, two schemas are used:
Star Schema
Snowflake Schema
Star Schema:
Each dimension is represented by a relation such that:
  The primary key of the relation is the primary key of the dimension.
  The attributes of the relation describe all aggregation levels of the dimension.
A fact is represented by a relation such that:
  The primary key of the relation is the set of primary keys imported from all the dimension tables.
  The attributes of the relation are the measures of the fact.
Advantage and Disadvantage:
  Few joins are needed during query execution.
  Dimension tables are denormalized.
  Denormalization introduces redundancy.
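The star schema rules above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the sales scenario, the table names (dim_product, dim_date, fact_sales), the columns, and the sample rows are all hypothetical and do not come from the manual.

```python
import sqlite3

# Hypothetical sales example; every table, column and value is assumed.
star_conn = sqlite3.connect(":memory:")
star_conn.executescript("""
-- One denormalized relation per dimension: all aggregation levels
-- (product -> category -> department) live in the same table.
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT,
    department   TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    day     TEXT,
    month   TEXT,
    year    INTEGER
);
-- Fact relation: its key is the set of keys imported from the dimensions,
-- and its other attributes are the measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    revenue    REAL,
    PRIMARY KEY (product_id, date_id)
);
""")
star_conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?, ?)", [
    (1, "Pen", "Stationery", "Office"),
    (2, "Notebook", "Stationery", "Office"),
    (3, "Lamp", "Lighting", "Home"),
])
star_conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)", [
    (1, "2014-08-26", "2014-08", 2014),
    (2, "2014-08-27", "2014-08", 2014),
])
star_conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (1, 1, 10, 15.0), (2, 1, 4, 12.0), (3, 2, 1, 30.0), (1, 2, 5, 7.5),
])

# Aggregating by category needs a single join, because the category
# is stored directly in the product dimension.
star_rows = star_conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(star_rows)
```

Note that the category and department values repeat in every product row; that repetition is the redundancy introduced by denormalization.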
Snowflake Schema:
Each (primary) dimension is represented by a relation such that:
  The primary key of the relation is the primary key of the dimension.
  The attributes of the relation depend directly on the primary key.
  A set of foreign keys is used to access information at different levels of aggregation. Such information is part of the secondary dimensions and is stored in dedicated relations.
A fact is represented by a relation such that:
  The primary key of the relation is the set of primary keys imported from all and only the primary dimension tables.
  The attributes of the relation are the measures of the fact.
Advantage and Disadvantage:
  Denormalization is reduced.
  Less memory space is required.
  A lot of joins can be required if they involve attributes in secondary dimension tables.
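A snowflake layout can be sketched in the same way, again with Python's built-in sqlite3 module. As before, the scenario, table names (dim_category, dim_product, dim_date, fact_sales) and sample rows are hypothetical, chosen only to show a secondary dimension reached through a foreign key.

```python
import sqlite3

# Hypothetical sales example; every table, column and value is assumed.
snow_conn = sqlite3.connect(":memory:")
snow_conn.executescript("""
-- Secondary dimension: the category level gets its own relation.
CREATE TABLE dim_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
-- Primary dimension: attributes depend directly on the key; the higher
-- aggregation level is reached through a foreign key.
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    day     TEXT
);
-- Fact relation: keyed only by the primary dimension tables.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    revenue    REAL,
    PRIMARY KEY (product_id, date_id)
);
""")
snow_conn.executemany("INSERT INTO dim_category VALUES (?, ?)", [
    (1, "Stationery"), (2, "Lighting"),
])
snow_conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "Pen", 1), (2, "Notebook", 1), (3, "Lamp", 2),
])
snow_conn.executemany("INSERT INTO dim_date VALUES (?, ?)", [
    (1, "2014-08-26"), (2, "2014-08-27"),
])
snow_conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    (1, 1, 15.0), (2, 1, 12.0), (3, 2, 30.0), (1, 2, 7.5),
])

# The same per-category total now needs two joins, because the category
# name lives in a secondary dimension table.
snow_rows = snow_conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(snow_rows)
```

Each category name is stored once, which reduces redundancy, at the cost of the extra join in the query.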
2. Study of Statistical Package for Social Sciences (SPSS) tool

SPSS is a computer program used for survey authoring and deployment (IBM SPSS Data Collection), data mining (IBM SPSS Modeler), text analytics, statistical analysis, and collaboration and deployment (batch and automated scoring services). Statistics included in the base software:
Descriptive statistics: Cross tabulation, Frequencies, Descriptives, Explore, Descriptive Ratio Statistics
Bivariate statistics: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), Nonparametric tests
Prediction for numerical outcomes: Linear regression
Prediction for identifying groups: Factor analysis, cluster analysis (two-step, K-means, hierarchical), Discriminant
The many features of SPSS are accessible via pull-down menus or can be programmed with a proprietary 4GL command syntax language. Command syntax programming has the benefits of reproducibility, simplifying repetitive tasks, and handling complex data manipulations and analyses. Additionally, some complex applications can only be programmed in syntax and are not accessible through the menu structure. The pull-down menu interface also generates command syntax; this can be displayed in the output, although the default settings have to be changed to make the syntax visible to the user. The generated syntax can also be pasted into a syntax file using the "paste" button present in each menu. Programs can be run interactively or unattended, using the supplied Production Job Facility.

Additionally, a "macro" language can be used to write command language subroutines, and a Python programmability extension can access the information in the data dictionary and data and dynamically build command syntax programs. The Python programmability extension, introduced in SPSS 14, replaced the less functional SAX Basic "scripts" for most purposes, although SAX Basic remains available. In addition, the Python extension allows SPSS to run any of the statistics in the free software package R. From version 14 onwards, SPSS can be driven externally by a Python or a VB.NET program using supplied "plugins".

SPSS places constraints on internal file structure, data types, data processing and matching files, which together considerably simplify programming. SPSS datasets have a 2-dimensional table structure where the rows typically represent cases (such as individuals or households) and the columns represent measurements (such as age, sex or household income). Only 2 data types are defined: numeric and text (or "string"). All data processing occurs sequentially case-by-case through the file. Files can be matched one-to-one and one-to-many, but not many-to-many.
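The idea of a one-to-many match processed case by case can be illustrated in plain Python. This is not SPSS command syntax and does not use any SPSS API; the household and person records, and all field names, are hypothetical.

```python
# Hypothetical lookup table: one row per household, keyed by household id.
households = {
    101: {"income": 52000},
    102: {"income": 38000},
}

# Hypothetical case file: one row per person, several persons per household.
persons = [
    {"person_id": 1, "household_id": 101, "age": 34},
    {"person_id": 2, "household_id": 101, "age": 36},
    {"person_id": 3, "household_id": 102, "age": 58},
]

# One-to-many match: walk through the cases sequentially and attach the
# single matching household row to each person. A person with no matching
# household would simply gain no extra fields.
matched = []
for case in persons:
    household_row = households.get(case["household_id"], {})
    matched.append({**case, **household_row})

for row in matched:
    print(row)
```

Each household row is reused for every person in that household, while the number of output cases stays equal to the number of persons.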
The graphical user interface has two views which can be toggled by clicking on one of the two tabs in the bottom left of the SPSS window. The 'Data View' shows a spreadsheet view of the cases (rows) and variables (columns). Unlike spreadsheets, the data cells can only contain numbers or text and formulas cannot be stored in these cells. The 'Variable View' displays the metadata dictionary where each row represents a variable and shows the variable name, variable label, value label(s), print width, measurement type and a variety of other
characteristics. Cells in both views can be manually edited, defining the file structure and allowing data entry without using command syntax. This may be sufficient for small datasets. Larger datasets such as statistical surveys are more often created in data entry software, or entered during computer-assisted personal interviewing, by scanning and using optical character recognition and optical mark recognition software, or by direct capture from online questionnaires. These datasets are then read into SPSS. SPSS can read and write data from ASCII text files (including hierarchical files), other statistics packages, spreadsheets and databases. SPSS can read and write to external relational database tables via ODBC and SQL. Statistical output is to a proprietary file format (*.spv file, supporting pivot tables) for which, in addition to the in-package viewer, a stand-alone reader can be downloaded. The proprietary output can be exported to text or Microsoft Word, PDF, Excel, and other formats. Alternatively, output can be captured as data (using the OMS command), as text, tab-delimited text, PDF, XLS, HTML, XML, SPSS dataset or a variety of graphic image formats (JPEG, PNG, BMP and EMF).
Working Procedure
You always begin by defining a set of variables, and then you enter data for the variables to create a number of cases. For example, if you are doing an analysis of automobiles, each car in your study would be a case. The variables that define the cases could be things such as the year of manufacture, horsepower, and cubic inches of displacement. Each car in the study is defined as a single case, and each case is defined as a set of values assigned to the collection of variables. Every case has a value for each variable. Variables have types. That is, each variable is defined as containing a specific kind of number. For example, a scale variable is a numeric measure