
data warehousing:

A data warehouse is an electronic store of an organization's historical data, kept for the purpose of reporting, analysis and data mining (knowledge discovery). A data warehouse can also be used for data integration, master data management and similar purposes. According to Bill Inmon, a data warehouse should be subject-oriented, non-volatile, integrated and time-variant. A data warehouse helps to integrate data (see data integration) and store it historically, so that different aspects of the business, including performance, trends and predictions, can be analyzed over a given time frame and the results used to improve the efficiency of business processes. For a long time, and still today, data warehouses have been built to facilitate reporting on the key business processes of an organization, known as KPIs (key performance indicators). Data warehouses also help to integrate data from different sources and present a single point of truth for business measures. A data warehouse can further be used for data mining, which supports trend prediction, forecasting, pattern recognition and so on.

Data mining: Data mining is the process of discovering actionable information from large sets of data. It uses mathematical analysis to derive the patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data. These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific scenarios, such as:
i). Forecasting: estimating sales, predicting server loads or server downtime.
ii). Risk and probability: choosing the best customers for targeted mailings, determining the probable break-even point for risk scenarios, assigning probabilities to diagnoses or other outcomes.
iii). Recommendations: determining which products are likely to be sold together, generating recommendations.
iv). Finding sequences: analyzing customer selections in a shopping cart, predicting the next likely events.
v). Grouping: separating customers or events into clusters of related items, analyzing and predicting affinities.
Building a mining model is part of a larger process that includes everything from asking questions about the data and creating a model to answer those questions, to deploying the model into a working environment. This process can be defined using the following six basic steps (a minimal code sketch follows the list):
1. Defining the problem
2. Preparing data
3. Exploring data
4. Building models
5. Exploring and validating models
6. Deploying and updating models
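As an illustration only (not part of the original notes), the sketch below walks through the six steps on a toy classification problem. It assumes the third-party scikit-learn library is available; the feature names and values are invented for the example.

# Hypothetical walk-through of the six data mining steps using scikit-learn
# (assumed to be installed); data and feature names are made up.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Defining the problem: predict whether a customer will buy (1) or not (0).
# 2. Preparing data: features = [age, yearly_income_in_thousands].
X = [[25, 30], [40, 80], [35, 60], [50, 120], [23, 20], [45, 90], [31, 45], [52, 150]]
y = [0, 1, 1, 1, 0, 1, 0, 1]

# 3. Exploring data: a quick look at class balance.
print("positive rate:", sum(y) / len(y))

# 4. Building a model on a training split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# 5. Exploring and validating the model on held-out data.
print("test accuracy:", model.score(X_test, y_test))

# 6. Deploying and updating: in practice the fitted model would be serialized
#    and refreshed as new data arrives; here we just score a new customer.
print("prediction for [30, 55]:", model.predict([[30, 55]]))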

architecture of data warehouse:
The architecture for a data warehouse is indicated below. Before we proceed further, we should be clear about the concept of architecture: it only gives the major items that make up a data warehouse. The size and complexity of each of these items depend on the actual size of the warehouse itself, the specific requirements of the warehouse and the actual details of implementation.

Before looking into the details of each of the managers, we can get a broad idea of their functionality by mapping the processes studied in the previous chapter to the managers. The extract and load processes are taken care of by the load manager. The processes of clean-up and transformation of data, as well as backup and archiving, are the duties of the warehouse manager, while the query manager, as the name implies, takes care of query management.

architecture of query manager:
The query manager is responsible for directing queries to the suitable tables. By directing queries to the appropriate table, the query request and response process is sped up. The query manager is also responsible for scheduling the execution of the queries posed by users. The query manager typically includes the following (a small routing sketch follows the list):
i). Query redirection via C tool or RDBMS.
ii). Stored procedures.
iii). Query management tool.
iv). Query scheduling via C tool or RDBMS.
v). Query scheduling via third-party software.
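As a rough illustration of query redirection (not part of the original notes), the sketch below routes a query to a pre-aggregated summary table when the requested grain allows it, and otherwise to the detail fact table. The table names and grains are assumptions made for the example.

# Hypothetical query-redirection logic inside a query manager.
# Table names and grains are invented for illustration.

# Tables the query manager knows about, keyed by the grain they can answer.
ROUTES = {
    "month": "sales_summary_by_month",   # pre-aggregated summary table
    "day":   "sales_fact_daily",         # detail-level fact table
}

def redirect(requested_grain: str) -> str:
    """Pick the cheapest table that can still answer the requested grain."""
    # A monthly question can be answered from the monthly summary;
    # anything finer must fall back to the detail fact table.
    return ROUTES.get(requested_grain, ROUTES["day"])

def build_query(measure: str, grain: str) -> str:
    table = redirect(grain)
    return f"SELECT {grain}, SUM({measure}) FROM {table} GROUP BY {grain}"

if __name__ == "__main__":
    print(build_query("sales_amount", "month"))  # served from the summary table
    print(build_query("sales_amount", "day"))    # served from the detail table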

multidimensional schema:
Multidimensional structure is defined as a variation of the relational model that uses multidimensional structures to organize data and express the relationships between data.[8] The structure is broken into cubes, and the cubes are able to store and access data within the confines of each cube. Each cell within a multidimensional structure contains aggregated data related to elements along each of its dimensions.[9] Even when data is manipulated it remains easy to access and continues to constitute a compact database format, and the data remains interrelated. Multidimensional structure is quite popular for analytical databases that use online analytical processing (OLAP) applications.[10] Analytical databases use these structures because of their ability to deliver answers to complex business queries swiftly. Data can be viewed from different angles, which gives a broader perspective of a problem than other models.[11]
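As an illustration of the cube idea (not from the original text), the minimal sketch below stores cells keyed by their dimension members and aggregates along any dimension. Dimension and measure names are invented for the example.

# Minimal in-memory "cube": each cell is keyed by (product, region, month)
# and holds an aggregated sales value. All names and values are illustrative.
from collections import defaultdict

cells = {
    ("laptop", "east", "2015-01"): 120.0,
    ("laptop", "west", "2015-01"):  80.0,
    ("phone",  "east", "2015-01"): 200.0,
    ("phone",  "east", "2015-02"): 150.0,
}

DIMS = ("product", "region", "month")

def rollup(drop_dim: str) -> dict:
    """Aggregate the cube along one dimension, e.g. sum over all regions."""
    keep = [i for i, d in enumerate(DIMS) if d != drop_dim]
    out = defaultdict(float)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

# View the data "from a different angle": total sales per product per month.
print(rollup("region"))
# {('laptop', '2015-01'): 200.0, ('phone', '2015-01'): 200.0, ('phone', '2015-02'): 150.0}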

types of partitioning:
A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons. The partitioning can be done either by building separate smaller databases (each with its own tables, indices, and transaction logs), or by splitting selected elements, for example just one table.

Horizontal partitioning (see also shard) involves putting different rows into different tables. Perhaps customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest. The two partition tables are then CustomersEast and CustomersWest, and a view with a union might be created over both of them to provide a complete view of all customers (a sketch of this scheme follows below).

Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns.[1] Normalization also involves this splitting of columns across tables, but vertical partitioning goes beyond that and partitions columns even when they are already normalized. Different physical storage might also be used to realize vertical partitioning; storing infrequently used or very wide columns on a different device, for example, is a method of vertical partitioning. Done explicitly or implicitly, this type of partitioning is called "row splitting" (the row is split by its columns). A common form of vertical partitioning is to split dynamic data (slow to find) from static data (fast to find) in a table where the dynamic data is not used as often as the static. Creating a view across the two newly created tables restores the original table with a performance penalty; however, performance will increase when accessing the static data, e.g. for statistical analysis.
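The sketch below (illustrative only, using the CustomersEast/CustomersWest example above) shows horizontal partitioning by ZIP code with a UNION view over the two partitions. It uses Python's built-in sqlite3 module; SQLite has no native partitioning, so the partitions are just ordinary tables.

# Horizontal partitioning of customers by ZIP code, sketched with sqlite3.
# The 50000 boundary and table names come from the example in the text.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE CustomersEast (id INTEGER, name TEXT, zip INTEGER);
    CREATE TABLE CustomersWest (id INTEGER, name TEXT, zip INTEGER);
    CREATE VIEW Customers AS
        SELECT * FROM CustomersEast UNION ALL SELECT * FROM CustomersWest;
""")

def insert_customer(cid, name, zip_code):
    # Route the row to the correct partition table.
    table = "CustomersEast" if zip_code < 50000 else "CustomersWest"
    con.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (cid, name, zip_code))

insert_customer(1, "Asha", 10001)   # goes to CustomersEast
insert_customer(2, "Ravi", 90210)   # goes to CustomersWest
con.commit()

# The view presents all customers as a single logical table.
print(con.execute("SELECT * FROM Customers ORDER BY id").fetchall())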

design fact tables:
The basic steps are:
i). Identify a business process for analysis (like sales).
ii). Identify measures or facts (sales dollars), by asking questions like 'What number of XX is relevant for the business process?', replacing XX with various options that make sense within the context of the business.
iii). Identify dimensions for the facts (product dimension, location dimension, time dimension, organization dimension), by asking questions that make sense within the context of the business, like 'Analyse by XX', where XX is replaced with the subject to test.
iv). List the columns that describe each dimension (region name, branch name, business unit name).
v). Determine the lowest level (granularity) of summary in a fact table (e.g. sales dollars).
An alternative approach is the four-step design process described in Kimball. A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated; fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition; a common example is sales. Non-additive facts cannot be added at all; an example is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others; an example is inventory levels, where you cannot tell what a level means simply by looking at it.
Creating a new fact table: You must define a fact table for each star schema. From a modeling standpoint, the primary key of the fact table is usually a composite key that is made up of all of its foreign keys. A minimal schema sketch is given below.
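As a minimal sketch (not from the original notes), here is a star schema for the sales example above, with a fact table whose primary key is the composite of its dimension foreign keys. Table and column names are invented for illustration; sqlite3 is used only to show the DDL running.

# Illustrative star schema for a sales business process (names are made up).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, region_name  TEXT, branch_name TEXT);
    CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, full_date    TEXT);

    -- Fact table: a numeric, additive measure plus foreign keys to each dimension.
    -- The composite of the foreign keys serves as the primary key.
    CREATE TABLE fact_sales (
        product_id    INTEGER REFERENCES dim_product(product_id),
        location_id   INTEGER REFERENCES dim_location(location_id),
        time_id       INTEGER REFERENCES dim_time(time_id),
        sales_dollars REAL,
        PRIMARY KEY (product_id, location_id, time_id)
    );
""")
print("star schema created")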

horizontal partitioning:
Horizontal partitioning divides a table into multiple tables. Each table then contains the same number of columns, but fewer rows. For example, a table that contains 1 billion rows could be partitioned horizontally into 12 tables, with each smaller table representing one month of data for a specific year. Any query requiring data for a specific month references only the appropriate table. Determining how to partition the tables horizontally depends on how the data is analyzed. You should partition the tables so that queries reference as few tables as possible; otherwise, excessive UNION queries, used to merge the tables logically at query time, can affect performance. For more information about querying horizontally partitioned tables, see Scenarios for Using Views. Partitioning data horizontally based on age and use is common. For example, a table may contain data for the last five years, but only data from the current year is regularly accessed. In this case, you may consider partitioning the data into five tables, with each table containing data from only one year. A month-based sketch follows.
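The following sketch (illustrative, with invented table names) shows the month-per-table idea: rows are routed to one of twelve monthly tables, and a query for one month touches only that table.

# Hypothetical month-based horizontal partitioning; table names are invented.
import sqlite3

con = sqlite3.connect(":memory:")
for m in range(1, 13):
    con.execute(f"CREATE TABLE sales_2015_{m:02d} (sale_date TEXT, amount REAL)")

def table_for(sale_date: str) -> str:
    """Map an ISO date like '2015-03-17' to its monthly partition table."""
    month = int(sale_date[5:7])
    return f"sales_2015_{month:02d}"

def insert_sale(sale_date: str, amount: float) -> None:
    con.execute(f"INSERT INTO {table_for(sale_date)} VALUES (?, ?)", (sale_date, amount))

insert_sale("2015-03-17", 99.0)
insert_sale("2015-03-21", 25.0)
insert_sale("2015-07-04", 10.0)

# A March-only query references only the March partition.
total = con.execute("SELECT SUM(amount) FROM sales_2015_03").fetchone()[0]
print("March total:", total)   # 124.0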

concept of aggregation:
Aggregates are used in the dimensional models of the data warehouse to produce dramatic positive effects on the time it takes to query large sets of data. In its simplest form, an aggregate is a summary table that can be derived by performing a GROUP BY SQL query. A more common use of aggregates is to take a dimension and change its granularity. When the granularity of a dimension is changed, the fact table has to be partially summarized to fit the new grain of that dimension, creating new dimension and fact tables that fit this new level of grain. Aggregates are sometimes referred to as pre-calculated summary data, since aggregations are usually precomputed, partially summarized data stored in new aggregate tables. When facts are aggregated, it is done either by eliminating dimensionality or by associating the facts with a rolled-up dimension. Rolled-up dimensions should be shrunken versions of the dimensions associated with the granular base facts; this way, the aggregated dimension tables conform to the base dimension tables.[1] The reason aggregates can make such a dramatic increase in the performance of the data warehouse is the reduction in the number of rows to be accessed when responding to a query.[2]

Ralph Kimball, who is widely regarded as one of the original architects of data warehousing, says:[3] "The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains."

Having both aggregates and atomic data increases the complexity of the dimensional model. This complexity should be transparent to the users of the data warehouse, so when a request is made, the data warehouse should return data from the table with the correct grain. When requests to the data warehouse are made, aggregate navigator functionality should be implemented to help determine the correct table with the correct grain. The number of possible aggregations is determined by every possible combination of dimension granularities. Since building all possible aggregations would produce a lot of overhead, it is a good idea to choose a subset of tables on which to make aggregations. The best way to choose this subset and decide which aggregations to build is to monitor queries and design aggregations to match query patterns.[4] A small roll-up sketch is given below.
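To illustrate rolling a fact up to a coarser grain (not part of the original text), the sketch below aggregates day-level sales facts to a month grain by eliminating the day detail. All data and names are invented.

# Rolling up day-level facts to a month grain (illustrative data).
from collections import defaultdict

# Base fact rows at day grain: (day, product, sales_dollars)
base_facts = [
    ("2015-01-03", "laptop", 100.0),
    ("2015-01-19", "laptop",  60.0),
    ("2015-01-21", "phone",   40.0),
    ("2015-02-02", "laptop",  75.0),
]

# Aggregate fact at month grain: far fewer rows need to be read at query time.
monthly = defaultdict(float)
for day, product, dollars in base_facts:
    month = day[:7]                      # roll the time dimension up to month
    monthly[(month, product)] += dollars

for key, total in sorted(monthly.items()):
    print(key, total)
# ('2015-01', 'laptop') 160.0
# ('2015-01', 'phone') 40.0
# ('2015-02', 'laptop') 75.0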

design steps for summary tables:
Summary tables store data that is aggregated and/or summarized for performance reasons (i.e., to improve the performance of business queries). Most business queries (approximately 80%) will run against summary tables. Data is aggregated by combining multiple concepts together and/or combining large amounts of detailed data together. Most business queries analyze a summarization or aggregation of data (i.e., facts) across one or more dimensions; therefore, a summary table may use multiple dimensions. For example, a table that analyzes accounts by region by customer by service by month uses four dimensions.
Design considerations: The main objective when designing summary tables is to minimize the amount of data being accessed and the number of tables being joined. This is done by storing intermediate query results (see the sketch after this list), such as:
i). summaries of large amounts of data (e.g., summing product inventory by quarter),
ii). combinations of multiple concepts (e.g., sales by customer by market),
iii). reference data (e.g., product description).
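As an illustration (not from the original notes), the sketch below precomputes a summary table with a CREATE TABLE ... AS SELECT over a detail table and then answers a business query from the summary instead of the detail. Table and column names are invented; sqlite3 keeps the example self-contained.

# Building and querying a summary table (illustrative names and data).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE account_detail (region TEXT, customer TEXT, service TEXT,
                                 month TEXT, revenue REAL);
    INSERT INTO account_detail VALUES
        ('east', 'acme',   'broadband', '2015-01', 500.0),
        ('east', 'acme',   'voice',     '2015-01', 120.0),
        ('west', 'globex', 'broadband', '2015-01', 300.0);

    -- Summary at (region, month) grain: fewer rows, fewer joins at query time.
    CREATE TABLE revenue_by_region_month AS
        SELECT region, month, SUM(revenue) AS total_revenue
        FROM account_detail
        GROUP BY region, month;
""")

# The business query now reads the small summary table, not the detail table.
print(con.execute(
    "SELECT region, total_revenue FROM revenue_by_region_month WHERE month = '2015-01'"
).fetchall())
# east: 620.0, west: 300.0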

organization of the data mart:
A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse: whereas data warehouses have enterprise-wide depth, the information in a data mart pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software and data.[1] This enables each department to use, manipulate and develop its data any way it sees fit, without altering information inside other data marts or the data warehouse. In other deployments, where conformed dimensions are used, this business-unit ownership will not hold true for shared dimensions like customer, product, etc.

Organizations build data warehouses and data marts because the information in the operational database is not organized in a way that makes it easy for them to find what they need. Complicated queries might also take a long time to answer, since the database systems are designed to process millions of transactions per day. Transactional databases are designed to be updated, whereas data warehouses or marts are read-only and are designed to access large groups of related records. Data marts improve end-user response time by allowing users to access the specific type of data they need to view most often, providing the data in a way that supports the collective view of a group of users.

A data mart is basically a condensed and more focused version of a data warehouse that reflects the regulations and process specifications of each business unit within an organization. Each data mart is dedicated to a specific business function or region. This subset of data may span many or all of an enterprise's functional subject areas. It is common for multiple data marts to be used in order to serve the needs of each individual business unit (different data marts can be used to obtain specific information for various enterprise departments, such as accounting, marketing, sales, etc.). The related term spreadmart is a derogatory label describing the situation that occurs when one or more business analysts develop a system of linked spreadsheets to perform a business analysis, then grow it to a size and degree of complexity that makes it nearly impossible to maintain.

The primary use for a data mart is business intelligence (BI) applications. BI is used to gather, store, access and analyze data. The data mart can be used by smaller businesses to utilize the data they have accumulated. A data mart can be less expensive to implement than a data warehouse, making it more practical for the small business, and it can also be set up in much less time, often in under 90 days. Since most small businesses only have use for a small number of BI applications, the low cost and quick set-up of the data mart make it a suitable method for storing data.[2]

data management and query generation:
The official definition provided by DAMA International, the professional organization for those in the data management profession, is: "Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise." (DAMA International) This definition is fairly broad and encompasses a number of professions which may not have direct technical contact with lower-level aspects of data management, such as relational database management. Alternatively, the definition provided in the DAMA Data Management Body of Knowledge (DAMA-DMBOK) is: "Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."[1]

The concept of "data management" arose in the 1980s as technology moved from sequential processing (first cards, then tape) to random-access processing. Since it was now technically possible to store a single fact in a single place and access it using random-access disk, those suggesting that "data management" was more important than "process management" used arguments such as "a customer's home address is stored in 75 (or some other large number) places in our computer systems." During this period, random-access processing was not competitively fast, so those suggesting "process management" was more important than "data management" used batch processing time as their primary argument. As applications moved more and more into real-time, interactive applications, it became obvious to most practitioners that both management processes were important: if the data was not well defined, the data would be misused in applications; if the process was not well defined, it was impossible to meet user needs.

A given database management system may offer one or more mechanisms for returning the plan for a given query. Some packages feature tools which generate a graphical representation of a query plan. Other tools allow a special mode to be set on the connection to cause the DBMS to return a textual description of the query plan. Another mechanism for retrieving the query plan involves querying a virtual database table after executing the query to be examined. In Oracle, for instance, this can be achieved using the EXPLAIN PLAN statement. A small sketch of plan retrieval follows.
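As a hedged illustration of retrieving a query plan: the original text names Oracle's EXPLAIN PLAN, but the sketch below uses SQLite's EXPLAIN QUERY PLAN instead, since Python ships sqlite3 and the example stays self-contained. Table and index names are invented.

# Asking the DBMS for its query plan, sketched with SQLite's EXPLAIN QUERY PLAN.
# (Oracle would use EXPLAIN PLAN FOR ... and then read the plan table.)
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, zip INTEGER, name TEXT);
    CREATE INDEX idx_customers_zip ON customers(zip);
""")

# Each returned row describes one step of the plan the engine intends to use.
for row in con.execute("EXPLAIN QUERY PLAN SELECT name FROM customers WHERE zip = 10001"):
    print(row)
# Typically shows a search of customers using idx_customers_zip.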

kinds of data that can be mined:
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information, information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The different kinds of data that can be mined are listed below:
i). Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level.
ii). Relational databases: A relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships.
iii). Data warehouses: A data warehouse, as a storehouse, is a repository of data collected from multiple (often heterogeneous) data sources and is intended to be used as a whole under the same unified schema.
iv). Multimedia databases: Multimedia databases include video, image, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system.
v). Spatial databases: Spatial databases are databases that, in addition to the usual data, store geographical information like maps, and global or regional positioning.
vi). Time-series databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes creates the need for challenging real-time analysis.
vii). World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily.

categorize data mining systems:
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria, among which are the following:
a) Classification according to the type of data source mined: this classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, the World Wide Web, etc.
b) Classification according to the data model drawn on: this classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.
c) Classification according to the kind of knowledge discovered: this classification categorizes data mining systems based on the kind of knowledge discovered or the data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.
d) Classification according to the mining techniques used: data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented, etc.

data mining query language:
A data mining query language helps in effective knowledge discovery from data mining systems. Designing a comprehensive data mining language is challenging because data mining covers a wide spectrum of tasks, from data characterization to mining association rules, data classification and evolution analysis, and each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitations and underlying mechanisms of the various kinds of data mining tasks.

decision trees for data mining:
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they are simple and that decision trees represent rules. Rules can readily be expressed so that humans can understand them, or in a database access language like SQL, so that records falling into a particular category may be retrieved.

regression models:
Regression is a data mining (machine learning) technique used to fit an equation to a dataset. The simplest form of regression, linear regression, uses the formula of a straight line (y = mx + b) and determines the appropriate values for m and b to predict the value of y based upon a given value of x (a small fitting sketch follows below). Advanced techniques, such as multiple regression, allow the use of more than one input variable and allow for the fitting of more complex models, such as a quadratic equation.

architecture of load manager:
The load manager performs the operations required for the extract and load process. The size and complexity of the load manager varies between specific solutions, from data warehouse to data warehouse. The load manager performs the following functions: extract the data from the source system; fast-load the extracted data into a temporary data store; perform simple transformations into a structure similar to the one in the data warehouse.
Extract data from source: The data is extracted from the operational databases or from external information providers. Gateways are the application programs used to extract data; a gateway is supported by the underlying DBMS and allows a client program to generate SQL to be executed at a server. Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
Fast load: In order to minimize the total load window, the data needs to be loaded into the warehouse in the fastest possible time. The transformations affect the speed of data processing, so it is more effective to load the data into a relational database prior to applying transformations and checks. Gateway technology proves unsuitable here, since gateways tend not to perform well when large data volumes are involved.
Simple transformations: While loading, it may be required to perform simple transformations; after these have been completed we are in a position to do the complex checks. Suppose we are loading the EPOS sales transactions: we need to strip out all the columns that are not required within the warehouse, and convert all the values to the required data types.
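To make the linear regression description concrete (the sketch is mine, not part of the notes; the data values are invented), the code below determines m and b by ordinary least squares and then predicts y for a new x.

# Fitting y = m*x + b by ordinary least squares (illustrative data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# m = covariance(x, y) / variance(x); b makes the line pass through the means.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - m * mean_x

print(f"m = {m:.3f}, b = {b:.3f}")
print("prediction for x = 6:", m * 6 + b)   # extrapolate with the fitted line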

6. a) Give the syntax for task-relevant data specification. 7 Marks
The first step in defining a data mining task is the specification of the task-relevant data, that is, the data on which mining is to be performed. This involves specifying the database and tables or data warehouse containing the relevant data, conditions for selecting the relevant data, the relevant attributes or dimensions for exploration, and instructions regarding the ordering or grouping of the data retrieved. DMQL provides clauses for the specification of such information, as follows (an assembled example is sketched after the list):
i). use database (database_name) or use data warehouse (data_warehouse_name): the use clause directs the mining task to the database or data warehouse specified.
ii). from (relation(s)/cube(s)) [where (condition)]: the from and where clauses respectively specify the database tables or data cubes involved, and the conditions defining the data to be retrieved.
iii). in relevance to (attribute_or_dimension_list): this clause lists the attributes or dimensions for exploration.
iv). order by (order_list): the order by clause specifies the sorting order of the task-relevant data.
v). group by (grouping_list): the group by clause specifies criteria for grouping the data.
vi). having (conditions): the having clause specifies the conditions by which groups of data are considered relevant.
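For illustration only, the snippet below assembles a complete DMQL-style specification from the clauses listed above; the database, table and attribute names and the selection condition are all hypothetical.

# A task-relevant data specification assembled from the DMQL clauses above.
# Every identifier (AllElectronics_db, sales, age, income, ...) is invented.
dmql_query = "\n".join([
    "use database AllElectronics_db",              # i). which database to mine
    "in relevance to age, income, buys_computer",  # iii). attributes to explore
    "from sales",                                  # ii). table involved
    "where purchase_date >= '2015-01-01'",         # ii). data-selection condition
    "order by income",                             # iv). ordering of retrieved data
    "group by region",                             # v). grouping criterion
    "having count(*) > 100",                       # vi). keep only sizeable groups
])
print(dmql_query)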

b) Give the various functional components for designing a GUI based on a data mining query language. 8 Marks
A data mining query language provides the necessary primitives that allow users to communicate with data mining systems. Novice users, however, may find a data mining query language difficult to use and the syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a graphical user interface (GUI). In relational database technology, SQL serves as a standard core language for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query language may serve as a core language for data mining system implementations, providing a basis for the development of GUIs for effective data mining. A data mining GUI may consist of the following functional components:
a) Data collection and data mining query composition: this component allows the user to specify task-relevant data sets and to compose data mining queries. It is similar to GUIs used for the specification of relational queries.
b) Presentation of discovered patterns: this component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves and other visualization techniques.
c) Hierarchy specification and manipulation: this component allows for concept hierarchy specification, either manually by the user or automatically. In addition, this component should allow concept hierarchies to be modified by the user or adjusted automatically based on a given data set distribution.
d) Manipulation of data mining primitives: this component may allow the dynamic adjustment of data mining thresholds, as well as the selection, display and modification of concept hierarchies. It may also allow the modification of previous data mining queries or conditions.
e) Interactive multilevel mining: this component should allow roll-up or drill-down operations on discovered patterns.
f) Other miscellaneous information: this component may include on-line help manuals, indexed search, debugging and other interactive graphical facilities.

7. a) Give the meaning of the following: 12 Marks
i) No-coupling: in this architecture, the data mining system does not utilize any functionality of a database or data warehouse system. A no-coupling data mining system retrieves data from particular data sources such as a file system, processes the data using the major data mining algorithms and stores the results back into the file system. The no-coupling architecture does not take any advantage of a database or data warehouse, which is already very efficient at organizing, storing, accessing and retrieving data. It is considered a poor architecture for a data mining system; however, it is used for simple data mining processes.
ii) Loose coupling: in this architecture, the data mining system uses a database or data warehouse for data retrieval. In loose-coupling data mining architecture, the data mining system retrieves data from the database or data warehouse, processes the data using data mining algorithms and stores the results in those systems. This architecture is mainly for memory-based data mining systems that do not require high scalability and high performance.
iii) Semi-tight coupling: in semi-tight-coupling data mining architecture, besides linking to the database or data warehouse system, the data mining system uses several features of the database or data warehouse system to perform some data mining tasks, including sorting, indexing, aggregation, etc. In this architecture, some intermediate results can be stored in the database or data warehouse system for better performance.
iv) Tight coupling: in tight-coupling data mining architecture, the database or data warehouse is treated as the information retrieval component of the data mining system through integration. All the features of the database or data warehouse are used to perform data mining tasks. This architecture provides system scalability, high performance and integrated information.

iii) K-means algorithm.
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets ($k \le n$), $S = \{S_1, S_2, \ldots, S_k\}$, so as to minimize the within-cluster sum of squares (WCSS):

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where $\mu_i$ is the mean of the points in $S_i$. A small implementation sketch follows.
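As an illustrative sketch (not part of the original notes), the code below implements plain k-means with Lloyd's algorithm on a tiny 2-D dataset; the data and the value of k are invented.

# Minimal k-means (Lloyd's algorithm) on toy 2-D points; k and data invented.
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # start from k random points
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each mean from its assigned points.
        new_centroids = []
        for c, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(col) / len(cluster) for col in zip(*cluster)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's mean
        if new_centroids == centroids:               # converged
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7.5)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)   # two means, one near (1.2, 1.2) and one near (8.5, 8.2)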

4) a) Explain in detail the design of starflake schema. 7 Marks
A starflake schema is a combination of a star schema and a snowflake schema. Starflake schemas are snowflake schemas in which only some of the dimension tables have been denormalized. Starflake schemas aim to leverage the benefits of both star schemas and snowflake schemas: the hierarchies of star schemas are denormalized, while the hierarchies of snowflake schemas are normalized. Starflake schemas are normalized to remove any redundancies in the dimensions; to normalize the schema, the shared dimensional hierarchies are placed in outriggers. A schema sketch follows.
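As a rough sketch (names invented, using sqlite3 only to show the DDL), here a starflake schema keeps one dimension denormalized, star style, while a shared hierarchy of another dimension is normalized into an outrigger, snowflake style.

# Illustrative starflake schema: one flat (denormalized) dimension and one
# dimension whose shared geography hierarchy sits in an outrigger table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Star-style dimension: the product hierarchy is kept denormalized.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT, category TEXT, department TEXT
    );

    -- Snowflake-style dimension: the shared region hierarchy is normalized
    -- out into an outrigger so several dimensions could reuse it.
    CREATE TABLE outrigger_region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE dim_store (
        store_id INTEGER PRIMARY KEY,
        store_name TEXT,
        region_id INTEGER REFERENCES outrigger_region(region_id)
    );

    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        sales_dollars REAL
    );
""")
print("starflake schema created")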

b) Explain in detail hardware partitioning. 8 Marks
Data warehouses often contain very large tables and require techniques both for managing these large tables and for providing good query performance across them. An important tool for achieving this, as well as for enhancing data access and improving overall application performance, is partitioning. Partitioning offers support for very large tables and indexes by letting you decompose them into smaller and more manageable pieces called partitions. This support is especially important for applications that access tables and indexes with millions of rows and many gigabytes of data. Partitioned tables and indexes facilitate administrative operations by enabling these operations to work on subsets of data. For example, you can add a new partition, organize an existing partition, or drop a partition with minimal to zero interruption to a read-only application.

Partitioning can help you tune SQL statements to avoid unnecessary index and table scans (using partition pruning). It also enables you to improve the performance of massive join operations when large amounts of data (for example, several million rows) are joined together, by using partition-wise joins. Finally, partitioning data greatly improves the manageability of very large databases and dramatically reduces the time required for administrative tasks such as backup and restore. Granularity in a partitioning scheme can be easily changed by splitting or merging partitions. Thus, if a table's data is skewed to fill some partitions more than others, the ones that contain more data can be split to achieve a more even distribution. Partitioning also enables you to swap partitions with a table. By being able to easily add, remove, or swap a large amount of data quickly, swapping can be used to keep a large amount of data that is being loaded inaccessible until loading is completed, or can be used as a way to stage data between different phases of use. Some examples are the current day's transactions or online archives. A good starting point for considering partitioning strategies is to use the partitioning advice within the SQL Access Advisor, part of the Tuning Pack. The SQL Access Advisor offers both graphical and command-line interfaces.

b) Explain in detail back propagation neural network. 8 Marks
One of the most popular neural network algorithms is the back propagation (BP) algorithm. Rojas [2005] claimed that the BP algorithm can be broken down into four main steps. After choosing the weights of the network randomly, the back propagation algorithm is used to compute the necessary corrections. The algorithm can be decomposed into the following four steps (a minimal sketch appears after this answer):
i) Feed-forward computation
ii) Back propagation to the output layer
iii) Back propagation to the hidden layer
iv) Weight updates
The algorithm is stopped when the value of the error function has become sufficiently small. This is a very rough and basic formulation of the BP algorithm. There are some variations proposed by other scientists, but Rojas' definition seems to be quite accurate and easy to follow. The last step, weight updates, happens throughout the algorithm.

Clustering: Clustering is the process of making groups of abstract objects into classes of similar objects. Points to remember: a cluster of data objects can be treated as one group; while doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups. The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
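The following minimal sketch (mine, not from the notes) trains a one-hidden-layer network with the four BP steps on a toy XOR problem. It assumes the third-party numpy package; the layer sizes, learning rate and epoch count are arbitrary choices for the example.

# Minimal one-hidden-layer network trained with the four backprop steps:
# feed-forward, backprop to output layer, backprop to hidden layer, weight update.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights chosen randomly, as the text describes; biases start at zero.
W1 = rng.normal(size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    # i) Feed-forward computation
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # ii) Back propagation to the output layer (squared-error derivative)
    delta_out = (out - y) * out * (1 - out)
    # iii) Back propagation to the hidden layer
    delta_h = (delta_out @ W2.T) * h * (1 - h)
    # iv) Weight updates (gradient descent step, including the biases)
    W2 -= lr * (h.T @ delta_out)
    b2 -= lr * delta_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ delta_h)
    b1 -= lr * delta_h.sum(axis=0, keepdims=True)

# With enough epochs the outputs approach [[0], [1], [1], [0]];
# exact values depend on the random initialization.
print(np.round(out, 2))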

iii) KDD environment.
It is customary in the computer industry to formulate rules of thumb that help information technology (IT) specialists to apply new developments. In setting up a reliable data mining environment we may follow these guidelines so that the KDD system works in the manner we desire:
i). Support extremely large data sets
ii). Support hybrid learning
iii). Establish a data warehouse
iv). Introduce data cleaning facilities
v). Facilitate working with dynamic coding
vi). Integrate with decision support systems
vii). Choose an extendible architecture
viii). Support heterogeneous databases
ix). Introduce client/server architecture
x). Introduce cache optimization

important functions of the warehouse manager:
i) Analyze the data to confirm data consistency and data integrity.
ii) Transform and merge the source data from the temporary data store into the warehouse.
iii) Create indexes, cross-references, partition views, etc.
iv) Check for normalization.
v) Generate new aggregations, if needed.
vi) Update all existing aggregations.
vii) Create backups of data.
viii) Archive the data that needs to be archived.
In vertical partitioning, some columns are stored in one partition and certain other columns of the same row in a different partition. This can again be achieved either by normalization or by row splitting; we will look into their relative trade-offs. A small row-splitting sketch follows.
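As an illustration of row splitting (mine, not from the notes; table and column names are invented), the wide customer table below is split into a static partition and a rarely used dynamic partition that share the same key, and a view rebuilds the original row on demand.

# Vertical partitioning by row splitting, sketched with sqlite3.
# Static columns and the wide/dynamic columns live in separate tables
# that share the customer_id key; a view joins them back together.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer_static (customer_id INTEGER PRIMARY KEY,
                                  name TEXT, zip INTEGER);
    CREATE TABLE customer_dynamic (customer_id INTEGER PRIMARY KEY,
                                   last_login TEXT, notes TEXT);
    CREATE VIEW customer AS
        SELECT s.customer_id, s.name, s.zip, d.last_login, d.notes
        FROM customer_static s JOIN customer_dynamic d USING (customer_id);

    INSERT INTO customer_static  VALUES (1, 'Asha', 10001);
    INSERT INTO customer_dynamic VALUES (1, '2015-11-01', 'prefers email');
""")

# Queries over static data touch only the narrow, fast table,
print(con.execute("SELECT name FROM customer_static WHERE zip = 10001").fetchall())
# while the view restores the full original row when it is needed.
print(con.execute("SELECT * FROM customer WHERE customer_id = 1").fetchall())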