MC0088 Internal Assignment (SMU)



MC0088 - DATA WAREHOUSING & DATA MINING

Que.1 Differentiate between Data Mining and Data Warehousing?

Ans: -

Data Mining: - Data mining refers to a class of database applications that look for hidden patterns in a group of data. For example, data mining software can help retail companies find customers with common interests. The term is commonly misused to describe software that merely presents data in new ways; true data mining software does not just change the presentation, it actually discovers previously unknown relationships among the data. Data mining comprises many up-to-date techniques, such as classification (decision trees, naïve Bayes classifier, k-nearest neighbor, and neural networks), clustering (k-means, hierarchical clustering, and density-based clustering), and association (one-dimensional, multidimensional, multilevel, and constraint-based association). Years of practice show that data mining is a process, and its successful application requires data preprocessing (dimensionality reduction, cleaning, noise/outlier removal), post-processing (understandability, summarization, presentation), a good understanding of the problem domain, and domain expertise.

Data Warehousing: - The construction of a data warehouse, which involves data cleaning and data integration, can be viewed as an important preprocessing step for data mining. Moreover, data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data at varied granularities, which facilitates effective data mining. Furthermore, many data mining functions, such as classification, prediction, association, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and on-line analytical processing, and it provides an effective platform for data mining; an overview of data warehouse technology is therefore essential for understanding data mining technology. Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. A data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years.
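As a minimal illustration of one of the classification techniques named above (k-nearest neighbor), the hedged Python sketch below classifies a new point by majority vote among its k closest labelled points. The toy customer data, features, and labels are assumptions made purely for illustration.

```python
# Minimal k-nearest-neighbor sketch (toy data and labels are hypothetical).
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    # train: list of (feature_vector, label) pairs; query: feature vector to classify.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical example: classify a customer by (age, monthly_spend).
train = [((25, 200), "low-value"), ((30, 250), "low-value"),
         ((45, 900), "high-value"), ((50, 1100), "high-value")]
print(knn_classify(train, (40, 800), k=3))   # expected: "high-value"
```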


Que.2 Describe the key features of a Data Warehouse? Ans: - According to W. H. Inmon, a leading architect in the construction of data warehouse systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."

Key features of a Data Warehouse

1) Subject-oriented 2) Integrated 3) Time-variant 4) Nonvolatile

Subject-oriented: - A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence, data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.

Integrated: - A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

Time-variant: - Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.

Nonvolatile: - A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Owing to this separation, a data warehouse does not require transaction processing, recovery, or concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

The traditional database approach to heterogeneous database integration is to build wrappers and integrators (or mediators) on top of multiple, heterogeneous databases (examples include IBM DataJoiner and Informix DataBlade). When a query is posed at a client site, a metadata dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved. These queries are then mapped and sent to local query processors, and the results returned from the different sites are integrated into a global answer set.
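The following is a minimal, hedged Python sketch of the mediator idea described above: a metadata dictionary maps each global attribute name to the local name used by each source, the mediator rewrites the query per source, and the local results are merged into a global answer set. The source names, schemas, and data here are hypothetical and are not taken from any real product.

```python
# Illustrative mediator sketch (hypothetical sources and schemas, not a real product API).

# Metadata dictionary: global attribute name -> local attribute name, per source.
METADATA = {
    "sales_db":    {"customer": "cust_name", "amount": "sale_amt"},
    "legacy_file": {"customer": "CUSTOMER",  "amount": "TOTAL"},
}

# Toy "local databases": each holds rows keyed by its own local attribute names.
SOURCES = {
    "sales_db":    [{"cust_name": "Asha", "sale_amt": 120}],
    "legacy_file": [{"CUSTOMER": "Ravi", "TOTAL": 300}],
}

def translate(global_attrs, site):
    # Rewrite the global attribute list into the site's local vocabulary.
    return [METADATA[site][a] for a in global_attrs]

def mediate(global_attrs):
    # Send the translated query to each local "processor" and merge the answers
    # back into the global vocabulary.
    answer = []
    for site, rows in SOURCES.items():
        local_attrs = translate(global_attrs, site)
        for row in rows:
            answer.append({g: row[l] for g, l in zip(global_attrs, local_attrs)})
    return answer

print(mediate(["customer", "amount"]))
# [{'customer': 'Asha', 'amount': 120}, {'customer': 'Ravi', 'amount': 300}]
```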


Que. 3 Differentiate between Data Integration and Transformation? Ans: - Data Integration: - Data integration is one of the steps of data preprocessing. It involves combining data residing in different sources and providing users with a unified view of these data. It merges data from multiple data stores (data sources) and appears in areas such as the following: -

1) Data Migration 2) Data Synchronization 3) ETL 4) Business Intelligence 5) Master Data Management

Data Migration: - Data migration is the process of transferring data from one system to another while changing the storage, database, or application.

Data Synchronization: - Data synchronization is the process of establishing consistency among systems, with subsequent continuous updates to maintain that consistency.

ETL: - ETL comes from data warehousing and stands for Extract-Transform-Load. It covers the process by which data are loaded from the source systems into the data warehouse (a minimal sketch follows below).

Business Intelligence: - Business Intelligence (BI) is a set of tools supporting the transformation of raw data into useful information that can support decision making.

Master Data Management: - Master Data Management (MDM) represents a set of tools and processes used by an enterprise to consistently manage its non-transactional data.
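The sketch below walks through the Extract-Transform-Load steps end to end in plain Python. The CSV file name, column names, and target table are assumptions chosen only for illustration; a real ETL job would be driven by the warehouse's actual source systems and schema.

```python
# Minimal ETL sketch (file name, columns, and target table are hypothetical).
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and reshape each row for the warehouse schema.
    return [
        {"customer": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("amount")          # drop rows with a missing amount
    ]

def load(rows, db_path="warehouse.db"):
    # Load: append the transformed rows into a warehouse table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:customer, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("daily_sales.csv")))
```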


Transformation: - Data transformation is the process of converting data from one format (e.g., a database file, XML document, or Excel sheet) to another. Because data often reside in different locations and formats across the enterprise, data transformation is necessary to ensure that data from one application or database are intelligible to other applications and databases, a critical requirement for application integration.

In a typical scenario where information needs to be shared, data are extracted from the source application or data warehouse, transformed into another format, and then loaded into the target location. Extraction, transformation, and loading (together known as ETL) are the central processes of data integration. Depending on the nature of the integration scenario, data may need to be merged, aggregated, enriched, summarized, or filtered.

The first step of data transformation is data mapping. Data mapping determines the relationship between the data elements of two applications and establishes instructions for how the data from the source application are transformed before being loaded into the target application. In other words, data mapping produces the critical metadata that is needed before the actual data conversion takes place.
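As a small illustration of the data-mapping step described above, the hedged Python sketch below uses a mapping table to rename and convert fields from a hypothetical source record into a hypothetical target schema. The field names and conversion functions are assumptions, not a prescribed format.

```python
# Data-mapping sketch: source and target field names are hypothetical.

# Mapping table: target field -> (source field, conversion function).
MAPPING = {
    "customer_name": ("CUST_NM", str.strip),
    "order_total":   ("ORD_AMT", float),
    "order_date":    ("ORD_DT",  lambda s: s.replace("/", "-")),
}

def apply_mapping(source_record):
    # Produce a target-schema record by renaming and converting each mapped field.
    return {target: convert(source_record[src])
            for target, (src, convert) in MAPPING.items()}

source = {"CUST_NM": "  Meena ", "ORD_AMT": "450.50", "ORD_DT": "2014/07/16"}
print(apply_mapping(source))
# {'customer_name': 'Meena', 'order_total': 450.5, 'order_date': '2014-07-16'}
```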


Que. 4 Differentiate between database management systems (DBMS) and data mining? Ans: - DBMS: - A Database Management System (DBMS) is the software that manages data on physical storage devices. Data Mining: - Data mining is the process of discovering relationships among data in the database. The differences are summarized below:

Area                DBMS                                                  Data mining
Task                Extraction of detailed and summary data               Knowledge discovery of hidden patterns and insights
Type of result      Information                                           Insight and prediction
Method              Deduction (ask the question, verify the data)         Induction (build the model, apply it to new data, get the result)
Example question    Who purchased mutual funds in the last 3 years?       Who will buy a mutual fund in the next 6 months, and why?

Data mining is concerned with finding hidden relationships present in business data to allow businesses to make predictions for future use. It is the process of data-driven extraction of not-so-obvious but useful information from large databases. The aim of data mining is to extract implicit, previously unknown, and potentially useful (or actionable) patterns from data. Data mining comprises many up-to-date techniques, such as classification (decision trees, naïve Bayes classifier, k-nearest neighbor, and neural networks), clustering (k-means, hierarchical clustering, and density-based clustering), and association (one-dimensional, multidimensional, multilevel, and constraint-based association). Data warehousing, by contrast, is defined as a process of centralized data management and retrieval. A data warehouse is an enabled relational database system designed to support very large databases (VLDB) at a significantly higher level of performance and manageability. A data warehouse is an environment, not a product: it is an architectural construct of information that is hard to access or present in traditional operational data stores.
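The deduction-versus-induction contrast in the table above can be made concrete with a short, hedged sketch. The table name, columns, customer data, and the crude threshold "model" below are all hypothetical and serve only to show the difference in the two ways of asking questions of data.

```python
# Illustrative contrast between DBMS-style deduction and data-mining-style induction.
# The table, column names, and customer data are hypothetical.
import sqlite3

# Toy customer table: (age, income, bought_fund) with bought_fund in {0, 1}.
rows = [(25, 30000, 0), (32, 45000, 0), (41, 70000, 1), (55, 90000, 1), (48, 80000, 1)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (age INT, income INT, bought_fund INT)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# Deduction (DBMS): ask a precise question about recorded facts.
past_buyers = con.execute(
    "SELECT COUNT(*) FROM customers WHERE bought_fund = 1").fetchone()[0]
print("Customers who already bought a fund:", past_buyers)

# Induction (data mining): build a crude model from the data and apply it to a new case.
# Here the "model" is just the average income of past buyers, used as a threshold.
threshold = sum(r[1] for r in rows if r[2] == 1) / past_buyers
new_customer = (38, 85000)                      # (age, income) of a prospective customer
print("Likely to buy a fund:", new_customer[1] >= threshold)
```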


Que. 5 Differentiate between K-means and Hierarchical clustering?

Ans: -

K-means clustering: - The k-means algorithm assigns each point to the cluster whose center (also called the centroid) is nearest. The center is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster. Example: the data set has three dimensions and a cluster has two points, X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z = (z1, z2, z3), where zi = (xi + yi)/2 for i = 1, 2, 3. The algorithm steps are as under: - 1) Choose the number of clusters, k. 2) Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers. 3) Assign each point to the nearest cluster center, where "nearest" is defined with respect to a chosen distance measure (e.g., Euclidean distance). 4) Recompute the new cluster centers. 5) Repeat the two previous steps until some convergence criterion is met (usually that the assignment has not changed). The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments (a minimal sketch of the algorithm follows this answer).

Hierarchical clustering: - Hierarchical clustering creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the tree consists of a single cluster containing all observations, and the leaves correspond to individual observations. Algorithms for hierarchical clustering are generally either agglomerative, in which one starts at the leaves and successively merges clusters together, or divisive, in which one starts at the root and recursively splits the clusters. Any non-negative-valued function may be used as a measure of similarity between pairs of observations. The choice of which clusters to merge or split is determined by a linkage criterion, which is a function of the pairwise distances between observations. Cutting the tree at a given height gives a clustering at a selected precision. For example, with six elements {a} {b} {c} {d} {e} and {f}, cutting after the second row of merges yields the clusters {a} {b c} {d e} {f}, while cutting after the third row yields {a} {b c} {d e f}, which is a coarser clustering with a smaller number of larger clusters. The agglomerative method builds the hierarchy from the individual elements by progressively merging clusters; the first step is to determine which elements to merge into a cluster.
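The following hedged Python sketch implements the k-means steps listed above on a small set of two-dimensional toy points; the points, the value of k, and the convergence test are assumptions chosen only for illustration. An analogous agglomerative sketch would repeatedly merge the two closest clusters under a chosen linkage criterion until a single root cluster remains.

```python
# Minimal k-means sketch (toy 2-D points and k chosen only for illustration).
import random
import math

def distance(a, b):
    # Euclidean distance between two 2-D points.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def kmeans(points, k, max_iter=100):
    # Steps 1-2: pick k random points as the initial cluster centers.
    centers = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign each point to the nearest center.
        new_assignment = [min(range(k), key=lambda c: distance(p, centers[c]))
                          for p in points]
        # Step 5: stop when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, assignment

points = [(1, 1), (1.5, 2), (1, 1.8), (8, 8), (9, 8.5), (8.5, 9)]
centers, labels = kmeans(points, k=2)
print(centers)   # roughly one center near (1.2, 1.6) and one near (8.5, 8.5)
print(labels)
```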


Que. 6 Differentiate between Web content mining and Web usage mining?

Ans: -

Web Content Mining: - Web content mining targets knowledge discovery in which the main objects are the traditional collections of multimedia documents, such as images, video, and audio, that are embedded in or linked to web pages. It is quite different from data mining because web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data. Web content mining is also different from text mining because of the semi-structured nature of the Web, whereas text mining focuses on unstructured texts. Web content mining thus requires creative applications of data mining and/or text mining techniques as well as its own unique approaches. In the past few years there has been a rapid expansion of activity in the web content mining area, which is not surprising given the phenomenal growth of web content and the significant economic benefit of such mining. However, due to the heterogeneity and lack of structure of web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. Web content mining can be differentiated from two points of view:

1) Agent-based approach 2) Database approach.

The first approach aims at improving information finding and filtering. The second approach aims at modeling the data on the Web in a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it.

Web Usage Mining: - Web usage mining focuses on techniques that can predict the behavior of users while they are interacting with the WWW. Web usage mining discovers user navigation patterns from web data; it tries to extract useful information from the secondary data derived from the interactions of users while surfing the Web. Several research projects and commercial tools analyze those patterns for different purposes. The resulting knowledge can be utilized in personalization, system improvement, site modification, business intelligence, and usage characterization. Often the only information left behind by users visiting a web site is the path through the pages they have accessed. Most web information retrieval tools use only the textual information and ignore the link information, which can be very valuable. In general, there are mainly four kinds of data mining techniques applied to the web mining domain to discover user navigation patterns (a small sketch of the simplest case follows the list below):

1) Association rule mining 2) Sequential pattern mining 3) Clustering 4) Classification
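As a hedged illustration of the simplest kind of usage mining, the sketch below counts page-to-page transitions in a hypothetical click-stream grouped by user session. The sessions and page names are assumptions; real web usage mining would start from actual server access logs.

```python
# Navigation-pattern sketch: count page-to-page transitions per user session.
# The sessions and page names below are hypothetical.
from collections import Counter

sessions = {
    "user1": ["/home", "/products", "/cart", "/checkout"],
    "user2": ["/home", "/products", "/products/tv", "/cart"],
    "user3": ["/home", "/about"],
}

transitions = Counter()
for pages in sessions.values():
    # Each consecutive pair of pages in a session is one observed transition.
    transitions.update(zip(pages, pages[1:]))

# The most frequent transitions hint at common navigation patterns.
for (src, dst), count in transitions.most_common(3):
    print(f"{src} -> {dst}: {count}")
```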