Data mining with SAS Enterprise Miner


Post on 10-May-2015






<ul><li>DATA MINING USING SAS ENTERPRISE MINER. MAHESH BOMMIREDDY, CHAITHANYA KADIYALA.</li></ul>
<p>ABSTRACT: Data mining combines data analysis techniques with high-end technology for use within a process. The primary goal of data mining is to develop usable knowledge regarding future events. This paper defines the steps in the data mining process, explains the importance of the steps, and shows how the steps were used in two case studies involving fraud detection.</p>
<p>The steps in the data mining process are:</p>
<ul><li>problem definition</li><li>data collection and enhancement</li><li>modeling strategies</li><li>training, validation, and testing of models</li><li>analyzing results</li><li>modeling iterations</li><li>implementing results</li></ul>
<p>What Is the Value of Data Mining?</p>
<p>The goal of data mining is to produce new knowledge that decision-makers can act upon. It does this by using sophisticated techniques such as artificial intelligence to build a model of the real world based on data collected from a variety of sources, including corporate transactions, customer histories and demographics, and external sources such as credit bureaus. This model reveals patterns in the information that can support decision making and predict new business opportunities.</p>
<p>Data mining applications reach across industries and business functions. For example, telecommunications companies, stock exchanges, and credit card and insurance companies use data mining to detect fraudulent use of their services; the medical industry uses data mining to predict the effectiveness of surgical procedures, medical tests, and medications; and retailers use data mining to assess the effectiveness of coupons and special events.</p>
<p>For companies that use data mining effectively, the payoffs can be huge. But often, the data mining process can be unwieldy and inefficient. Many of the so-called data mining solutions currently on the market either do not integrate well, are not scalable, or are limited to one or two modeling techniques or algorithms.
As a result, highly trained quantitative experts spend more time trying to access, prepare, and manipulate data from disparate sources and less time modeling data and applying their expertise to solve business problems. The challenge is compounded further as the amount of data and the complexity of the business problem increase. Enter the SAS Data Mining Solution.</p>
<p>The SAS Data Mining Solution</p>
<p>SAS Institute, the acknowledged industry leader in analytical and decision support solutions, offers a comprehensive data mining solution that includes software and services that allow you to explore large quantities of data and discover relationships and patterns that lead to proactive decision making. The SAS Data Mining Solution provides business technologists and quantitative experts with an easily maintained method for helping their organizations achieve a competitive advantage.</p>
<p>SAS Data Mining helps with:</p>
<ul><li>Customer retention (keeping existing customers)</li><li>Customer acquisition (finding new customers)</li><li>Cross selling (selling customers more products based on what they have already bought)</li><li>Upgrading (selling customers a higher level of service or product, such as a gold credit card versus a regular credit card)</li><li>Fraud detection (determining if a particular transaction is out of the normal range of a person's activity and flagging that transaction for verification)</li><li>Market-basket analysis (determining what combinations of products are purchased at a given time)</li></ul>
<p>Enterprise Miner, SAS Institute's enhanced data mining software, offers an integrated environment for businesses that need to conduct comprehensive analyses of customer data.
Enterprise Miner combines a rich suite of integrated data mining tools with ease of use, enabling users to explore and exploit corporate data for strategic business advantage. In a single environment, Enterprise Miner provides all the tools needed to match robust data mining techniques to specific business problems, regardless of the amount or source of the data or the complexity of the business problem.</p>
<p>SAS Data Mining Solution Components</p>
<ul><li>Enterprise Miner</li><li>Customer Relationship Management</li></ul>
<p>A Map to the SAS Solution for Data Mining</p>
<p>Data mining is a process, not just a series of statistical analyses. Simply applying disparate software tools to a data mining project can take one only so far. Instead, what is needed to plan, implement, and successfully refine a data mining project is an integrated software solution, one that encompasses all steps of the process, beginning with the sampling of data, through sophisticated data analyses and modeling, to the dissemination of the resulting business-critical information. In addition, the ideal solution should be intuitive and flexible enough that users with different degrees of statistical expertise can understand and use it.</p>
<p>To accomplish all this, the data mining solution must provide:</p>
<ul><li>advanced, yet easy-to-use, statistical analyses and reporting techniques</li><li>a guiding, yet flexible, methodology</li><li>client/server enablement</li></ul>
<p>SAS Enterprise Miner software is that solution. It combines the statistical analysis and reporting system of SAS Institute with an easy-to-use graphical user interface (GUI) that can be understood and used by business analysts as well as quantitative experts. The components of the GUI can be used to implement a data mining methodology developed by the Institute. However, the methodology does not dictate steps within a project.
Instead, the SAS data mining methodology provides a logical framework that enables business analysts and quantitative experts to meet the goals of their data mining projects by selecting components of the GUI as needed. The GUI also contains components that help administrators set up and maintain the client/server deployment of Enterprise Miner software. In addition, as a software solution from SAS Institute, Enterprise Miner fully integrates with the rest of the SAS System, including SAS/Warehouse Administrator software, the SAS solution for online analytical processing (OLAP), and SAS/IntrNet software, which enables applications deployment via intranets and the World Wide Web.</p>
<p>This paper discusses three main features of Enterprise Miner software (the GUI, the SEMMA methodology, and client/server enablement) and maps the components of the solution to those features.</p>
<p>The Graphical User Interface</p>
<p>Enterprise Miner employs a single GUI to give users all the functionality needed to uncover valuable information hidden in their volumes of data. With one point-and-click interface, users can perform the entire data mining process, from sampling of data, through exploration and modification, to modeling, assessment, and scoring of new data for use in making business decisions.</p>
<p>Designed for a Range of Users</p>
<p>The GUI is designed with two groups of users in mind. Business analysts, who may have minimal statistical expertise, can quickly and easily navigate through the data mining process. Quantitative experts, who may want to explore the details, can access and fine-tune the underlying analytical processes. The GUI uses familiar desktop objects such as tool bars, menus, windows, and dialog tabs to equip both groups with a full range of data mining tools. Enterprise Miner is packaged into a central application workspace window named SAS Enterprise Miner.</p>
<p>[Figure: Enterprise Miner Integrated User Interface]</p>
<p>
The primary components of the SAS Enterprise Miner window include the following:</p>
<ul><li>Project Navigator: used to manage projects and diagrams, add tools such as analysis nodes to the diagram workspace, and view HTML reports that are created by the Reporter node.</li><li>Diagram Workspace: used for building, editing, and running process flow diagrams (PFDs).</li><li>Tools Bar: contains a subset of the Enterprise Miner tools, such as nodes, that are commonly used to build PFDs in the diagram workspace. You can add or delete tools from the Tools Bar.</li><li>Nodes: a collection of icons that enable you to perform steps in the data mining process such as data access, analysis, and reporting.</li><li>Message and indicator panels: displayed across the bottom of the window.</li></ul>
<p>With these easy-to-use tools, you can map out your entire data mining project, launch individual functions, and modify PFDs simply by pointing and clicking.</p>
<p>Tools sub-tab from the Project Navigator</p>
<p>You can select the Tools sub-tab from the Project Navigator at any time. This sub-tab functions as a palette that displays the data mining nodes available for constructing PFDs. Using the mouse, you can drag and drop nodes from the Tools sub-tab onto the diagram workspace and connect the nodes in the desired process flow. The arrows indicate the direction of the data flow.</p>
<p>Common Features Among Nodes</p>
<p>The nodes of Enterprise Miner software have a uniform look and feel. For example, the dialog tabs enable you to quickly access the appropriate options; the common node functionality enables you to learn the usage of the nodes quickly; and the Results Browser, which is available in many of the nodes, enables you to view the results of running the process flow diagrams.</p>
<p>The SEMMA Methodology</p>
<p>One of the keys to the effectiveness of Enterprise Miner is that the GUI makes it easy for business analysts as well as quantitative experts to organize data mining projects into a logical framework.
The visible representation of this framework is a PFD. It graphically illustrates the steps taken to complete an individual data mining project.</p>
<p>Sample, Explore, Modify, Model, Assess</p>
<p>In order to be applied successfully, the SAS data mining solution must be viewed as a process rather than a set of tools. SEMMA refers to a methodology that clarifies this process. Beginning with a statistically representative sample of your data, SEMMA makes it easy to apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and confirm a model's accuracy. Here is a quick look at each step in the SEMMA process:</p>
<ul><li>Sample your data by extracting a portion of a large data set big enough to contain the significant information, yet small enough to manipulate quickly.</li><li>Explore your data by searching for unanticipated trends and anomalies in order to gain understanding and ideas.</li><li>Modify your data by creating, selecting, and transforming the variables to focus the model selection process.</li><li>Model your data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.</li><li>Assess your data by evaluating the usefulness and reliability of the findings from the data mining process.</li></ul>
<p>By assessing the results gained from each stage of the SEMMA process, you can determine how to model new questions raised by the previous results, and thus proceed back to the exploration phase for additional refinement of the data. The SAS data mining solution integrates everything you need for discovery at each stage of the SEMMA process: data storage, OLAP, data visualization, decision trees, forecasting, neural networks, and other analytical tools. These rule-based data mining tools indicate patterns or exceptions, and mimic human abilities for comprehending spatial, geographical, and visual information sources.
The intuitive graphical interface is customizable, so you do not need to know how these tools work in order to use them. Complex mining techniques are carried out in a totally code-free environment, allowing you to concentrate on the visualization of the data, the discovery of new patterns, and new questions.</p>
<p>Logical Superstructure via the GUI</p>
<p>In addition, a larger, more general framework for staging data mining projects exists. This larger framework is the SEMMA methodology as defined by SAS Institute. SEMMA is simply an acronym for Sample, Explore, Modify, Model, and Assess. However, this logical superstructure provides users with a powerful and comprehensive method in which individual data mining projects can be developed and maintained. Not all data mining projects will need to follow each step of the SEMMA methodology (the tools in Enterprise Miner software give users the freedom to deviate from the process to meet their needs), but the methodology does give users a scientific, structured way of conceptualizing, creating, and evaluating data mining projects.</p>
<p>You can use the nodes of Enterprise Miner to follow the steps of the SEMMA methodology; the methodology is a convenient and productive superstructure in which to view the nodes logically and graphically. In addition, part of the strength of Enterprise Miner software comes from the fact that the relationship between the nodes and the methodology is flexible: you can build PFDs to fit particular data mining requirements, and you are not constrained by the GUI or the SEMMA methodology. For example, in many data mining projects, you may want to repeat parts of SEMMA by exploring data and plotting data at several points in the process. For other projects, you may want to fit models, assess those models, adjust the models with new data, and then re-assess.
</p>
<p>Comprehensive Model Evaluation and Selection</p>
<p>SEMMA is the logical framework for identifying reliable relationships in the data. The extent to which a relationship fits the data is traditionally measured by statistical quantities such as the average squared error or the misclassification rate. Alternatively, each modeling node can evaluate and select a candidate model using a set of profit (or loss) functions, which can be in the form of a decision matrix for categorical target variables. A decision matrix is created by specifying a profit (or loss) for each value of the target. When a model makes a prediction for an observation, each function is applied to the prediction, and the model selects the function that has the largest profit (or the smallest loss). The selected function is called the decision, and the set of candidate functions is the set of decision alternatives. Thus, a model not only makes a prediction for an observation, it also makes a decision about the observation and estimates a profit or loss. A profit function may become more realistic by subtracting a cost that depends on the individual observation. For each observation, each model estimates the profit over cost, or the return on investment.</p>
<p>Decision alternatives can be defined for data sets and shared across project diagrams. Decision alternatives can be tailored specifically for each target variable using a target profiler accessible in the Input Data Source node, the Data Set Attributes node, and the modeling nodes (Regression, Tree, Neural Network, User Defined Model, and Ensemble). You can also specify explicit prior probabilities for nominal and ordinal targets in the target profiler.</p>
<p>Warehouse Data</p>
<p>For obtaining samples of data from data warehouses and other types of data stores, Enterprise Min...</p>
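<p>The decision-matrix selection described under Comprehensive Model Evaluation and Selection can be sketched in a few lines of Python. This is a minimal illustration outside Enterprise Miner, not the software's implementation. It assumes that the model's prediction is a set of posterior probabilities for the target values and that each profit function is evaluated as an expected profit. The names <code>PROFIT</code> and <code>choose_decision</code>, the two decision alternatives, and every number in the matrix are hypothetical.</p>

```python
from typing import Dict, Optional, Tuple

# Hypothetical profit matrix for a binary fraud target (values invented for
# illustration): PROFIT[target_level][decision] is the profit of taking
# `decision` when the true target value is `target_level`.
PROFIT: Dict[str, Dict[str, float]] = {
    "fraud": {"investigate": 100.0, "ignore": -500.0},
    "legit": {"investigate": -20.0, "ignore": 0.0},
}


def choose_decision(
    posterior: Dict[str, float],
    profit: Dict[str, Dict[str, float]] = PROFIT,
    cost: Optional[Dict[str, float]] = None,
) -> Tuple[str, float]:
    """Return the decision alternative with the largest expected profit.

    posterior maps each target level to its predicted probability.
    cost optionally maps a decision to a cost for this observation,
    which is subtracted from that decision's expected profit.
    """
    cost = cost or {}
    # The decision alternatives are the column labels of the matrix.
    decisions = list(next(iter(profit.values())))
    expected: Dict[str, float] = {}
    for decision in decisions:
        gain = sum(p * profit[level][decision] for level, p in posterior.items())
        expected[decision] = gain - cost.get(decision, 0.0)
    best = max(expected, key=expected.get)
    return best, expected[best]


if __name__ == "__main__":
    # A fairly suspicious observation and a clearly ordinary one.
    print(choose_decision({"fraud": 0.10, "legit": 0.90}))
    print(choose_decision({"fraud": 0.01, "legit": 0.99}))
```

<p>With this hypothetical matrix, a missed fraud is far more expensive than an unnecessary investigation, so the chosen decision can be "investigate" even when fraud is the less likely outcome. That is the practical difference between selecting by profit and selecting by misclassification rate.</p>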