Data Mining Techniques and DSS

DM

Myths of DM

Techniques of DM

Myth #1: Data mining provides instant, crystal-ball predictions

Data mining is neither a crystal ball nor a technology where answers magically appear after pushing a single button. It's a multi-step process that includes: defining the business problem, exploring and conditioning data, developing the model, and deploying the knowledge gained. Typically, companies spend the bulk of their time preprocessing and conditioning the data to make sure it is clean, consistent, and combined properly to deliver business intelligence on which they can rely. Data mining is all about the data -- successful data mining requires data that accurately reflects the business.

Myth #2: Data mining is not yet viable for business application

Data mining is viable technology and highly prized for its business results.

The myth tends to be perpetuated by those who need to explain why they are not yet using the process, and it revolves around two related statements.

Myth #3: Data mining requires a data warehouse

It is true that data mining can benefit from warehoused data that is well organized, relatively clean, and easy to access.

This is particularly true if the warehouse has been constructed with data mining specifically in mind and with knowledge of the requirements of the data mining project.

However, the warehoused data may be less useful for data mining than the source or operational data. In the worst case, warehoused data may be completely useless (for example, if only summary data are stored).

Myth #4: DM is all about algorithms

A common misconception runs: "All you need for data mining is good algorithms. The better your algorithms, the better your data mining; advancing the effectiveness of data mining means advancing our knowledge of algorithms."

This view misunderstands the data mining process. Data mining is a process consisting of many elements: formulating business goals; mapping business goals to data mining goals; acquiring, understanding, and pre-processing the data; evaluating and presenting the results of analysis; and deploying these results to achieve business benefits.

This is not to minimize the importance of new or improved data mining algorithms.

Myth #5: DM should be done by a technology expert

Quite the opposite is true, due to the paramount importance of business knowledge in data mining. When performed without business knowledge, data mining can produce nonsensical or useless results, so it is essential that data mining be performed by someone with extensive knowledge of the business problem.

This is very seldom the same person who has extensive knowledge of data mining technology. It is the responsibility of data mining tool providers to ensure that tools are accessible to business users.

Myth #6: Data mining is for large companies with lots of customer data

The plain fact is that if a company, large or small, has data that accurately reflects the business or its customers, it can build models against that data that lend insights into important business challenges. The amount of customer data a company possesses has never been the issue.

Fundamental concepts of DM

Classification

Classification is the operation most commonly supported by commercial data mining tools. It is the process of sub-dividing a data set with regard to a number of specific outcomes, for example classifying customers into ‘high’ and ‘low’ categories with regard to credit risk. The category or ‘class’ into which each customer is placed is the ‘outcome’ of the classification.

Prediction

Prediction estimates future data states based on past and current data. Prediction can be viewed as a type of classification. Ex: predicting floods.

Techniques for Classification and prediction - decision trees, neural networks, nearest neighbour algorithms

Understanding vs. Prediction

Sophisticated classification techniques enable us to discover new patterns in large and complex data sets.

Classification is a powerful aid to understanding a particular problem. In some cases, improved understanding is sufficient. It may suggest new initiatives and provide information that improves future decision making.

Often the reason for developing an accurate classification model is to improve our capability for prediction.

Training

A classification model is said to be ‘trained’ on historical data, for which the outcome is known for each record.

But beware of overfitting: for example, a model that has learned only that 100 per cent of customers called Smith who live at 28 Arcadia Street responded to the offer describes the training data, not customers in general.

One would then use a separate test dataset of historical data to validate the model.

The model could then be applied to a new, unclassified data set in order to predict the outcome for each record.
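The train, validate, and apply sequence can be sketched in a few lines. This is a minimal illustration only: it assumes a 1-nearest-neighbour classifier, and the records, features, and function names are all invented for the example.

```python
# Toy 1-nearest-neighbour classifier with a held-out test set.
# All records and names here are hypothetical, chosen for illustration.

def distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train, features):
    """Return the label of the training record closest to `features`."""
    nearest = min(train, key=lambda rec: distance(rec[0], features))
    return nearest[1]

def accuracy(train, records):
    """Fraction of `records` whose predicted label matches the known one."""
    hits = sum(predict(train, f) == label for f, label in records)
    return hits / len(records)

# (income in $1000s, number of missed payments) -> known credit risk
historical = [
    ((20, 4), "high"), ((25, 3), "high"), ((30, 5), "high"), ((28, 2), "high"),
    ((60, 0), "low"), ((75, 1), "low"), ((80, 0), "low"), ((55, 1), "low"),
]

train = historical[::2]   # every other record is used for training
test = historical[1::2]   # the rest are held back for validation

train_acc = accuracy(train, train)  # each record is its own nearest neighbour
test_acc = accuracy(train, test)    # the figure that actually matters

# Apply the model to a new, unclassified record.
predicted = predict(train, (72, 0))
```

Accuracy on the training set says little here, because a 1-nearest-neighbour model simply memorises its training records. Only the score on the held-back records indicates whether the model generalises.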

Clustering

It is used to find groupings of similar records in a data set without any preconditions as to what that similarity may involve.

Clustering is used to identify interesting groups in a customer base that may not have been recognised before. It is often undertaken as an exploratory exercise before further data mining with a classification technique.

Techniques for Clustering - cluster analysis, neural networks
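As one possible illustration of cluster analysis, the sketch below uses k-means, which the slides do not name specifically. The customer data, the starting centroids, and the function names are assumptions made for the example.

```python
# Minimal k-means sketch: group records by similarity with no preset labels.
# Data, starting centroids, and names are hypothetical.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroid  # keep an empty cluster's centroid as-is
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# (age, annual spend in $100s) for six invented customers
customers = [(22, 5), (25, 7), (24, 6), (60, 40), (58, 42), (62, 38)]

centroids, clusters = kmeans(customers, centroids=[(22, 5), (60, 40)])
```

With this data the procedure separates the three younger, low-spend customers from the three older, high-spend ones. Real data rarely splits this cleanly, and the result can depend on the starting centroids.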

Association analysis

Association analysis looks for links between records in a data set.

Sometimes referred to as ‘market basket analysis’, its most common aim is to discover which items are generally purchased at the same time.

Example of Association Analysis

Consider the following beer and nappy example:

• 500,000 transactions in total

• 20,000 transactions contain nappies (4%)

• 30,000 transactions contain beer (6%)

• 10,000 transactions contain both nappies and beer (2%)

Sequential analysis

Sequential analysis looks for temporal links between purchases, rather than relationships between items in a single transaction.
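A rough sketch of the idea: for every ordered pair of items, count the customers who bought the first item and later the second. The purchase histories and the function name are invented for illustration, and real sequential-pattern miners are considerably more elaborate.

```python
from collections import Counter

def sequential_pairs(histories):
    """For each ordered pair (a, b), count the customers who bought a
    and, at some later point, bought b."""
    counts = Counter()
    for purchases in histories.values():
        seen_pairs = set()  # count each pair at most once per customer
        for i, first in enumerate(purchases):
            for later in purchases[i + 1:]:
                if first != later:
                    seen_pairs.add((first, later))
        counts.update(seen_pairs)
    return counts

# Hypothetical purchase histories, each listed in time order
histories = {
    "c1": ["printer", "ink", "paper"],
    "c2": ["printer", "paper", "ink"],
    "c3": ["ink", "printer"],
    "c4": ["printer", "ink"],
}

pairs = sequential_pairs(histories)
printer_then_ink = pairs[("printer", "ink")]
ink_then_printer = pairs[("ink", "printer")]
```

Unlike market basket analysis, the order matters: here "printer then ink" is seen for three customers, while "ink then printer" is seen for only one.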

Support (or prevalence)

Measures how often items occur together, as a percentage of the total transactions. In this example, beer and nappies occur together 2% of the time (10,000/500,000).

Confidence (or predictability)

Measures how much a particular item is dependent on another.

Because 20,000 transactions contain nappies and 10,000 of these transactions contain beer, when people buy nappies, they also buy beer 50% of the time.

The confidence for the rule "When people buy nappies, they also buy beer" is therefore 50%.

Because 30,000 transactions contain beer and 10,000 of these transactions contain nappies, when people buy beer, they also buy nappies 33.33% of the time.

Expected Confidence

In the absence of any knowledge about what else was bought, we can also make the following assertions from the available data:

• People buy nappies 4% of the time.

• People buy beer 6% of the time.

These numbers, 4% and 6%, are called the expected confidence of buying nappies or beer, regardless of what else is purchased.

Lift

Measures the ratio between the confidence of a rule and the expected confidence that the second product will be purchased. Lift is a measure of the strength of an effect.

In our example, the confidence of the nappies-beer buying rule is 50%, whilst the expected confidence that an arbitrary customer will buy beer is 6%. So the lift provided by the nappies-beer rule is 8.33 (= 50%/6%).
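The four measures can be computed directly from the transaction counts in the beer and nappy example. The function and variable names below are illustrative only.

```python
def association_metrics(total, count_a, count_b, count_both):
    """Metrics for the rule 'when people buy A, they also buy B'."""
    support = count_both / total          # how often A and B occur together
    confidence = count_both / count_a     # share of A-transactions containing B
    expected_confidence = count_b / total # how often B is bought regardless of A
    lift = confidence / expected_confidence
    return support, confidence, expected_confidence, lift

# Counts from the example: 500,000 transactions, 20,000 with nappies,
# 30,000 with beer, 10,000 with both.
support, confidence, expected, lift = association_metrics(
    total=500_000, count_a=20_000, count_b=30_000, count_both=10_000
)

# The reverse rule (beer -> nappies) swaps the roles of A and B.
_, rev_confidence, rev_expected, rev_lift = association_metrics(
    total=500_000, count_a=30_000, count_b=20_000, count_both=10_000
)
```

The two rules have different confidences (50% and 33.33%) but the same lift of about 8.33, because lift treats the two items symmetrically.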

Forecasting

Forecasting (unlike prediction based on classification models) concerns the prediction of continuous values, such as a person’s income based on various personal details, or the level of the stock market.

Simpler forecasting problems involve a single continuous value based on a series of unordered examples. A more complex problem is to predict one or more values based on a sequential pattern.

Techniques include statistical time-series analysis as well as neural networks.

Techniques used in DM

Regression:

This is used to map a data item to a real-valued prediction variable. Ex: a college professor wishing to estimate his future savings.
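A minimal sketch of the savings example, assuming simple linear regression fitted by ordinary least squares. The savings figures are invented, and a straight line is only one of many possible model forms.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: savings (in $1000s) at the end of years 1 to 5
years = [1, 2, 3, 4, 5]
savings = [12.0, 14.5, 15.5, 18.5, 20.0]

slope, intercept = fit_line(years, savings)

# Map a new data item (year 8) to a real-valued prediction.
forecast_year_8 = intercept + slope * 8
```

For these figures the fitted line rises by about 2 (thousand dollars) per year, giving an estimate of roughly 26.1 for year 8. Extrapolating this far beyond the observed years assumes the trend continues unchanged.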

Time series analysis:

Here the value of an attribute is examined as it varies over time. Ex: a company trying to decide which stock to purchase: that of X, Y, or Z.
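One of the simplest time-series tools is a moving average, which smooths short-term fluctuations so that the underlying trend is easier to see. The sketch below uses invented prices and is not drawn from the slides.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical closing prices for one stock on six consecutive days
prices = [10, 12, 11, 13, 15, 14]

smoothed = moving_average(prices, window=3)
```

Here the raw prices move up and down from day to day, while the three-day averages (11, 12, 13, 14) rise steadily.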

Summarization: This maps the data into subsets with associated simple descriptions. It is also called characterization or generalization. Ex: one basis for comparing universities in the US is the average SAT or ACT score.

Association rules: This is a model that identifies specific types of data associations. Ex: a grocery store trying to decide whether to put bread on sale.

Overview of DM

Data Mining Steps

1. Collect the data

2. Clean the data

3. Determine what is desired

4. Determine optimal method/tool

5. Mine the data

6. Analyze and verify the results

7. Use the results

Data Mining Input

Data mining can effectively deal with inconsistencies in your data. Even if your sources are clean, integrated, and validated, they may contain data about the real world that is simply not true. This noise can, for example, be caused by errors in user input or by plain mistakes made by customers filling in questionnaires. If it does not occur too often, data mining tools are able to ignore the noise and still find the overall patterns that exist in your data.

Data Mining Output

The output of data mining can provide you with more flexibility. For example, if you have a budget to mail information to 1000 people about a new product, queries or OLAP analysis directly on your data will never be able to select exactly that number of people from your database. By enhancing your data with an attribute that you can use in your query or OLAP analysis, data mining enables you to find the 1000 people most likely to respond. This example also shows that data mining is not replacing OLAP, but enhancing it.
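The mailing example can be sketched as follows: a model adds a score attribute to each customer record, and an ordinary sort then selects exactly as many people as the budget allows. The scores, field names, and function name are hypothetical, and the example uses a budget of 2 rather than 1000 to stay short.

```python
def top_n(scored_customers, n):
    """Return the n customers with the highest response score."""
    ranked = sorted(scored_customers, key=lambda c: c["score"], reverse=True)
    return ranked[:n]

# 'score' stands for a model's estimated likelihood of responding;
# the values are invented for illustration.
customers = [
    {"id": 1, "score": 0.12},
    {"id": 2, "score": 0.87},
    {"id": 3, "score": 0.45},
    {"id": 4, "score": 0.91},
    {"id": 5, "score": 0.30},
]

mailing_list = [c["id"] for c in top_n(customers, 2)]
```

The query itself stays simple. The data mining step contributes only the extra attribute, which is why it enhances query and OLAP tools rather than replacing them.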

The Future of Data Mining

In the short-term, the results of data mining will be in profitable, if mundane, business related areas. Micro-marketing campaigns will explore new niches. Advertising will target potential customers with new precision.

In the medium term, data mining may be as common and easy to use as e-mail. We may use these tools to find the best airfare to New York, root out a phone number of a long-lost classmate, or find the best prices on lawn mowers.

The long-term prospects are truly exciting. Imagine intelligent agents turned loose on medical research data or on sub-atomic particle data. Computers may reveal new treatments for diseases or new insights into the nature of the universe. There are potential dangers, though, as discussed below.

Privacy Concerns

What if every telephone call you make, every credit card purchase you make, every flight you take, every visit to the doctor you make, every warranty card you send in, every employment application you fill out, every school record you have, your credit record, every web page you visit ... was all collected together? A lot would be known about you! This is an all-too-real possibility.

Is this too much information about too many people for anybody to make sense of? Not with data mining tools running on massively parallel processing computers! Would you feel comfortable about someone having access to all this data about you? And remember, all this data does not have to reside in one physical location; as the net grows, information of this type becomes more available to more people.

Proposed solutions might be:

• Data are intentionally modified from their original version, in order to misinform the recipients or to protect privacy and security.

• Legislation designed to protect consumers against data security failures by, among other things, requiring companies to notify consumers when their personal information has been compromised.

Expanding universe of data

Nowadays, the world is regarded as an expanding universe of data. We have vast amounts of data, yet little information. Some people see this as a new paradox of data growth: more data means less information. There is therefore an urgent need for new techniques to find the required information in huge amounts of data.

The following factors make data mining a very important technique for extracting implicit, previously unknown, and potentially useful knowledge from data:

• Data mining algorithms can find "optimal" clusterings or interesting regularities in a database.

• Data mining algorithms typically zoom in on interesting sub-parts of databases.

• Networks make it easy to connect databases.

• Machine learning techniques make it easier to find interesting connections in a database.

• The client/server revolution.

• Information as a factor of production

• Increase in available data

• Exacerbated by the World Wide Web

• Information overload

• Computer assistance to filter, select and interpret data

• Extend this to allow computers to discover relevant information

• In the future machine assistance will become more and more important

Architecture of Data Mining

Components explained

Database, data warehouse, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.


Knowledge base: This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns.

Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.

Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included.

Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
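A concept hierarchy can be as simple as a lookup table that maps low-level attribute values to a higher level of abstraction. The sketch below rolls city-level records up to regions; the cities, regions, and names are all invented for illustration.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> region
CITY_TO_REGION = {
    "Leeds": "North",
    "York": "North",
    "Bristol": "South",
    "Exeter": "South",
}

def roll_up(values, hierarchy):
    """Count records at the higher level of abstraction."""
    return Counter(hierarchy[value] for value in values)

# City recorded for each of five invented sales
sales_cities = ["Leeds", "York", "Bristol", "Leeds", "Exeter"]

by_region = roll_up(sales_cities, CITY_TO_REGION)
```

Patterns that are too sparse to notice at city level may become visible once the data is viewed by region, which is one way the knowledge base can guide the search.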


Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, evolution and deviation analysis.


Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns.

It may access interestingness thresholds stored in the knowledge base.

Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.
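In its simplest form, pattern evaluation can be sketched as a filter that keeps only the patterns meeting stored interestingness thresholds. Minimum support and minimum confidence are used here as the measures; the rules, the values, and the function name are hypothetical.

```python
def interesting(rules, min_support, min_confidence):
    """Keep only the rules that meet both interestingness thresholds."""
    return [
        r for r in rules
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]

# Candidate rules as a mining module might produce them (invented values)
rules = [
    {"rule": "nappies -> beer", "support": 0.02, "confidence": 0.50},
    {"rule": "bread -> milk", "support": 0.10, "confidence": 0.30},
    {"rule": "caviar -> vodka", "support": 0.001, "confidence": 0.90},
]

# Thresholds of the kind that could be stored in the knowledge base
kept = [r["rule"] for r in interesting(rules, min_support=0.01, min_confidence=0.4)]
```

Applying the thresholds after mining is the simplest arrangement. Pushing them into the mining module itself, as the integrated design does, avoids generating uninteresting patterns in the first place.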


Graphical user interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.

In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

Classification of DM

Classification according to the kinds of databases mined. A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.

For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems, and legacy data mining systems.


Classification according to the kinds of knowledge mined. Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities, such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc.


Classification according to the kinds of techniques utilized. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on).

A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique which combines the merits of a few individual approaches.

Decision Support Systems (DSS)

A decision support system is a computer-based system that supports the decision-making process.

• Assist decision makers in semi-structured tasks

• Support, not replace, human judgment

• Highly interactive

• Improve effectiveness of human decision makers

DSS characteristics

• Provide support in semi-structured and unstructured situations, combining human judgment and computerized information

• Support for various managerial levels

• Support to individuals and groups

• Support to interdependent and/or sequential decisions

• Support all phases of the decision-making process

• Support a variety of decision-making processes and styles

• Are adaptive

• Have user-friendly interfaces

• Goal: improve effectiveness of decision making

• The decision maker controls the decision-making process

• End-users can build simple systems

• Utilizes models for analysis

• Provides access to a variety of data sources, formats, and types

Why DSS?

• Increasing complexity of decisions

o Technology

o Information: “Data, data everywhere, and not the time to think!”

o Number and complexity of options

o Pace of change

• Increasing availability of computerized support

o Inexpensive high-powered computing

o Better software

o More efficient software development process

• Increasing usability of computers

o COTS (Commercial Off The Shelf) tools

o Customization

Types of Problems

• Structured

o Repetitive

o Standard solution methods exist

o Complete automation may be feasible

• Unstructured

o One-time

o No standard solutions

o Rely on judgment

o Automation is usually infeasible

• Semi-structured

o Some elements and/or phases of the decision-making process are repetitive

Decision Support Trends

• IT is increasingly pervasive

• Users are increasingly computer savvy

• Computer hardware is increasingly smaller and more powerful

• Systems are increasingly interconnected

• The Web is increasingly interwoven into all aspects of our lives

• Demand for usable, flexible, powerful decision support will continue to grow

• Decision support will be embedded into a wide variety of consumer and business products

Humans and Computers: Complementary Strengths

• Human decision makers

o Good at seeing patterns

o Can work with incomplete problem representations

o Exercise subtle judgment we do not know how to automate

o Often unaware of how they perform tasks

o Poor at integrating large numbers of cues

o Unreliable and slow at tedious bookkeeping tasks and complex calculations

• Computers

o Still inferior to humans at pattern recognition and at messy, unstructured problems

o Good at integrating large numbers of features

o Good at tedious bookkeeping

o Rapid and accurate at complex calculations

DSS classifications

Model Driven DSS: A model-driven DSS emphasizes access to and manipulation of financial, optimization and/or simulation models. Simple quantitative models provide the most elementary level of functionality.

Data Driven DSS: Data-driven DSS emphasizes access to and manipulation of a time-series of internal company data and sometimes external and real-time data. Simple file systems accessed by query and retrieval tools provide the most elementary level of functionality.

Communication Driven DSS: Communications-driven DSS use network and communications technologies to facilitate decision-relevant collaboration and communication. In these systems, communication technologies are the dominant architectural component.

Document Driven DSS: Document-driven DSS uses computer storage and processing technologies to provide document retrieval and analysis. Large document databases may include scanned documents, hypertext documents, images, sounds and video.

DSS architecture

Data Management subsystem

Consists of the DSS database, database management system, data directory, and query facility. It does the following:

• Captures/extracts data for inclusion in a DSS database

• Updates (adds, deletes, edits, changes) data records and files

• Interrelates data from different sources

• Retrieves data from the database for queries and reports

• Provides comprehensive data security (protection from unauthorised access, recovery capabilities, etc.)

• Handles personal and unofficial data so that users can experiment with alternative solutions based on their own judgement

• Performs complex data manipulation tasks based on queries

• Tracks data use within DSS

• Manages data through a data dictionary

Model Management Sub system

An analog of the database management subsystem. It consists of the model base, model base management system, modeling language, model directory, and a model execution, integration, and command processor. Model types:

• Strategic models: non-routine, e.g. mergers, impact analysis, capital budgeting

• Tactical models: allocation and control, e.g. labor requirements, sales promotion planning

• Operational models: routine, day-to-day, e.g. production scheduling, inventory control, quality control

• Analytical models: SAS, SPSS, OR, data mining

KBS

Knowledge-based subsystem

• Provides expertise in solving complex unstructured and semi-structured problems

• Expertise provided by an expert system or other intelligent system

• Advanced DSS have a knowledge-based (management) component

• Leads to intelligent DSS

• Example: data mining

User interface

User interface subsystem

• Includes all communication between a user and the MSS

• Graphical user interfaces (GUI)

• Voice recognition and speech synthesis possible

User

Different usage patterns for the user, the manager, or the decision maker:

• Managers

• Staff specialists

• Intermediaries

1. Staff assistant

2. Expert tool user

3. Business (system) analyst

4. GSS facilitator