July 13, 2015 ICS426: Introduction 1 Data Warehousing and Data Mining
Post on 22-Dec-2015
April 19, 2023 ICS426: Introduction 2
Course Overview
Introduction
Data Preprocessing
DW and OLAP
Data Mining
Motivation
Data flood
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
There is a tremendous increase in the amount of data recorded and stored on digital media
We are producing over two exabytes (1 exabyte = 10^18 bytes) of data per year
Storage capacity, for a fixed price, appears to be doubling approximately every 9 months
Data stored in the world’s databases doubles every 20 months
Other growth rate estimates are even higher
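To make the doubling rates above concrete, here is a minimal sketch of the growth factor implied by a fixed doubling period. The function name `growth_factor` is hypothetical, and the model assumes idealized exponential growth, which the estimates on this slide only approximate.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Return the multiplicative growth after `months` under a fixed doubling period."""
    return 2 ** (months / doubling_period_months)

# Data volume doubling every 20 months: growth over 5 years (60 months).
data_growth = growth_factor(60, 20)      # 2 ** 3

# Storage per fixed price doubling every 9 months: growth over 3 years (36 months).
storage_growth = growth_factor(36, 9)    # 2 ** 4

print(f"Data volume after 5 years: x{data_growth:.0f}")
print(f"Storage per fixed price after 3 years: x{storage_growth:.0f}")
```

Under these assumptions, data volume grows eightfold in five years, while storage per fixed price grows sixteenfold in three.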
Data, Data everywhere - yet ...
I can’t find the data I need: data is scattered over the network; many versions, subtle differences
I can’t get the data I need: need an expert to get the data
I can’t understand the data I found: available data poorly documented
I can’t use the data I found: results are unexpected; data needs to be transformed from one form to another
Motivation
Very little data will ever be looked at by a human.
We are drowning in data, but starving for knowledge!
“The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.”
Knowledge Discovery is NEEDED to make sense and use of data.
Solution: Data warehousing and data mining
Data warehousing and On-Line Analytical Processing (OLAP)
Data mining: extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
KDD Process: Several Key Steps
Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation
Data mining
summarization, classification, regression, association, clustering
Pattern evaluation and knowledge presentation
Use of discovered knowledge
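The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the records, the function names (`select`, `clean`, `transform`, `mine`, `evaluate`) and the threshold are all hypothetical, and the mining step performs summarization only.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, region, purchase_amount).
# None marks a missing value.
raw = [
    (1, "east", 120.0),
    (2, "east", None),
    (3, "west", 80.0),
    (4, "west", 100.0),
    (5, "north", 300.0),
]

def select(records, regions):
    """Create the target data set: keep only the regions of interest."""
    return [r for r in records if r[1] in regions]

def clean(records):
    """Data cleaning: drop records with a missing purchase amount."""
    return [r for r in records if r[2] is not None]

def transform(records):
    """Data reduction and transformation: group amounts by region."""
    by_region = {}
    for _, region, amount in records:
        by_region.setdefault(region, []).append(amount)
    return by_region

def mine(by_region):
    """Data mining (summarization only): average purchase per region."""
    return {region: mean(amounts) for region, amounts in by_region.items()}

def evaluate(summary, threshold):
    """Pattern evaluation: keep regions whose average meets the threshold."""
    return {region: avg for region, avg in summary.items() if avg >= threshold}

target = select(raw, {"east", "west"})
summary = mine(transform(clean(target)))
interesting = evaluate(summary, threshold=100.0)

print(summary)
print(interesting)
```

Each function maps to one step of the process, so a step can be swapped out (for example, a different cleaning rule) without touching the others.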
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
[Barry Devlin]
What are the users saying...
Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]
Data → Information
Evolution
60’s: Batch reports; hard to find and analyze information; inflexible and expensive, reprogram every new request
70’s: Terminal-based DSS and EIS (executive information systems); still inflexible, not integrated with desktop tools
80’s: Desktop data access and analysis tools (query tools, spreadsheets, GUIs); easier to use, but only access operational databases
90’s: Data warehousing with integrated OLAP engines and tools
2000’s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems
Very Large Data Bases
Terabytes -- 10^12 bytes:
Petabytes -- 10^15 bytes:
Exabytes -- 10^18 bytes:
Zettabytes -- 10^21 bytes:
Yottabytes -- 10^24 bytes:
Walmart -- 24 Terabytes
Geographic Information Systems
National Medical Records
Weather images
Intelligence Agency Videos
Data Warehousing -- It is a process
Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
A decision support database maintained separately from the organization’s operational database
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
Data Warehouse Architecture
Data sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems
Extraction, Cleansing
Optimized Loader
Data Warehouse Engine
Metadata Repository
Analyze, Query
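The flow through these components can be sketched in a few lines. This is a toy illustration under stated assumptions, not a real warehouse: the source rows, the `cleanse` helper and the `sales` table are hypothetical, an in-memory SQLite database stands in for the warehouse engine, and the metadata repository is not modeled.

```python
import sqlite3

# Hypothetical extracts from two source systems with inconsistent formatting.
relational_rows = [("Alice ", "EAST", 120.0), ("bob", "west", 80.0)]
legacy_rows = [("CAROL", " East", 50.0), ("dave", "WEST", None)]  # None = missing amount

def cleanse(rows):
    """Extraction/cleansing: normalize text fields and drop incomplete rows."""
    cleaned = []
    for name, region, amount in rows:
        if amount is None:
            continue
        cleaned.append((name.strip().title(), region.strip().lower(), amount))
    return cleaned

# Stand-in for the warehouse engine: an in-memory SQLite database.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, region TEXT, amount REAL)")

# Loader: bulk-insert the cleansed rows from every source.
for source in (relational_rows, legacy_rows):
    warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleanse(source))
warehouse.commit()

# Analyze/query: total sales per region across all sources.
totals = warehouse.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

Cleansing happens before loading, so the query side sees one consistent spelling of each region regardless of which source a row came from.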
Data Warehouse for Decision Support & OLAP
Putting information technology to work helping the knowledge worker make faster and better decisions
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over the last 10 years?
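A question such as the one about product promotions typically becomes an aggregate query over historical data. The sketch below assumes a hypothetical `sales` table and approximates "impact" as total revenue per promotion, which is a deliberate simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, promotion TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("laptop", "coupon", 900.0),
        ("laptop", "bundle", 1500.0),
        ("phone", "coupon", 400.0),
        ("phone", "bundle", 700.0),
        ("phone", "none", 300.0),
    ],
)

# Roll revenue up from individual sales to the promotion level,
# then rank promotions by total revenue.
ranking = conn.execute(
    """
    SELECT promotion, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY promotion
    ORDER BY total_revenue DESC
    """
).fetchall()
print(ranking)
```

Grouping by `product, promotion` instead would drill down to the per-product view of the same data.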
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgements
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection:
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data Mining helps extract such information
Why DM: A producer wants to know…
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?
What is Data Mining?
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc
Many Definitions
Non-trivial extraction of implicit, previously unknown and potentially useful information from huge amounts of data
Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining: Confluence of Multiple Disciplines
20x20 grid: 2^400 ≈ 10^120 possible patterns
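The pattern count can be checked with exact integer arithmetic. The sketch assumes each of the 20 x 20 cells takes one of two states, which is what 2^400 implies.

```python
import math

cells = 20 * 20                     # number of cells in a 20x20 grid
patterns = 2 ** cells               # assumes each cell is either on or off
exponent = cells * math.log10(2)    # base-10 exponent of 2^400

print(len(str(patterns)))           # number of decimal digits in 2^400
print(f"2^{cells} is about 10^{exponent:.1f}")
```

The exponent comes out slightly above 120, so 10^120 is a fair order-of-magnitude estimate for the number of candidate patterns, far too many to enumerate.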
Some basic operations
Predictive: Regression, Classification, Collaborative Filtering
Descriptive: Clustering / similarity matching, Association rules and variants, Deviation detection
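As one illustration of a descriptive operation, the sketch below computes the two standard measures of an association rule, support and confidence, on a handful of hypothetical market-basket transactions. The data and the helper names are assumptions made for this example.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule under test: bread -> milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here bread and milk appear together in two of four transactions, and two of the three bread transactions also contain milk.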
Applications …
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
from an online stream of events, identify fraudulent events
Manufacturing and production:
automatically adjust knobs when process parameters change
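The banking case, predicting good customers from old ones, is a classification task. A minimal sketch follows, using a nearest-neighbour rule chosen here purely for brevity; the slides do not prescribe a method. The customer features, labels and the `predict` helper are hypothetical.

```python
import math

# Hypothetical past customers: (income in $1000s, years as customer) and loan outcome.
history = [
    ((30.0, 1.0), "bad"),
    ((35.0, 2.0), "bad"),
    ((80.0, 6.0), "good"),
    ((95.0, 8.0), "good"),
]

def predict(applicant, history):
    """Label a new applicant with the label of the closest past customer."""
    nearest = min(history, key=lambda item: math.dist(applicant, item[0]))
    return nearest[1]

print(predict((85.0, 5.0), history))
print(predict((32.0, 3.0), history))
```

In practice the features would need scaling so that income does not dominate the distance; that step is omitted here.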
… Applications
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationships between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
identify new galaxies by searching for sub clusters
Web site/store design and promotion:
find affinity of visitor to pages and modify layout
The course
Figure: course roadmap linking DS, DP, DW, OLAP and DM (Association, Classification, Clustering), with labels (2) through (7)
DS = Data source
DW = Data warehouse
DM = Data Mining
DP = Data processing