Ranking spatial data by quality preferences

8/12/2019 RANKING SPATIAL DATA BY QUALITY PREFERENCES


1. INTRODUCTION

1.1 SPATIAL DATA

Spatial data, also known as geospatial data or geographic information, is the data

or information that identifies the geographic location of features and boundaries

on Earth, such as natural or constructed features, oceans, and more. Spatial data

is usually stored as coordinates and topology, and is data that can be mapped.

Spatial data is often accessed, manipulated or analyzed through Geographic

Information Systems (GIS).

According to [Ries1993], spatial information can be divided into three

primary architectures: data, function, and organization. Data architecture

describes objects used as input to, or created by, a function. Functional

architecture describes the activity performed between two types of data.

Organization architecture describes the mission, policies, and rules, which

determine and shape the former two. By exploring each of these architectures,

one can develop a framework to establish spatial design requirements. Spatial

data can be displayed in different ways: point, line, polygon, surface, volume, and

pixel. Each of these display mechanisms has three components: location, shape,

and topology. These components help define the scope of the design

requirements for each Location Control Management level. Along with data, its

properties are equally important. Data properties describe the quality, or

condition of the spatial data.

A spatial preference query ranks objects based on the qualities of features in their

spatial neighborhood. For example, using a real estate agency database of flats



for lease, a customer may want to rank the flats with respect to the

appropriateness of their location, defined after aggregating the qualities of other

features (e.g., restaurants, cafes, hospital, market, etc.) within their spatial

neighborhood. Such a neighborhood concept can be specified by the user via

different functions. It can be an explicit circular region within a given distance

from the flat. Another intuitive definition is to consider the whole spatial domain

and assign higher weights to the features based on their proximity to the flat.

1.2 SPATIAL QUERY EVALUATION

In this project, we formally define spatial preference queries and propose

appropriate indexing techniques and search algorithms for them. Extensive

evaluation of our methods on both real and synthetic data reveals that an

optimized branch-and-bound solution is efficient and robust with respect to

different parameters.

Spatial database systems manage large collections of geographic entities,

which, apart from spatial attributes, contain non-spatial information (e.g.,

name, size, type, price). In this project, we study an interesting type of

preference query, which selects the best spatial location with respect to the

quality of facilities in its spatial neighborhood.

Given a set D of interesting objects (e.g., candidate locations), a top-k spatial

preference query retrieves the k objects in D with the highest scores. The score of

an object is defined by the quality of features (e.g., facilities or services) in its

spatial neighborhood. As a motivating example, consider a real estate agency

office that holds a database with available flats for lease. Here feature refers to

a class of objects in a spatial map such as specific facilities or services.



A customer may want to rank the contents of this database with respect to

the quality of their locations, quantized by aggregating non-spatial characteristics

of other features (e.g., restaurants, cafes, hospital, market, etc.) in the spatial

neighborhood of the flat (defined by a spatial range around it). Quality may be

subjective and query-parametric. For example, a user may define quality with

respect to non-spatial attributes of restaurants around it (e.g., whether they serve

seafood, price range, etc.).

As another example, the user (e.g., a tourist) wishes to find a hotel p that is

close to a high-quality restaurant and a high-quality cafe.

FIGURE 1.1 EXAMPLES OF TOP-K SPATIAL PREFERENCE QUERIES

Figure 1.1 illustrates the locations of an object dataset D (hotels) in white, and

two feature datasets: the set F1 (restaurants) in gray, and the set F2 (cafes) in

black. Feature points are labeled by quality values that can be obtained from

rating providers.



1.3 COMBINATION OF QUALITIES AND RELATIVE LOCATION OF POINTS

A simple score instance, called the range score, binds the neighborhood

region to a circular region centered at p with radius ε (shown as a circle), and the

aggregate function to SUM. For instance, the maximum qualities of gray and black

points within the circle of p1 are 0.9 and 0.6 respectively, so the score of p1 is

τ(p1) = 0.9 + 0.6 = 1.5. Similarly, we obtain τ(p2) = 1.0 + 0.1 = 1.1 and

τ(p3) = 0.7 + 0.7 = 1.4. Hence, the hotel p1 is returned as the top result. In fact,

the semantics of the aggregate function is relevant to the user's query. The SUM

function attempts to balance the overall qualities of all features. For the MIN

function, the top result becomes p3, with the score τ(p3) = MIN{0.7, 0.7} = 0.7. It

ensures that the top result has reasonably high qualities in all features. For the

MAX function, the top result is p2, with τ(p2) = MAX{1.0, 0.1} = 1.0. It is used to

optimize the quality in a particular feature, but not necessarily all of them.
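
The range score arithmetic above can be sketched in a few lines of Python. This is a brute-force illustration only, not the branch-and-bound method evaluated in this project. The coordinates, the radius value of 2.0, and the names `component_score`, `range_score`, and `top_k` are assumptions, chosen so that the qualities falling inside each circle match the worked example.

```python
import math

# Hypothetical coordinates; the figure's real coordinates are not given.
OBJECTS = {"p1": (0.0, 0.0), "p2": (10.0, 0.0), "p3": (20.0, 0.0)}

FEATURE_SETS = [
    # F1 (restaurants): (x, y, quality)
    [(1.0, 0.0, 0.9), (0.0, 1.0, 0.5), (10.0, 1.0, 1.0),
     (20.0, 1.0, 0.7), (5.0, 5.0, 0.8)],
    # F2 (cafes): (x, y, quality)
    [(0.0, -1.0, 0.6), (10.0, -1.0, 0.1), (20.0, -1.0, 0.7),
     (15.0, 5.0, 0.95)],
]


def component_score(point, features, radius):
    """Highest quality among the features within `radius` of `point`."""
    px, py = point
    in_range = [q for (x, y, q) in features
                if math.hypot(x - px, y - py) <= radius]
    return max(in_range, default=0.0)  # 0.0 when the circle is empty


def range_score(point, feature_sets, radius, agg=sum):
    """Aggregate (SUM, MIN or MAX) of the per-feature-set component scores."""
    return agg(component_score(point, fs, radius) for fs in feature_sets)


def top_k(objects, feature_sets, radius, k=1, agg=sum):
    """Score every object by brute force and return the k best (score, name)."""
    scored = [(range_score(pt, feature_sets, radius, agg), name)
              for name, pt in objects.items()]
    return sorted(scored, key=lambda pair: -pair[0])[:k]


if __name__ == "__main__":
    for label, agg in (("SUM", sum), ("MIN", min), ("MAX", max)):
        print(label, top_k(OBJECTS, FEATURE_SETS, radius=2.0, k=1, agg=agg))
```

Passing `sum`, `min`, or `max` as `agg` switches the aggregate function without touching the neighborhood logic, which mirrors how the text separates the two choices.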

Object ranking is a popular retrieval task in various applications. In relational

databases, we rank tuples using an aggregate score function on their attribute values. For example, a real estate agency maintains a database that contains

information of flats available for rent. A potential customer wishes to view the

top-10 flats with the largest sizes and lowest prices. In this case, the score of each

flat is expressed by the sum of two qualities: size and price, after normalization to

the domain [0, 1] (e.g., 1 means the largest size and the lowest price). In spatial

databases, ranking is often associated with nearest neighbor (NN) retrieval. Given a

query location, we are interested in retrieving the set of nearest objects to it that

satisfy a condition (e.g., restaurants).
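
The relational ranking described above (sum of size and price after normalization to [0, 1]) might look as follows. The rows in `FLATS` and the helper names are hypothetical, and min-max scaling is assumed as the normalization, since the text does not specify one.

```python
import heapq

# Hypothetical rows: (flat id, size in square meters, monthly price).
FLATS = [("a", 50, 1000), ("b", 80, 1500), ("c", 65, 900), ("d", 120, 3000)]


def normalize(value, lo, hi, invert=False):
    """Min-max scale to [0, 1]; invert=True makes the lowest value score 1."""
    if hi == lo:
        return 1.0
    scaled = (value - lo) / (hi - lo)
    return 1.0 - scaled if invert else scaled


def top_k_flats(flats, k):
    """Return the k flats with the highest size-plus-price score."""
    sizes = [f[1] for f in flats]
    prices = [f[2] for f in flats]

    def score(flat):
        size_quality = normalize(flat[1], min(sizes), max(sizes))
        price_quality = normalize(flat[2], min(prices), max(prices), invert=True)
        return size_quality + price_quality

    # nlargest returns the k best rows in descending score order.
    return heapq.nlargest(k, flats, key=score)


if __name__ == "__main__":
    print(top_k_flats(FLATS, 2))
```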

Assuming that the set of interesting objects is indexed by an R-tree, we can

apply distance bounds and traverse the index in a branch-and-bound fashion to



obtain the answer. Nevertheless, it is not always possible to use multidimensional

indexes for top-k retrieval. First, such indexes break down in high-dimensional

spaces.

Second, top-k queries may involve an arbitrary set of user-specified

attributes (e.g., size and price) from possible ones (e.g., size, price, distance to the

beach, number of bedrooms, floor, etc.), and indexes may not be available for all

possible attribute combinations (i.e., they are too expensive to create and

maintain).

Third, information for different rankings to be combined (i.e., for different

attributes) could appear in different databases (in a distributed database

scenario) and unified indexes may not exist for them. Solutions for top-k queries

focus on the efficient merging of object rankings that may arrive from different

(distributed) sources. Their motivation is to minimize the number of accesses to

the input rankings until the objects with the top-k aggregate scores have been

identified. To achieve this, upper and lower bounds for the objects seen so far are

maintained while scanning the sorted lists. The most popular spatial access

method is the R-tree, which indexes minimum bounding rectangles (MBRs) of

objects. R-trees can efficiently process the main spatial query types, including spatial

range queries, nearest neighbor queries, and spatial joins. Given a spatial region

W, a spatial range query retrieves from D the objects that intersect W.



For instance, consider a range query that asks for all objects within the

shaded area. Starting from the root of the tree, the query is processed by

recursively following entries having MBRs that intersect the query region. For

instance, e1 does not intersect the query region, so the subtree under e1 need not be visited.

FIGURE 1.2 SPATIAL QUERIES ON R-TREES

Figure 1.2 shows a set D = {p1, p2, ..., p8} of spatial objects (e.g., points) and an

R-tree that indexes them.
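
The recursive range search can be illustrated with a hand-built two-level tree. This is a sketch under stated assumptions: the point coordinates, the grouping into leaves `E1` and `E2`, and the query window are hypothetical, and the tree is assembled manually rather than by a real R-tree insertion and node-splitting algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)
Point = Tuple[str, float, float]           # (name, x, y)


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)  # internal entries
    points: List[Point] = field(default_factory=list)     # leaf entries


def intersects(a: Rect, b: Rect) -> bool:
    """True when two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def mbr_of(points: List[Point]) -> Rect:
    xs = [p[1] for p in points]
    ys = [p[2] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def range_query(node: Node, window: Rect) -> List[str]:
    """Follow only the entries whose MBR intersects the query window."""
    if not intersects(node.mbr, window):
        return []  # prune: nothing below this node can be in the window
    hits = [name for (name, x, y) in node.points
            if window[0] <= x <= window[2] and window[1] <= y <= window[3]]
    for child in node.children:
        hits.extend(range_query(child, window))
    return hits


# Hypothetical layout: two leaves holding four points each.
LEAF1 = [("p1", 1, 1), ("p2", 2, 3), ("p3", 3, 2), ("p4", 2, 2)]
LEAF2 = [("p5", 6, 6), ("p6", 7, 8), ("p7", 8, 7), ("p8", 9, 9)]
E1 = Node(mbr=mbr_of(LEAF1), points=LEAF1)
E2 = Node(mbr=mbr_of(LEAF2), points=LEAF2)
ROOT = Node(mbr=union(E1.mbr, E2.mbr), children=[E1, E2])

if __name__ == "__main__":
    print(range_query(ROOT, (7.5, 6.5, 8.5, 7.5)))
```

With this layout the MBR of `E1` lies entirely outside the window, so the search never inspects its points and examines only the leaf under `E2`.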



2. DATA MINING

2.1 MOTIVATION

The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge

amounts of data and the imminent need for turning such data into useful

information and knowledge. The information and knowledge gained can be used

for applications ranging from business management, production control, and

market analysis, to engineering design and science exploration. Data mining can

be viewed as a result of the natural evolution of information technology. An evolutionary path has been witnessed in the database industry in the

development of the following functionalities:

Data collection and database creation, data management (including data

storage and retrieval, and database transaction processing)

Data analysis and understanding (involving data warehousing and data

mining).

For instance, the early development of data collection and database creation

mechanisms served as a prerequisite for later development of effective

mechanisms for data storage and retrieval, and query and transaction processing.

With numerous database systems offering query and transaction processing as

common practice, data analysis and understanding has naturally become the next

target.

Since the 1960s, database and information technology has been evolving

systematically from primitive processing systems to sophisticated and powerful

database systems. The research and development in database systems since the

1970s has led to the development of relational database systems (where data are

stored in relational table structures), data modeling tools, and indexing and data

organization techniques. In addition, users gained convenient and flexible data

access through query languages, query processing, and user interfaces. Methods

for on-line transaction processing (OLTP), where a query is viewed as a read-only

transaction, have contributed substantially to the evolution and wide acceptance

of relational technology as a major tool for efficient storage, retrieval, and

management of large amounts of data.

Database technology since the mid-1980s has been characterized by the popular

adoption of relational technology and an upsurge of research and development

activities on new and powerful database systems. These employ advanced data

models such as extended-relational, object-oriented, object-relational, and

deductive models. Application-oriented database systems, including spatial,

temporal, multimedia, active, and scientific databases, knowledge bases, and

office information bases, have flourished. Issues related to the distribution,

diversification, and sharing of data have been studied extensively. Heterogeneous

database systems and Internet-based global information systems such as the

World-Wide Web (WWW) also emerged and play a vital role in the information

industry.

The steady and amazing progress of computer hardware technology in the past

three decades has led to powerful, affordable, and large supplies of computers,

data collection equipment, and storage media. This technology provides a great

boost to the database and information industry, and makes a huge number of

databases and information repositories available for transaction management,



information retrieval, and data analysis. Data can now be stored in many different

kinds of databases and information repositories.

One data repository architecture that has emerged is the data warehouse, a

repository of multiple heterogeneous data sources, organized under a unified

schema at a single site in order to facilitate management decision making. Data

warehouse technology includes data cleaning, data integration, and On-Line

Analytical Processing (OLAP), that is, analysis techniques with functionalities such

as summarization, consolidation, and aggregation, as well as the ability to view

information from different angles. Although OLAP tools support multidimensional

analysis and decision making, additional data analysis tools are required for in-

depth analysis, such as data classification, clustering, and the characterization of

data changes over time. In addition, huge volumes of data can be accumulated

beyond databases and data warehouses. Typical examples include the World

Wide Web and data streams, where data flow in and out like streams as in

applications like video surveillance, telecommunication, and sensor networks. The

effective and efficient analysis of data in such different forms becomes a

challenging task. The abundance of data, coupled with the need for powerful data

analysis tools, has been described as a data-rich but information-poor situation.

The fast-growing, tremendous amount of data, collected and stored in large and

numerous data repositories, has far exceeded our human ability for

comprehension without powerful tools.

As a result, data collected in large data repositories become data

tombs: data archives that are seldom visited. Consequently, important decisions

are often made based not on the information-rich data stored in data repositories

but rather on a decision maker's intuition, simply because the decision maker



does not have the tools to extract the valuable knowledge embedded in the vast

amounts of data. In addition, consider expert system technologies, which typically

rely on users or domain experts to manually input knowledge into knowledge

bases. Unfortunately, this procedure is prone to biases and errors, and is

extremely time-consuming and costly. Data mining tools perform data analysis

and may uncover important data patterns, contributing greatly to business

strategies, knowledge bases, and scientific and medical research. The widening

gap between data and information calls for a systematic development of data

mining tools that will turn data tombs into golden nuggets of knowledge.

2.2 WHAT IS DATA MINING?

Simply stated, data mining refers to extracting or mining knowledge from large

amounts of data. The term is actually a misnomer. Remember that the mining of

gold from rocks or sand is referred to as gold mining rather than rock or sand

mining. Thus, data mining should have been more appropriately named

knowledge mining from data, which is unfortunately somewhat long.

Knowledge mining, a shorter term, may not reflect the emphasis on mining from

large amounts of data. Nevertheless, mining is a vivid term characterizing the

process that finds a small set of precious nuggets from a great deal of raw

material. Thus, such a misnomer that carries both data and mining became a

popular choice. Many other terms carry a similar or slightly different meaning to

data mining, such as knowledge mining from data, knowledge extraction,

data/pattern analysis, data archaeology, and data dredging.

Many people treat data mining as a synonym for another popularly used

term, Knowledge Discovery from Data, or KDD. Alternatively, others view data

mining as simply an essential step in the process of knowledge discovery.



Knowledge discovery is a process that consists of an iterative sequence of the

following steps:

1. Data cleaning (to remove noise and inconsistent data)

2. Data integration (where multiple data sources may be combined)

3. Data selection (where data relevant to the analysis task are retrieved from the

database)

4. Data transformation (where data are transformed or consolidated into forms

appropriate for mining by performing summary or aggregation operations, for

instance)

5. Data mining (an essential process where intelligent methods are applied in

order to extract data patterns)

6. Pattern evaluation (to identify the truly interesting patterns representing

knowledge based on some interestingness measures)

7. Knowledge presentation (where visualization and knowledge representation

techniques are used to present the mined knowledge to the user)
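
The seven steps above can be sketched as a toy pipeline. This is only an illustration under assumptions: the transaction data, the function names, the choice of co-occurring item pairs as the pattern type, and support as the interestingness measure are all hypothetical and are not taken from the text.

```python
from collections import Counter
from itertools import combinations


def clean(records):
    """Step 1: drop empty records and records with missing items."""
    return [r for r in records if r and all(item for item in r)]


def integrate(*sources):
    """Step 2: combine several data sources into one collection."""
    return [record for source in sources for record in source]


def select(records, relevant):
    """Step 3: keep only the items relevant to the analysis task."""
    return [[item for item in r if item in relevant] for r in records]


def transform(records):
    """Step 4: consolidate each record into a sorted tuple of unique items."""
    return [tuple(sorted(set(r))) for r in records if r]


def mine_pairs(transactions):
    """Step 5: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(t, 2))
    return counts


def evaluate(counts, n_transactions, min_support):
    """Step 6: keep pairs whose support reaches the threshold."""
    return {pair: c / n_transactions for pair, c in counts.items()
            if c / n_transactions >= min_support}


def present(patterns):
    """Step 7: render the surviving patterns, strongest first."""
    ranked = sorted(patterns.items(), key=lambda kv: -kv[1])
    return [f"{a} & {b}: support {s:.2f}" for (a, b), s in ranked]


STORE_A = [["milk", "bread"], ["milk", "bread", "eggs"], []]
STORE_B = [["bread", "eggs"], ["milk", None], ["milk", "bread", "gum"]]


def run_pipeline(min_support=0.5):
    data = integrate(clean(STORE_A), clean(STORE_B))
    data = transform(select(data, {"milk", "bread", "eggs"}))
    return present(evaluate(mine_pairs(data), len(data), min_support))


if __name__ == "__main__":
    print("\n".join(run_pipeline()))
```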

Steps 1 to 4 are different forms of data preprocessing, where the data are

prepared for mining. The data mining step may interact with the user or a

knowledge base. The interesting patterns are presented to the user, and may be

stored as new knowledge in the knowledge base. Note that according to this

view, data mining is only one step in the entire process, albeit an essential one

because it uncovers hidden patterns for evaluation. We agree that data mining is

a step in the knowledge discovery process. However, in industry, in media, and in

the database research milieu, the term data mining is becoming more popular

than the longer term of knowledge discovery from data. Therefore, in this book,



we choose to use the term data mining. We adopt a broad view of data mining

functionality: data mining is the process of discovering interesting knowledge

from large amounts of data stored in databases, data warehouses, or other

information repositories.

Based on this view, the architecture of a typical data mining system may

have the following major components:

Figure 2.1 Architecture of a typical data mining system.

Database, data warehouse, World Wide Web, or other information

repository: This is one or a set of databases, data warehouses, spreadsheets, or

other kinds of information repositories. Data cleaning and data integration

techniques may be performed on the data.

Database or data warehouse server: The database or data warehouse

server is responsible for fetching the relevant data, based on the user's

data mining request.



Knowledge base: This is the domain knowledge that is used to guide the

search or evaluate the interestingness of resulting patterns. Such

knowledge can include concept hierarchies, used to organize attributes or

attribute values into different levels of abstraction. Knowledge such as user

beliefs, which can be used to assess a pattern's interestingness based on its

unexpectedness, may also be included. Other examples of domain

knowledge are additional interestingness constraints or thresholds, and

metadata (e.g., describing data from multiple heterogeneous sources).

Data mining engine: This is essential to the data mining system and ideally

consists of a set of functional modules for tasks such as characterization,

association and correlation analysis, classification, prediction, cluster

analysis, outlier analysis, and evolution analysis.

Pattern evaluation module: This component typically employs

interestingness measures and interacts with the data mining modules so as

to focus the search toward interesting patterns. It may use interestingness

thresholds to filter out discovered patterns. Alternatively, the pattern

evaluation module may be integrated with the mining module, depending

on the implementation of the data mining method used. For efficient data

mining, it is highly recommended to push the evaluation of pattern

interestingness as deep as possible into the mining process so as to confine

the search to only the interesting patterns.

User interface: This module communicates between users and the data

mining system, allowing the user to interact with the system by specifying a

data mining query or task, providing information to help focus the search,

and performing exploratory data mining based on the intermediate data



mining results. In addition, this component allows the user to browse

database and data warehouse schemas or data structures, evaluate mined

patterns, and visualize the patterns in different forms. From a data

warehouse perspective, data mining can be viewed as an advanced stage of

on-line analytical processing (OLAP). However, data mining goes far beyond

the narrow scope of summarization-style analytical processing of data

warehouse systems by incorporating more advanced techniques for data

analysis.

Although there are many data mining systems on the market, not all of them

can perform true data mining. A data analysis system that does not handle large

amounts of data should be more appropriately categorized as a machine learning

system, a statistical data analysis tool, or an experimental system prototype. A

system that can only perform data or information retrieval, including finding

aggregate values, or that performs deductive query answering in large databases

should be more appropriately categorized as a database system, an information

retrieval system, or a deductive database system.

Data mining involves an integration of techniques from multiple disciplines

such as database and data warehouse technology, statistics, machine learning,

high-performance computing, pattern recognition, neural networks, data

visualization, information retrieval, image and signal processing, and spatial or

temporal data analysis. We adopt a database perspective in our presentation of

data mining in this book. That is, emphasis is placed on efficient and scalable data

mining techniques. For an algorithm to be scalable, its running time should grow

approximately linearly in proportion to the size of the data, given the available

system resources such as main memory and disk space. By performing data



mining, interesting knowledge, regularities, or high-level information can be

extracted from databases and viewed or browsed from different angles. The

discovered knowledge can be applied to decision making, process control,

information management, and query processing. Therefore, data mining is

considered one of the most important frontiers in database and information

systems and one of the most promising interdisciplinary developments in

information technology.

2.3 MAJOR ISSUES IN DATA MINING

The major issues in data mining regarding mining methodology, user interaction,

performance, and diverse data types are introduced below:

Mining methodology and user interaction issues: These reflect the kinds of

knowledge mined, the ability to mine knowledge at multiple granularities, the use

of domain knowledge, ad hoc mining, and knowledge visualization.

Mining different kinds of knowledge in databases: Because different users can

be interested in different kinds of knowledge, data mining should cover a wide

spectrum of data analysis and knowledge discovery tasks, including data

characterization, discrimination, association and correlation analysis,

classification, prediction, clustering, outlier analysis, and evolution analysis (which

includes trend and similarity analysis). These tasks may use the same database in

different ways and require the development of numerous data mining

techniques.

Interactive mining of knowledge at multiple levels of abstraction: Because it is

difficult to know exactly what can be discovered within a database, the data

mining process should be interactive. For databases containing a huge amount of

data, appropriate sampling techniques can first be applied to facilitate interactive



data exploration. Interactive mining allows users to focus the search for patterns,

providing and refining data mining requests based on returned results.

Specifically, knowledge should be mined by drilling down, rolling up, and pivoting

through the data space and knowledge space interactively, similar to what OLAP

can do on data cubes. In this way, the user can interact with the data mining

system to view data and discovered patterns at multiple granularities and from

different angles.

Incorporation of background knowledge: Background knowledge, or information

regarding the domain under study, may be used to guide the discovery process

and allow discovered patterns to be expressed in concise terms and at different

levels of abstraction. Domain knowledge related to databases, such as integrity

constraints and deduction rules, can help focus and speed up a data mining

process, or judge the interestingness of discovered patterns.

Data mining query languages and ad hoc data mining: Relational query

languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a

similar vein, high-level data mining query languages need to be developed to

allow users to describe ad hoc data mining tasks by facilitating the specification of

the relevant sets of data for analysis, the domain knowledge, the kinds of

knowledge to be mined, and the conditions and constraints to be enforced on the

discovered patterns. Such a language should be integrated with a database or

data warehouse query language and optimized for efficient and flexible data

mining.

Presentation and visualization of data mining results: Discovered knowledge

should be expressed in high-level languages, visual representations, or other

expressive forms so that the knowledge can be easily understood and directly



usable by humans. This is especially crucial if the data mining system is to be

interactive. This requires the system to adopt expressive knowledge

representation techniques, such as trees, tables, rules, graphs, charts, crosstabs,

matrices, or curves.

Handling noisy or incomplete data: The data stored in a database may reflect

noise, exceptional cases, or incomplete data objects. When mining data

regularities, these objects may confuse the process, causing the knowledge model

constructed to overfit the data. As a result, the accuracy of the discovered

patterns can be poor. Data cleaning methods and data analysis methods that can

handle noise are required, as well as outlier mining methods for the discovery and

analysis of exceptional cases.

Pattern evaluation (the interestingness problem): A data mining system can

uncover thousands of patterns. Many of the patterns discovered may be

uninteresting to the given user, representing common knowledge or lacking

novelty. Several challenges remain regarding the development of techniques to

assess the interestingness of discovered patterns, particularly with regard to

subjective measures that estimate the value of patterns with respect to a given

user class, based on user beliefs or expectations. The use of interestingness

measures or user-specified constraints to guide the discovery process and reduce

the search space is another active area of research.

Performance issues: These include efficiency, scalability, and parallelization of

data mining algorithms.

Efficiency and scalability of data mining algorithms: To effectively extract

information from a huge amount of data in databases, data mining algorithms

must be efficient and scalable. In other words, the running time of a data mining

algorithm must be predictable and acceptable in large databases. From a



database perspective on knowledge discovery, efficiency and scalability are key

issues in the implementation of data mining systems. Many of the issues

discussed above under mining methodology and user interaction must also

consider efficiency and scalability.

Parallel, distributed, and incremental mining algorithms: The huge size of many

databases, the wide distribution of data, and the computational complexity of

some data mining methods are factors motivating the development of parallel

and distributed data mining algorithms. Such algorithms divide the data into

partitions, which are processed in parallel. The results from the partitions are

then merged. Moreover, the high cost of some data mining processes promotes

the need for incremental data mining algorithms that incorporate database

updates without having to mine the entire data again from scratch. Such

algorithms perform knowledge modification incrementally to amend and

strengthen what was previously discovered.
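
The partition, process, and merge pattern, together with an incremental update, can be sketched as follows. Item counting stands in for a real mining task, and the function names and the use of a thread pool are assumptions made for illustration. In CPython, threads show the structure of the approach rather than delivering a CPU speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Mine one partition: count the transactions containing each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def partitioned_counts(transactions, n_parts=2):
    """Split the data, process the partitions concurrently, merge the results."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the counts together
    return total


def incremental_update(counts, new_transactions):
    """Fold new transactions into existing counts without a full rescan."""
    updated = Counter(counts)
    updated.update(count_items(new_transactions))
    return updated


if __name__ == "__main__":
    base = [["a", "b"], ["a"], ["b", "c"], ["a", "c"]]
    counts = partitioned_counts(base)
    print(counts)
    print(incremental_update(counts, [["a", "d"]]))
```

The incremental result equals what a full recount over the old and new data would produce, which is the property the text asks of incremental algorithms.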

Handling of relational and complex types of data: Because relational databases

and data warehouses are widely used, the development of efficient and effective

data mining systems for such data is important. However, other databases may

contain complex data objects, hypertext and multimedia data, spatial data,

temporal data, or transaction data. It is unrealistic to expect one system to mine

all kinds of data, given the diversity of data types and different goals of data

mining. Specific data mining systems should be constructed for mining specific

kinds of data.



Therefore, one may expect to have different data mining systems for different

kinds of data.

Mining information from heterogeneous databases and global information

systems:

Local- and wide-area computer networks (such as the Internet) connect many

sources of data, forming huge, distributed, and heterogeneous databases. The

discovery of knowledge from different sources of structured, semistructured, or

unstructured data with diverse data semantics poses great challenges to data

mining. Data mining may help disclose high-level data regularities in multiple

heterogeneous databases that are unlikely to be discovered by simple query

systems and may improve information exchange and interoperability in

heterogeneous databases. Web mining, which uncovers interesting knowledge

about Web contents, Web structures, Web usage, and Web dynamics, has become a

very challenging and fast-evolving field in data mining.

The above issues are considered major requirements and challenges for the

further evolution of data mining technology. Some of the challenges have been

addressed in recent data mining research and development, to a certain extent,

and are now considered requirements, while others are still at the research stage.

The issues, however, continue to stimulate further investigation and

improvement.



2.4 TRENDS IN DATA MINING

1. Application exploration: Early data mining applications focused mainly on

helping businesses gain a competitive edge.

2. Scalable and interactive data mining methods: In contrast with traditional data

analysis methods, data mining must be able to handle huge amounts of data

efficiently and, if possible, interactively.

3. Integration of data mining with database systems, data warehouse systems,

and Web database systems: Database systems, data warehouse systems, and the

Web have become mainstream information processing systems.

4. Standardization of data mining language: A standard data mining language or

other standardization efforts will facilitate the systematic development of data

mining solutions, improve interoperability among multiple data mining systems

and functions, and promote the education and use of data mining systems in

industry and society.

5. Visual data mining: Visual data mining is an effective way to discover

knowledge from huge amounts of data.

6. Biological data mining: Although biological data mining can be considered

under application exploration or mining complex types of data, the unique

combination of complexity, richness, size, and importance of biological data

warrants special attention in data mining.

7. Data mining and software engineering: As software programs become

increasingly bulky in size, sophisticated in complexity, and tend to originate from



the integration of multiple components developed by different software teams, it

is an increasingly challenging task to ensure software robustness and reliability.

8. Web mining: Web mining is the application of data mining techniques to discover patterns from the Web. According to analysis targets, Web mining can be

divided into three different types, which are Web usage mining, Web content

mining and Web structure mining.

9. Distributed data mining: Traditional data mining methods, designed to work at

a centralized location, do not work well in many of the distributed computing

environments present today (e.g., the Internet, intranets, local area networks,

high-speed wireless networks, and sensor networks). Advances in distributed data

mining methods are expected.

10. Real-time or time-critical data mining: Many applications involving stream

data (such as e-commerce, Web mining, stock analysis, intrusion detection,

mobile data mining, and data mining for counterterrorism) require dynamic data

mining models to be built in real time. Additional development is needed in this

area.

11. Graph mining, link analysis, and social network analysis: Graph mining, link

analysis, and social network analysis are useful for capturing sequential,

topological, geometric, and other relational characteristics of many scientific data

sets (such as for chemical compounds and biological networks) and social data

sets (such as for the analysis of hidden criminal networks).

12. Multirelational and multidatabase data mining: Most data mining

approaches search for patterns in a single relational table or in a single database.

However, most real-world data and information are spread across multiple tables

and databases.

13. New methods for mining complex types of data: Mining complex types of

data is an important research frontier in data mining. Although progress has been

made in mining stream, time-series, sequence, graph, spatiotemporal,

multimedia, and text data, there is still a huge gap between the needs for these

applications and the available technology.

14. Privacy protection and information security in data mining: An abundance of

recorded personal information available in electronic forms and on the Web,

coupled with increasingly powerful data mining tools, poses a threat to our

privacy and data security.

2.5 APPLICATIONS OF DATA MINING

Data Mining for Financial Data Analysis

A few typical cases:

1. Design and construction of data warehouses for multidimensional data

analysis and data mining

2. Loan payment prediction and customer credit policy analysis

3. Classification and clustering of customers for targeted marketing

4. Detection of money laundering and other financial crimes

Data Mining for the Retail Industry

A few examples of data mining in the retail industry:

1. Design and construction of data warehouses based on the benefits of data

mining

2. Multidimensional analysis of sales, customers, products, time, and region

3. Analysis of the effectiveness of sales campaigns

4. Customer retention: analysis of customer loyalty

5. Product recommendation and cross-referencing of items

Data Mining for the Telecommunication Industry

1. Multidimensional analysis of telecommunication data

2. Fraudulent pattern analysis and the identification of unusual patterns

3. Multidimensional association and sequential pattern analysis

4. Mobile telecommunication services

5. Use of visualization tools in telecommunication data analysis

Data Mining for Biological Data Analysis

1. Semantic integration of heterogeneous, distributed genomic and proteomic

databases

2. Alignment, indexing, similarity search, and comparative analysis of multiple

nucleotide/protein sequences

3. Discovery of structural patterns and analysis of genetic networks and

protein pathways



and attacking networks have prompted intrusion detection to become a critical

component of network administration.

The following are areas in which data mining technology may be applied or

further developed for intrusion detection:

1. Development of data mining algorithms for intrusion detection

2. Association and correlation analysis, and aggregation to help select and

build discriminating attributes

3. Analysis of stream data

4. Distributed data mining

5. Visualization and querying tools

Data Mining System Products and Research Prototypes

Data mining systems should be assessed based on the following multiple features:

1. Data types

2. System issues

3. Data sources

4. Data mining functions and methodologies

5. Coupling data mining with database and/or data warehouse systems

6. Scalability

7. Visualization tools

8. Data mining query language and graphical user interface


Additional Themes on Data Mining: Theoretical Foundations of Data Mining

1. Data reduction: In this theory, the basis of data mining is to reduce the

data representation.

2. Data compression: According to this theory, the basis of data mining is to

compress the given data by encoding in terms of bits, association rules,

decision trees, clusters, and so on.

3. Pattern discovery: In this theory, the basis of data mining is to discover

patterns occurring in the database, such as associations, classification

models, sequential patterns, and so on.

4. Probability theory: This is based on statistical theory. In this theory, the

basis of data mining is to discover joint probability distributions of random

variables, for example, Bayesian belief networks or hierarchical Bayesian

models.

5. Microeconomic view: The microeconomic view considers data mining as

the task of finding patterns that are interesting only to the extent that they

can be used in the decision-making process of some enterprise (e.g.,

regarding marketing strategies and production plans).

6. Inductive databases: According to this theory, a database schema consists

of data and patterns that are stored in the database.
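The pattern discovery view in item 3 can be illustrated with a small sketch. The code below is only an assumed, minimal example and not a method taken from this document: it counts how often pairs of items occur together in a list of transactions and keeps the pairs that reach a chosen support threshold. The function name frequent_pairs, the parameter min_support, and the toy transactions are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that co-occur in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Deduplicate and sort so each pair is counted once per transaction
        # and always appears in the same (alphabetical) order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical toy data, for illustration only.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

pairs = frequent_pairs(transactions, min_support=2)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Raising min_support keeps only the stronger associations. Practical association mining algorithms avoid enumerating every pair in this way, so the sketch only shows the idea of a pattern as something that recurs in the database.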


Statistical Data Mining techniques:

1. Regression

2. Generalized linear model

3. Analysis of variance

4. Mixed effect model

5. Factor analysis

6. Discriminant analysis

7. Time series analysis

8. Survival analysis

9. Quality control
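As an illustration of the first technique in this list, the sketch below fits a straight line to paired observations by ordinary least squares. It is a minimal example under stated assumptions: a single predictor, plain Python lists as input, and a hypothetical function name fit_line. It is not drawn from this document and is not a full regression procedure.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The fitted line can then be used to predict a numeric value for an unseen x, which is the usual role of regression in data mining.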

Visual and Audio Data Mining

Visual data mining discovers implicit and useful knowledge from large data sets

using data and/or knowledge visualization techniques.

In general, data visualization and data mining can be integrated in the following

ways:

1. Data visualization

2. Data mining result visualization

3. Data mining process visualization

4. Interactive visual data mining

