anking spatial data by quality preferences

Upload: lalitha-naidu

Post on 03-Jun-2018

233 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    1/28

    1. INTRODUCTION

    1.1 SPATIAL DATA

    Spatial data also known as geospatial dataorgeographic information it is the data

    or information that identifies the geographic location of features and boundaries

    on Earth, such as natural or constructed features, oceans, and more. Spatial data

    is usually stored as coordinates and topology, and is data that can be mapped.

    Spatial data is often accessed, manipulated or analyzed through Geographic

    Information Systems (GIS).

    According to [Ries1993] spatial information can be divided into three

    primary architectures: data, function, and organization. Data architecture

    describes the activity performed between two types of data. Organization

    architecture describes objects used as input to, or created by, a function.

    Functional architecture describes the mission, policies, and rules, which

    determine and shape the former two. By exploring each of these architectures,

    one can develop a framework to establish spatial design requirements. Spatial

    data can be displayed in different ways: point, line, polygon, surface, volume, and

    pixel. Each of these display mechanisms has three components: location, shape,

    and topology. These components help define the scope of the design

    requirements for each Location Control Management level. Along with data, its

    properties are equally important. Data properties describe the quality, or

    condition of the spatial data.

    A spatial preference query ranks objects based on the qualities of features in their

    spatial neighborhood. For example, using a real estate agency database of flats

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    2/28

    for lease, a customer may want to rank the flats with respect to the

    appropriateness of their location, defined after aggregating the qualities of other

    features (e.g., restaurants, cafes, hospital, market, etc.) within their spatial

    neighborhood. Such a neighborhood concept can be specified by the user via

    different functions. It can be an explicit circular region within a given distance

    from the flat. Another intuitive definition is to consider the whole spatial domain

    and assign higher weights to the features based on their proximity to the flat.

    1.2 SPATIAL QUERY EVALUATION

    In this project, we formally define spatial preference queries and proposeappropriate indexing techniques and search algorithms for them. Extensively

    evaluation of our methods on both real and synthetic data reveals that an

    optimized branch-and-bound solution is efficient and robust with respect to

    different parameters spatial database systems manage large collections of

    geographic entities, which apart from spatial attributes contain non-spatial

    information (e.g., name, size, type, price, etc.). In this project, we study aninteresting type of preference queries, which select the best spatial location with

    respect to the quality of facilities in its spatial neighborhood.

    Given a set D of interesting objects (e.g., candidate locations), a top-k spatial

    preference query retrieves the k objects in D with the highest scores. The score of

    an object is defined by the quality of features (e.g., facilities or services) in its

    spatial neighborhood. As a motivating example, consider a real estate agency

    office that holds a database with available flats for lease. Here feature refers to

    a class of objects in a spatial map such as specific facilities or services.

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    3/28

    A customer may wa

    the quality of their location

    of other features (e.g., res

    neighborhood of the flat (

    subjective and query-para

    respect to non-spatial attri

    seafood, price range, etc.).

    As another example,

    close to a high -quality rest

    FIGURE.1.1 EXAMPLES OF

    Figure 1.1: illustrates the l

    two feature datasets. the

    black. Feature points are l

    rating providers.

    t to rank the contents of this database

    s, quantized by aggregating non spatial

    taurants, cafes, hospital, market, etc.)

    efined by a spatial range around it).

    etric. For example, a user may defin

    utes of restaurants around it (e.g., whet

    the user (e.g., a tourist) wishes to find a

    urant and a high-quality cafe.

    OP-K SPATIAL PREFERENCE QUERIES

    ocations of an object dataset D (hotels)

    et F1 (restaurants) in gray, and the se

    abeled by quality values that can be

    ith respect to

    characteristics

    in the spatial

    uality may be

    e quality with

    her they serve

    hotel p that is

    in white, and

    t F2 (cafes) in

    btained from

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    4/28

    1.3 COMBINATION OF QUALITIES AND RELATIVE LOCATION OF POINTS

    A simple score instance, called the range score, binds the neighborhood

    region to a circular region at p with radius (shown as a circle), and the aggregate

    function to SUM. For instance, the maximum quality of gray and black points

    within the circle of p1 are 0.9 and 0.6 respectively, so the score of p1 is p1 =

    0.9+0.6 = 1.5. Similarly, we obtain p2 = 1.0+0.1 = 1.1 and _ (p3) = 0.7+0.7 = 1.4.

    Hence, the hotel p1 is returned as the top result. In fact, the semantics of the

    aggregate function is relevant to the users query. The SUM function attempts to

    balance the overall qualities of all features. For the MIN function, the top resultbecomes p3, with the score_ (p3) = MIN {0.7, 0.7} = 0.7. It ensures that the top

    result has reasonably high qualities in all features. For the MAX function, the top

    result is p2, with (p2) = MAX {1.0, 0.1} = 1.0. It is used to optimize the quality in a

    particular feature, but not necessarily all of them.

    Object ranking is a popular retrieval task in various applications. In relational

    databases, we rank tuples using an aggregate score function on their attributevalues For example, a real estate agency maintains a database that contains

    information of flats available for rent. A potential customer wishes to view the

    top-10 flats with the largest sizes and lowest prices. In this case, the score of each

    flat is expressed by the sum of two qualities: size and price, after normalization to

    the domain [0, 1] (e.g., 1 means the largest size and the lowest price). In spatial

    databases, ranking is often associated to nearest neighbor (NN) retrieval. Given a

    query location, we are interested in retrieving the set of nearest objects to it that

    qualify a condition (e.g., restaurants).

    Assuming that the set of interesting objects is indexed by an R-tree, we can

    apply distance bounds and traverse the index in a branch-and-bound fashion to

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    5/28

    obtain the answer Nevertheless, it is not always possible to use multi dimensional

    indexes for top-k retrieval. First, such indexes break-down in high dimensional

    spaces.

    Second, top-k queries may involve an arbitrary set of user-specified

    attributes (e.g., size and price) from possible ones (e.g., size, price, distance to the

    beach, number of bedrooms, floor, etc.) And indexes may not be available for all

    possible attribute combinations (i.e., they are too expensive to create and

    maintain).

    Third, information for different rankings to be combined (i.e., for different

    attributes) could appear in different databases (in a distributed database

    scenario) and unified indexes may not exist for them. Solutions for top-k queries

    focus on the efficient merging of object rankings that may arrive from different

    (distributed) sources. Their motivation is to minimize the number of accesses to

    the input rankings until the objects with the top-k aggregate scores have been

    identified. To achieve this, upper and lower bounds for the objects seen so far are

    maintained while scanning the sorted lists. The most popular spatial access

    method is the R-tree, which indexes minimum bounding rectangles (MBRs) of

    objects-trees can efficiently process main spatial query types, including spatial

    range queries, nearest neighbor queries, and spatial joins. Given a spatial region

    W, a spatial range query retrieves from D the objects that intersect W.

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    6/28

    For instance, consid

    shaded area starting fro

    recursively following entri

    instance, e1 does not inters

    FI

    Figure 1.2: shows a set D =

    tree that indexes them.

    er a range query that asks for all obje

    the root of the tree, the query is

    s, having MBRs that intersect the que

    ect the query

    URE 1.2 SPATIAL QUERIES ON R-TREES

    {p1, p2 ...p8} of spatial objects (e.g., poi

    cts within the

    processed by

    ry region. For

    nts) and an R-

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    7/28

    2. DATA MINING

    2.1 MOTIVATION

    The major reason that data mining has attracted a great deal of attention ininformation industry in recent years is due to the wide availability of huge

    amounts of data and the imminent need for turning such data into useful

    information and knowledge. The information and knowledge gained can be used

    for applications ranging from business management, production control, and

    market analysis, to engineering design and science exploration. Data mining can

    be viewed as a result of the natural evolution of information technology. Anevolutionary path has been witnessed in the database industry in the

    development of the following functionalities:

    Data collection and database creation, data management (including data

    storage and retrieval, and database transaction processing)

    Data analysis and understanding (involving data warehousing and data

    mining).

    For instance, the early development of data collection and database creation

    mechanisms served as a prerequisite for later development of elective

    mechanisms for data storage and retrieval, and query and transaction processing.

    With numerous database systems opening query and transaction processing as

    common practice, data analysis and understanding has naturally become the next

    target.

    Since the 1960's, database and information technology has been evolving

    systematically from primitive processing systems to sophisticated and powerful

    databases systems. The research and development in database systems .since the

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    8/28

    1970's has led to the development of relational database systems (where data are

    stored in relational table structures, data modeling tools, and indexing and data

    organization techniques. In addition, users gained convenient and flexible data

    access through query languages, query processing, and user interfaces methods

    for on-line transaction processing (OLTP), where a query is viewed as a read-only

    transaction, have contributed substantially to the evolution and wide acceptance

    of relational technology as a major tool for ancient storage, retrieval, and

    management of large amounts of data.

    Database technology since the mid-1980s has been characterized by the popular

    adoption of relational technology and an upsurge of research and development

    activities on new and powerful database systems. These employ advanced data

    models such as extended-relational, object-oriented, object-relational, and

    deductive models Application oriented database systems, including spatial,

    temporal, multimedia, active, and scientific databases, knowledge bases, and

    once information bases, have flourished. Issues related to the distribution,

    diversification, and sharing of data have been studied extensively. Heterogeneous

    database systems and Internet-based global information systems such as the

    World-Wide Web (WWW) also emerged and play a vital role in the information

    industry.

    The steady and amazing progress of computer hardware technology in the past

    three decades has led to powerful, adorable, and large supplies of computers,

    data collection equipment, and storage media. This technology provides a great

    boost to the database and information industry, and makes a huge number of

    databases and information repositories available for transaction management,

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    9/28

    information retrieval, and data analysis. Data can now be stored in many different

    kinds of databases and information repositories.

    One data repository architecture that has emerged is the data warehouse, a

    repository of multiple heterogeneous data sources, organized under a unified

    schema at a single site in order to facilitate management decision making. Data

    warehouse technology includes data cleaning, data integration, and On-Line

    Analytical Processing (OLAP), that is, analysis techniques with functionalities such

    as summarization, consolidation, and aggregation, as well as the ability to view

    information from different angles. Although OLAP tools support multidimensional

    analysis and decision making, additional data analysis tools are required for in-

    depth analysis, such as data classification, clustering, and the characterization of

    data changes over time. In addition, huge volumes of data can be accumulated

    beyond databases and data warehouses. Typical examples include the World

    Wide Web and data streams, where data flow in and out like streams as in

    applications like video surveillance, telecommunication, and sensor networks. The

    effective and efficient analysis of data in such different forms becomes a

    challenging task. The abundance of data, coupled with the need for powerful data

    analysis tools, has been described as a data rich but information poorsituation.

    The fast-growing, tremendous amount of data, collected and stored in large and

    numerous data repositories, has far exceeded our human ability for

    comprehension without powerful tools.

    As a result, data collected in large data repositories become data

    tombsdata archives that are seldom visited. Consequently, important decisions

    are often made based not on the information-rich data stored in data repositories

    but rather on a decision makers intuition, simply because the decision maker

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    10/28

    does not have the tools to extract the valuable knowledge embedded in the vast

    amounts of data. In addition, consider expert system technologies, which typically

    rely on users or domain experts to manually input knowledge into knowledge

    bases. Unfortunately, this procedure is prone to biases and errors, and is

    extremely time-consuming and costly. Data mining tools perform data analysis

    and may uncover important data patterns, contributing greatly to business

    strategies, knowledge bases, and scientific and medical research. The widening

    gap between data and information calls for a systematic development ofdata

    mining tools that will turn data tombs into golden nuggets of knowledge.

    2.2 WHAT IS DATA MINING?

    Simply stated, data mining refers to extracting or mining knowledge from large

    amounts of data. The term is actually a misnomer. Remember that the mining of

    gold from rocks or sand is referred to as gold mining rather than rock or sand

    mining. Thus, data mining should have been more appropriately named

    knowledge mining from data, which is unfortunately somewhat long.

    Knowledge mining, a shorter term may not reflect the emphasis on mining from

    large amounts of data. Nevertheless, mining is a vivid term characterizing the

    process that finds a small set of precious nuggets from a great deal of raw

    material Thus, such a misnomer that carries both data and mining became a

    popular choice. Many other terms carry a similar or slightly different meaning to

    data mining, such as knowledge mining from data, knowledge extraction,

    data/pattern analysis, data archaeology, and data dredging.

    Many people treat data mining as a synonym for another popularly used

    term, Knowledge Discovery from Data, or KDD. Alternatively, others view data

    mining as simply an essential step in the process of knowledge discovery.

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    11/28

    Knowledge discovery as a process and consists of an iterative sequence of the

    following steps:

    1. Data cleaning (to remove noise and inconsistent data)

    2. Data integration (where multiple data sources may be combined)

    3. Data selection (where data relevant to the analysis task are retrieved from the

    database)

    4. Data transformation (where data are transformed or consolidated into forms

    appropriate for mining by performing summary or aggregation operations, for

    instance)

    5. Data mining (an essential process where intelligent methods are applied in

    order to extract data patterns)

    6. Pattern evaluation (to identify the truly interesting patterns representing

    knowledge based on some interestingness measures

    7. Knowledge presentation (where visualization and knowledge representation

    techniques are used to present the mined knowledge to the user)

    Steps 1 to 4 are different forms of data preprocessing, where the data are

    prepared for mining. The data mining step may interact with the user or a

    knowledge base. The interesting patterns are presented to the user, and may be

    stored as new knowledge in the knowledge base. Note that according to this

    view, data mining is only one step in the entire process, albeit an essential one

    because it uncovers hidden patterns for evaluation. We agree that data mining is

    a step in the knowledge discovery process. However, in industry, in media, and in

    the database research milieu, the term data mining is becoming more popular

    than the longer term of knowledge discovery from data. Therefore, in this book,

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    12/28

    we choose to use the term

    functionality: data mining

    from large amounts of d

    information repositories.

    Based on this view,

    have the following major co

    Figure 2.1 Architecture

    Database, data war

    repository:

    This is one or

    other kinds of inform

    techniques may be p

    Database or data

    server is responsible

    data mining request.

    data mining. We adopt a broad view

    is the process of discovering interesti

    ta stored in databases, data wareho

    he architecture of a typical data minin

    mponents

    f a typical data mining system.

    e house, World Wide Web, or othe

    set of databases, data warehouses, sp

    ation repositories. Data cleaning and d

    rformed on the data.

    arehouse server: The database or da

    for fetching the relevant data, based

    f data mining

    ng knowledge

    ses, or other

    g system may

    r information

    readsheets, or

    ta integration

    ta warehouse

    on the users

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    13/28

    Knowledge base: This is the domain knowledge that is used to guide the

    search or evaluate the interestingness of resulting patterns. Such

    knowledge can include concept hierarchies, used to organize attributes or

    attribute values into different levels of abstraction. Knowledge such as user

    beliefs, which can be used to assess a patterns interestingness based on its

    unexpectedness, may also be included. Other examples of domain

    knowledge are additional interestingness constraints or thresholds, and

    metadata (e.g., describing data from multiple heterogeneous sources).

    Data mining engine: This is essential to the data mining system and ideally

    consists of a set of functional modules for tasks such as characterization,

    association and correlation analysis, classification, prediction, cluster

    analysis, outlier analysis, and evolution analysis.

    Pattern evaluation module: This component typically employs

    interestingness measures and interacts with the data mining modules so as

    to focus the search toward interesting patterns. It may use interestingness

    thresholds to filter out discovered patterns. Alternatively, the pattern

    evaluation module may be integrated with the mining module, depending

    on the implementation of the data mining method used. For efficient data

    mining, it is highly recommended to push the evaluation of pattern

    interestingness as deep as possible into the mining process so also confine

    the search to only the interesting patterns.

    User interface: This module communicates between users and the data

    mining system, allowing the user to interact with the system by specifying a

    data mining query or task, providing information to help focus the search,

    and performing exploratory data mining based on the intermediate data

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    14/28

    mining results. In addition, this component allows the user to browse

    database and data warehouse schemas or data structures, evaluate mined

    patterns, and visualize the patterns in different forms. From a data

    warehouse perspective, data mining can be viewed as an advanced stage of

    on-line analytical processing (OLAP). However, data mining goes far beyond

    the narrow scope of summarization-style analytical processing of data

    warehouse systems by incorporating more advanced techniques for data

    analysis.

    Although there are many data mining systems on the market, not all of them

    can perform true data mining. A data analysis system that does not handle large

    amounts of data should be more appropriately categorized as a machine learning

    system, a statistical data analysis tool, or an experimental system prototype. A

    system that can only perform data or information retrieval, including finding

    aggregate values, or that performs deductive query answering in large databases

    should be more appropriately categorized as a database system, an information

    retrieval system, or a deductive database system.

    Data mining involves an integration of techniques from multiple disciplines

    such as database and data warehouse technology, statistics, machine learning,

    high-performance computing, pattern recognition, neural networks, data

    visualization, information retrieval, image and signal processing, and spatial or

    temporal data analysis. We adopt a database perspective in our presentation of

    data mining in this book. That is, emphasis is placed on efficient and scalable data

    mining techniques. For an algorithm to be scalable, its running time should grow

    approximately linearly in proportion to the size of the data, given the available

    system resources such as main memory and disk space. By performing data

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    15/28

    mining, interesting knowledge, regularities, or high-level information can be

    extracted from databases and viewed or browsed from different angles. The

    discovered knowledge can be applied to decision making, process control,

    information management, and query processing. Therefore, data mining is

    considered one of the most important frontiers in database and information

    systems and one of the most promising interdisciplinary developments in the

    information technology.

    2.3 MAJOR ISSUES IN DATA MINING

    The major issues in data mining regarding mining methodology, user interaction,

    performance, and diverse data types are introduced below:

    Mining methodology and user interaction issues: These reflect the kinds of

    knowledge mined the ability to mine knowledge at multiple granularities, the use

    of domain knowledge, ad hoc mining, and knowledge visualization.

    Mining different kinds of knowledge in databases: Because different users can

    be interested in different kinds of knowledge, data mining should cover a wide

    spectrum of data analysis and knowledge discovery tasks, including data

    characterization, discrimination, association and correlation analysis,

    classification, prediction, clustering, outlier analysis, and evolution analysis (which

    includes trend and similarity analysis). These tasks may use the same database in

    different ways and require the development of numerous data mining

    techniques.

    Interactive mining of knowledge at multiple levels of abstraction: Because it is

    difficult to know exactly what can be discovered within a database, the data

    mining process should be interactive. For databases containing a huge amount of

    data, appropriate sampling techniques can first be applied to facilitate interactive

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    16/28

    data exploration. Interactive mining allows users to focus the search for patterns,

    providing and refining data mining requests based on returned results.

    Specifically, knowledge should be mined by drilling down, rolling up, and pivoting

    through the data space and knowledge space interactively, similar to what OLAP

    can do on data cubes. In this way, the user can interact with the data mining

    system to view data and discovered patterns at multiple granularities and from

    different angles.

    Incorporation of background knowledge: Background knowledge, or information

    regarding the domain under study, may be used to guide the discovery process

    and allow discovered patterns to be expressed in concise terms and at different

    levels of abstraction. Domain knowledge related to databases, such as integrity

    constraints and deduction rules, can help focus and speed up a data mining

    process, or judge the interestingness of discovered patterns.

    Data mining query languages and ad hoc data mining: Relational query

    languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a

    similar vein, high-level data mining query languages need to be developed to

    allow users to describe ad hoc data mining tasks by facilitating the specification of

    the relevant sets of data for analysis, the domain knowledge, the kinds of

    knowledge to be mined, and the conditions and constraints to be enforced on the

    discovered patterns. Such a language should be integrated with a database or

    data warehouse query language and optimized for efficient and flexible data

    mining.

    Presentation and visualization of data mining results: Discovered knowledge

    should be expressed in high-level languages, visual representations, or other

    expressive forms so that the knowledge can be easily understood and directly

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    17/28

    usable by humans. This is especially crucial if the data mining system is to be

    interactive. This requires the system to adopt expressive knowledge

    representation techniques, such as trees, tables, rules, graphs, charts, crosstabs,

    matrices, or curves.

    Handling noisy or incomplete data: The data stored in a database may reflect

    noise, exceptional cases, or incomplete data objects. When mining data

    regularities, these objects may confuse the process, causing the knowledge model

    constructed to over fit the data. As a result, the accuracy of the discovered

    patterns can be poor .Data cleaning methods and data analysis methods that can

    handle noise are required, as well as outlier mining methods for the discovery and

    analysis of exceptional cases.

    Pattern evaluationthe interestingness problem: A data mining system can

    uncover thousands of patterns. Many of the patterns discovered may be

    uninteresting to the given user, representing common knowledge or lacking

    novelty. Several challenges discovered patterns, particularly with regard to

    subjective measures that estimate the value of patterns with respect to a given

    user class, based on user beliefs or expectations. The use of interestingness

    measures or user-specified constraints to guide the discovery process and reduce

    the search space is another active area of research.

    Performance issues: These include efficiency, scalability, and parallelization of

    data mining algorithms.

    Efficiency and scalability of data mining algorithms: To effectively extract

    information from a huge amount of data in databases, data mining algorithms

    must be efficient and scalable. In other words, the running time of a data mining

    algorithm must be predictable and acceptable in large databases. From a

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    18/28

    database perspective on knowledge discovery, efficiency and scalability are key

    issues in the implementation of data mining systems. Many of the issues

    discussed above under mining methodology and user interaction must also

    consider efficiency and scalability.

    Parallel, distributed, and incremental mining algorithms: The huge size of many

    databases, the wide distribution of data, and the computational complexity of

    some data mining methods are factors motivating the development of parallel

    and distributed data mining algorithms. Such algorithms divide the data into

    partitions, which are processed in parallel. The results from the partitions are

    then merged .Moreover, the high cost of some data mining processes promotes

    the need for incremental data mining algorithms that incorporate database

    updates without having to mine the entire data again from scratch. Such

    algorithms perform knowledge modification incrementally to amend and

    strengthen what was previously discovered remain regarding the development of

    techniques to assess the interestingness

    Handling of relational and complex types of data: Because relational databases

    and data warehouses are widely used, the development of efficient and effective

    data mining systems for such data is important. However, other databases may

    contain complex data objects, hypertext and multimedia data, spatial data,

    temporal data, or transaction data. It is unrealistic to expect one system to mine

    all kinds of data, given the diversity of data types and different goals of data

    mining. Specific data mining systems should be constructed for mining specific

    kinds of data.

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    19/28

    Therefore, one may expect to have different data mining systems for different

    kinds of data.

    Mining information from heterogeneous databases and global information

    systems:

    Local- and wide-area computer networks (such as the Internet) connect many

    sources of data, forming huge, distributed, and heterogeneous databases. The

    discovery of knowledge from different sources of structured, semi structured, or

    unstructured data with diverse data semantics poses great challenges to data

    mining. Data mining may help disclose high-level data regularities in multiple

    heterogeneous databases that are unlikely to be discovered by simple query

    systems and may improve information exchange and interoperability in

    heterogeneous databases. Web mining, which uncovers interesting knowledge

    about Web contents, Web structures, Web usage, and Web dynamics, becomes a

    very challenging and fast evolving field in data mining.

    The above issues are considered major requirements and challenges for the

    further evolution of data mining technology. Some of the challenges have been

    addressed in recent data mining research and development, to a certain extent,

    and are now considered requirements, while others are still at the research stage.

    The issues, however, continue to stimulate further investigation and

    improvement.

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    20/28

    2.4 TRENDS IN DATA MINING

    1. Application exploration: Early data mining applications focused mainly on

    helping businesses gain a competitive edge.

    2. Scalable and interactive data mining methods: In contrast with traditional data

    analysis methods, data mining must be able to handle huge amounts of data

    efficiently and, if possible, interactively.

    3. Integration of data mining with database systems, data warehouse systems,

    and Web database systems: Database systems, data warehouse systems, and the

    Web have become mainstream information processing systems.

    4. Standardization of data mining language: A standard data mining language or

    other standardization efforts will facilitate the systematic development of data

    mining solutions, improve interoperability among multiple data mining systems

    and functions, and promote the education and use of data mining systems in

    industry and society.

    5. Visual data mining: Visual data mining is an effective way to discover

    knowledge from huge amounts of data

    6. Biological data mining: Although biological data mining can be considered

    under application exploration or mining complex types of data, the unique

    combination of complexity, richness, size, and importance of biological data

    warrants special attention in data mining.

    7. Data mining and software engineering: As software programs become

    increasingly bulky in size, sophisticated in complexity, and tend to originate from

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    21/28

    the integration of multiple components developed by different software teams, it

    is an increasingly challenging task to ensure software robustness and reliability.

    8. Web mining: Web mining is the application of data mining techniques todiscover patterns from the Web. According to analysis targets, web mining can be

    divided into three different types, which are Web usage mining, Web content

    mining and Web structure mining.

    9. Distributed data mining: Traditional data mining methods, designed to work at

    a centralized location, do not work well in many of the distributed computing

    environments present today (e.g., the Internet, intranets, local area networks,

    high-speed wireless networks, and sensor networks). Advances in distributed data

    mining methods are expected.

    10. Real-time or time-critical data mining: Many applications involving stream

    data (such as e-commerce, Web mining, stock analysis, intrusion detection,

    mobile data mining, and data mining for counterterrorism) require dynamic data

    mining models to be built in real time. Additional development is needed in this

    area.

    11.Graph mining, link analysis, and social network analysis: Graph mining, link

    analysis, and social network analysis are useful for capturing sequential,

    topological, geometric, and other relational characteristics of many scientific data

    sets (such as for chemical compounds and biological networks) and social data

    sets (such as for the analysis of hidden criminal networks)

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    22/28

    12. Multi relational and multi database data mining: Most data mining

    approaches search for patterns in a single relational table or in a single database.

    However, most real world data and information are spread across multiple tables

    and databases.

    13. New methods for mining complex types of data: mining complex types of

    data is an important research frontier in data mining. Although progress has been

    made in mining stream, time-series, sequence, graph, spatiotemporal,

    multimedia, and text data, there is still a huge gap between the needs for these

    applications and the available technology.

    14. Privacy protection and information security in data mining: An abundance of

    recorded personal information available in electronic forms and on the Web,

    coupled with increasingly powerful data mining tools, poses a threat to our

    privacy and data security

    2.5 APPLICATIONS OF DATA MINING

    Data Mining for Financial Data Analysis few typical cases:

    1. Design and construction of data warehouses for multidimensional data

    analysis and data mining

    2. Loan payment prediction and customer credit policy analysis

    3. Classification and clustering of customers for targeted marketing

    4. Detection of money laundering and other financial crimes

    5. Data Mining for the Retail Industry

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    23/28

    A few examples ofdata mining in the retail industry:

    1. Design and construction of data warehouses based on the benefits of data

    mining2. Multidimensional analysis of sales, customers, products, time, and region

    3. Analysis of the effectiveness of sales campaigns

    4. Customer retentionanalysis of customer loyalty

    5. Product recommendation and cross-referencing of items

    Data Mining for the Telecommunication Industry

    1. Multidimensional analysis of telecommunication data

    2. Fraudulent pattern analysis and the identification of unusual patterns

    3. Multidimensional association and sequential pattern analysis:

    4. Mobile telecommunication services

    5. Use of visualization tools in telecommunication data analysis

    Data Mining for Biological Data Analysis

    1. Semantic integration of heterogeneous, distributed genomic and proteomic

    databases

    2. Alignment, indexing, similarity search, and comparative analysis of multiple

    nucleotide/

    protein sequences

    3. Discovery of structural patterns and analysis of genetic networks and

    protein pathways

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    24/28

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    25/28

    and attacking networks have prompted intrusion detection to become a critical

    component of network administration.

    The following are areas in which data mining technology may be applied orfurther developed for intrusion detection:

    1. Development of data mining algorithms for intrusion detection

    2. Association and correlation analysis, and aggregation to help select and

    build discriminating attributes

    3. Analysis of stream data

    4. Distributed data mining

    5. Visualization and querying tools

    Data Mining System Products and Research Prototypes data mining systems

    should be assessed based on the following multiple features:

    1. Data types

    2. System issues

    3. Data sources

    4. Data mining functions and methodologies.

    5. Coupling data mining with database and/or data warehouse systems.

    6. Scalability

    7. Visualization tools

    8. Data mining query language and graphical user interface:

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    26/28

    Additional Themes on Data Mining: Theoretical Foundations of Data Mining

    1. Data reduction: In this theory, the basis of data mining is to reduce the

    data representation2. Data compression: According to this theory, the basis of data mining is to

    compress the given data by encoding in terms of bits, association rules,

    decision trees, clusters, and so on

    3. Pattern discovery: In this theory, the basis of data mining is to discover

    patterns occurring in the database, such as associations, classification

    models, sequential patterns, and so on4. Probability theory: This is based on statistical theory. In this theory, the

    basis of data mining is to discover joint probability distributions of random

    variables, for example, Bayesian belief networks or hierarchical Bayesian

    models.

    5. Microeconomic view: The microeconomic view considers data mining as

    the task of finding patterns that are interesting only to the extent that theycan be used in the decision making process of some enterprise (e.g.,

    regarding marketing strategies and production plans).

    6. Inductive databases: According to this theory, a database schema consists

    of data and patterns that are stored in the database.

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    27/28

    Statistical Data Mining techniques:

    1. Regression

    2. Generalized linear model3. Analysis of variance

    4. Mixed effect model

    5. Factor analysis

    6. Discriminate analysis

    7. Time series analysis

    8. Survival analysis9. Quality control

    Visual and Audio Data Mining

    Visual data mining discovers implicit and useful knowledge from large data sets

    using data and/or knowledge visualization techniques.

    In general, data visualization and data mining can be integrated in the following

    ways:

    1. Data visualization

    2. Data mining result visualization

    3. Data mining process visualization

    4. Interactive visual data mining

  • 8/12/2019 ANKING SPATIAL DATA BY QUALITY PREFERENCES

    28/28