ieee final year projects 2011-2012 :: elysium technologies pvt ltd::knowledge and data engineering...

Elysium Technologies Private Limited ISO 9001:2008 A leading Research and Development Division Madurai | Chennai | Trichy | Coimbatore | Kollam| Singapore Website: elysiumtechnologies.com, elysiumtechnologies.info Email: [email protected]

IEEE Final Year Project List 2011-2012

Madurai Elysium Technologies Private Limited

230, Church Road, Annanagar,

Madurai , Tamilnadu – 625 020.

Contact : 91452 4390702, 4392702,

4394702.

eMail: [email protected]

Trichy Elysium Technologies Private Limited

3rd Floor, SI Towers,

15 ,Melapudur , Trichy,

Tamilnadu – 620 001.

Contact : 91431 - 4002234.


Kollam Elysium Technologies Private Limited

Surya Complex,Vendor junction,

kollam,Kerala – 691 010.

Contact : 91474 2723622.


Abstract

Knowledge and Data Engineering 2011-2012

01 A Dual Framework and Algorithms for Targeted Online Data Delivery

A variety of emerging online data delivery applications challenge existing techniques for data delivery to human

users, applications, or middleware that are accessing data from multiple autonomous servers. In this paper, we

develop a framework for formalizing and comparing pull-based solutions and present dual optimization

approaches. The first approach, most commonly used nowadays, maximizes user utility under the strict setting

of meeting a priori constraints on the usage of system resources. We present an alternative and more flexible

approach that maximizes user utility by satisfying all users. It does this while minimizing the usage of system

resources. We discuss the benefits of this latter approach and develop an adaptive monitoring solution, Satisfy

User Profiles (SUP). Through formal analysis, we identify sufficient optimality conditions for SUP. Using real

(RSS feeds) and synthetic traces, we empirically analyze the behavior of SUP under varying conditions. Our

experiments show that we can achieve a high degree of satisfaction of user utility when the estimations of SUP

closely estimate the real event stream, and that SUP has the potential to save a significant amount of system resources.

We further show that SUP can exploit feedback to improve user utility with only a moderate increase in resource

utilization.

02 A Fast Multiple Longest Common Subsequence (MLCS) Algorithm

Finding the longest common subsequence (LCS) of multiple strings is an NP-hard problem, with many

applications in the areas of bioinformatics and computational genomics. Although significant efforts have been

made to address the problem and its special cases, the increasing complexity and size of biological data require

more efficient methods applicable to an arbitrary number of strings. In this paper, we present a new algorithm

for the general case of the multiple LCS (MLCS) problem, i.e., finding an LCS of any number of strings, and its

parallel realization. The algorithm is based on the dominant point approach and employs a fast divide-and-conquer

technique to compute the dominant points. When applied to a case of three strings, our algorithm

demonstrates the same performance as the fastest existing MLCS algorithm designed for that specific case.

When applied to more than three strings, our algorithm is significantly faster than the best existing sequential

methods, running up to 2-3 orders of magnitude faster on large-size problems. Finally, we present an

efficient parallel implementation of the algorithm. Evaluating the parallel algorithm on a benchmark set of both

random and biological sequences reveals a near-linear speedup with respect to the sequential algorithm.
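To make the problem statement concrete, here is a minimal Python sketch of the textbook dynamic-programming recurrence for the LCS length of any number of strings. It is not the dominant-point algorithm described in the abstract; it is only the naive baseline, whose state space grows with the product of the string lengths. The function name `mlcs_length` is a hypothetical choice for illustration.

```python
from functools import lru_cache


def mlcs_length(strings):
    """Length of a longest common subsequence of all strings (naive DP baseline)."""
    strings = tuple(strings)
    if not strings or any(not s for s in strings):
        return 0

    @lru_cache(maxsize=None)
    def solve(positions):
        # positions[i] is the length of the prefix of strings[i] still considered.
        if any(p == 0 for p in positions):
            return 0
        last_chars = {s[p - 1] for s, p in zip(strings, positions)}
        if len(last_chars) == 1:
            # Every prefix ends in the same character, so it extends the LCS.
            return 1 + solve(tuple(p - 1 for p in positions))
        # Otherwise shorten one prefix at a time and keep the best result.
        return max(
            solve(positions[:i] + (positions[i] - 1,) + positions[i + 1:])
            for i in range(len(positions))
        )

    return solve(tuple(len(s) for s in strings))
```

The recursion is only practical for short inputs, which is the motivation for faster methods such as the one the abstract proposes.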

03 A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification

Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In

this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in

the feature vector of a document set are grouped into clusters, based on a similarity test. Words that are similar to

each other are grouped into the same cluster. Each cluster is characterized by a membership function with

statistical mean and deviation. When all the words have been fed in, a desired number of clusters are formed


automatically. We then have one extracted feature for each cluster. The extracted feature, corresponding to a

cluster, is a weighted combination of the words contained in the cluster. By this algorithm, the derived

membership functions match closely with and describe properly the real distribution of the training data.

Besides, the user need not specify the number of extracted features in advance, and trial-and-error for

determining the appropriate number of extracted features can then be avoided. Experimental results show that

our method can run faster and obtain better extracted features than other methods.
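The self-constructing idea can be sketched as a single pass over the word patterns: a pattern joins the cluster in which its membership is highest, or starts a new cluster when no membership reaches a threshold. The Python sketch below is a simplified illustration under assumptions that do not come from the abstract: the Gaussian-style membership form, the `threshold` and `initial_dev` parameters, and the way the deviation is updated are all hypothetical choices.

```python
import math


def membership(x, mean, dev):
    """Gaussian-style membership of pattern x in a cluster (assumed form)."""
    return math.exp(-sum(((xi - m) / d) ** 2 for xi, m, d in zip(x, mean, dev)))


def self_constructing_clusters(patterns, threshold=0.5, initial_dev=0.25):
    """One-pass clustering; the number of clusters is not fixed in advance."""
    clusters = []
    for x in patterns:
        scored = [(membership(x, c["mean"], c["dev"]), c) for c in clusters]
        best_mu, best = max(scored, key=lambda t: t[0], default=(0.0, None))
        if best is None or best_mu < threshold:
            # No existing cluster is similar enough: start a new one.
            clusters.append({
                "n": 1,
                "sum": list(x),
                "sumsq": [v * v for v in x],
                "mean": list(x),
                "dev": [initial_dev] * len(x),
            })
        else:
            # Update the statistical mean and deviation of the best cluster.
            best["n"] += 1
            n = best["n"]
            best["sum"] = [s + v for s, v in zip(best["sum"], x)]
            best["sumsq"] = [s + v * v for s, v in zip(best["sumsq"], x)]
            best["mean"] = [s / n for s in best["sum"]]
            best["dev"] = [
                math.sqrt(max(sq / n - m * m, 0.0)) + initial_dev
                for sq, m in zip(best["sumsq"], best["mean"])
            ]
    return clusters
```

With four two-dimensional patterns that fall into two obvious groups, the sketch forms two clusters without being told the number in advance.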

04 A Generic Multilevel Architecture for Time Series Prediction

Rapidly evolving businesses generate massive amounts of time-stamped data sequences and cause a demand

for both univariate and multivariate time series forecasting. For such data, traditional predictive models based

on autoregression are often not sufficient to capture complex nonlinear relationships between multidimensional

features and the time series outputs. In order to exploit these relationships for improved time series forecasting

while also better dealing with a wider variety of prediction scenarios, a forecasting system requires a flexible

and generic architecture to accommodate and tune various individual predictors as well as combination

methods. In response to this challenge, an architecture for combined, multilevel time series prediction is proposed,

which is suitable for many different universal regressors and combination methods. The key strength of this

architecture is its ability to build a diversified ensemble of individual predictors that form an input to a multilevel

selection and fusion process before the final optimized output is obtained. Excellent generalization ability is

achieved due to the highly boosted complementarity of individual models further enforced through cross-

validation-linked training on exclusive data subsets and ensemble output postprocessing. In a sample

configuration with basic neural network predictors and a mean combiner, the proposed system has been

evaluated in different scenarios and showed a clear prediction performance gain.

05 A Link Analysis Extension of Correspondence Analysis for Mining Relational Databases

This work introduces a link analysis procedure for discovering relationships in a relational database or a graph,

generalizing both simple and multiple correspondence analysis. It is based on a random walk model through the

database defining a Markov chain having as many states as elements in the database. Suppose we are

interested in analyzing the relationships between some elements (or records) contained in two different tables of

the relational database. To this end, in a first step, a reduced, much smaller, Markov chain containing only the

elements of interest and preserving the main characteristics of the initial chain, is extracted by stochastic

complementation [41]. This reduced chain is then analyzed by projecting jointly the elements of interest in the

diffusion map subspace [42] and visualizing the results. This two-step procedure reduces to simple

correspondence analysis when only two tables are defined, and to multiple correspondence analysis when the

database takes the form of a simple star-schema. On the other hand, a kernel version of the diffusion map

distance, generalizing the basic diffusion map distance to directed graphs, is also introduced and the links with

spectral clustering are discussed. Several data sets are analyzed by using the proposed methodology, showing

the usefulness of the technique for extracting relationships in relational databases or graphs.

06 A Machine Learning Approach for Identifying Disease-Treatment Relations in Short Texts

The Machine Learning (ML) field has gained momentum in almost every domain of research and just recently has

become a reliable tool in the medical domain. The empirical domain of automatic learning is used in tasks such as

medical decision support, medical imaging, protein-protein interaction, extraction of medical knowledge, and for

overall patient management care. ML is envisioned as a tool by which computer-based systems can be integrated


in the healthcare field in order to provide better, more efficient medical care. This paper describes an ML-based

methodology for building an application that is capable of identifying and disseminating healthcare information. It

extracts sentences from published medical papers that mention diseases and treatments, and identifies semantic

relations that exist between diseases and treatments. Our evaluation results for these tasks show that the

proposed methodology obtains reliable outcomes that could be integrated in an application to be used in the

medical care domain. The potential value of this paper lies in the ML settings that we propose and in the fact

that we outperform previous results on the same data set.

07 A Personalized Ontology Model for Web Information Gathering

As a model for knowledge description and formalization, ontologies are widely used to represent user profiles in

personalized web information gathering. However, when representing user profiles, many models have utilized

only knowledge from either a global knowledge base or a user's local information. In this paper, a personalized

ontology model is proposed for knowledge representation and reasoning over user profiles. This model learns

ontological user profiles from both a world knowledge base and user local instance repositories. The ontology

model is evaluated by comparing it against benchmark models in web information gathering. The results show that

this ontology model is successful.

08 Adaptive Cluster Distance Bounding for High-Dimensional Indexing

We consider approaches for similarity search in correlated, high-dimensional data sets, which are derived within a

clustering framework. We note that indexing by “vector approximation” (VA-File), which was proposed as a

technique to combat the “Curse of Dimensionality,” employs scalar quantization, and hence necessarily ignores

dependencies across dimensions, which represents a source of suboptimality. Clustering, on the other hand,

exploits interdimensional correlations and is thus a more compact representation of the data set. However, existing

methods to prune irrelevant clusters are based on bounding hyperspheres and/or bounding rectangles, whose lack

of tightness compromises their efficiency in exact nearest neighbor search. We propose a new cluster-adaptive

distance bound based on separating hyperplane boundaries of Voronoi clusters to complement our cluster-based

index. This bound enables efficient spatial filtering with a relatively small preprocessing storage overhead, and is

applicable to Euclidean and Mahalanobis similarity measures. Experiments in exact nearest-neighbor set retrieval,

conducted on real data sets, show that our indexing method is scalable with data set size and data dimensionality

and outperforms several recently proposed indexes. Relative to the VA-File, over a wide range of quantization

resolutions, it is able to reduce random IO accesses, given (roughly) the same amount of sequential IO operations,

by factors reaching 100X and more.

09 Anonymous Publication of Sensitive Transactional Data

Existing research on privacy-preserving data publishing focuses on relational data: in this context, the objective is

to enforce privacy-preserving paradigms, such as k-anonymity and ℓ-diversity, while minimizing the information

loss incurred in the anonymizing process (i.e., maximize data utility). Existing techniques work well for fixed-

schema data, with low dimensionality. Nevertheless, certain applications require privacy-preserving publishing of

transactional data (or basket data), which involve hundreds or even thousands of dimensions, rendering existing

methods unusable. We propose two categories of novel anonymization methods for sparse high-dimensional data.

The first category is based on approximate nearest-neighbor (NN) search in high-dimensional spaces, which is

efficiently performed through locality-sensitive hashing (LSH). In the second category, we propose two data

transformations that capture the correlation in the underlying data: 1) reduction to a band matrix and 2) Gray

encoding-based sorting. These representations facilitate the formation of anonymized groups with low information

loss, through an efficient linear-time heuristic. We show experimentally, using real-life data sets, that all our


methods clearly outperform the existing state of the art. Among the proposed techniques, NN-search yields superior

data utility compared to the band matrix transformation, but incurs higher computational overhead. The data

transformation based on Gray code sorting performs best in terms of both data utility and execution time.
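The intuition behind Gray-code sorting can be sketched in a few lines of Python. Each transaction is treated as a bit vector and read as a binary-reflected Gray code word, and transactions are ordered by their rank in the Gray sequence, in which adjacent code words differ in a single bit. Consecutive transactions are then cut into groups of at least k. This is a simplified illustration, not the paper's method: the grouping rule is an assumption, handling of sensitive items is omitted, and the names `gray_rank` and `group_by_gray_order` are hypothetical.

```python
def gray_rank(bits):
    """Rank of a bit tuple in the binary-reflected Gray code sequence."""
    rank, acc = 0, 0
    for b in bits:  # most significant bit first
        acc ^= b  # the running XOR decodes Gray code into plain binary
        rank = (rank << 1) | acc
    return rank


def group_by_gray_order(transactions, k):
    """Sort by Gray rank, then cut into consecutive groups of at least k."""
    ordered = sorted(transactions, key=gray_rank)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        # Merge an undersized trailing group into the previous one.
        tail = groups.pop()
        groups[-1].extend(tail)
    return groups
```

For two items, the Gray order is 00, 01, 11, 10, so `gray_rank` maps those vectors to 0, 1, 2, and 3 respectively.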

10 Answering Frequent Probabilistic Inference Queries in Databases

Existing solutions for probabilistic inference queries mainly focus on answering a single inference query, but seldom

address the issues of efficiently returning results for a sequence of frequent queries, which is more popular and practical in

many real applications. In this paper, we mainly study the computation caching and sharing among a sequence of inference

queries in databases. The clique tree propagation (CTP) algorithm is first introduced in databases for probabilistic inference

queries. We use the materialized views to cache the intermediate results of the previous inference queries, which might be

shared with the following queries, and consequently reduce the time cost. Moreover, we take the query workload into

account to identify the frequently queried variables. To optimize probabilistic inference queries with CTP, we cache these

frequent query variables into the materialized views to maximize the reuse. Due to the existence of different query plans, we

present heuristics to estimate costs and select the optimal query plan. Finally, we present the experimental evaluation in

relational databases to illustrate the validity and superiority of our approaches in answering frequent probabilistic inference

queries.

11 Authenticated Multistep Nearest Neighbor Search

Multistep processing is commonly used for nearest neighbor (NN) and similarity search in applications involving

high-dimensional data and/or costly distance computations. Today, many such applications require a proof of result

correctness. In this setting, clients issue NN queries to a server that maintains a database signed by a trusted

authority. The server returns the NN set along with supplementary information that permits result verification using the

data set signature. An adaptation of the multistep NN algorithm incurs prohibitive network overhead due to the

transmission of false hits, i.e., records that are not in the NN set, but are nevertheless necessary for its verification. In

order to alleviate this problem, we present a novel technique that reduces the size of each false hit. Moreover, we

generalize our solution for a distributed setting, where the database is horizontally partitioned over several servers.

Finally, we demonstrate the effectiveness of the proposed solutions with real data sets of various dimensionalities.

12 Automatic Discovery of Personal Name Aliases from the Web

An individual is typically referred to by numerous name aliases on the web. Accurate identification of aliases of a given

person name is useful in various web related tasks such as information retrieval, sentiment analysis, personal name

disambiguation, and relation extraction. We propose a method to extract aliases of a given personal name from the

web. Given a personal name, the proposed method first extracts a set of candidate aliases. Second, we rank the

extracted candidates according to the likelihood of a candidate being a correct alias of the given name. We propose a

novel, automatically extracted lexical pattern-based approach to efficiently extract a large set of candidate aliases from

snippets retrieved from a web search engine. We define numerous ranking scores to evaluate candidate aliases using

three approaches: lexical pattern frequency, word co-occurrences in an anchor text graph, and page counts on the

web. To construct a robust alias detection system, we integrate the different ranking scores into a single ranking

function using ranking support vector machines. We evaluate the proposed method on three data sets: an English

personal names data set, an English place names data set, and a Japanese personal names data set. The proposed

method outperforms numerous baselines and previously proposed name alias extraction methods, achieving a

statistically significant mean reciprocal rank (MRR) of 0.67. Experiments carried out using location names and

Japanese personal names suggest the possibility of extending the proposed method to extract aliases for different

types of named entities, and for different languages. Moreover, the aliases extracted using the proposed method are

successfully utilized in an information retrieval task and improve recall by 20 percent in a relation detection task.
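Mean reciprocal rank, the evaluation measure quoted above, averages the reciprocal of the rank at which the first correct answer appears for each query. The Python sketch below shows the standard definition; the function name and the placeholder data used with it are hypothetical and are not taken from the paper's data sets.

```python
def mean_reciprocal_rank(ranked_lists, correct_sets):
    """Average of 1/rank of the first correct candidate per query (0 if none)."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for candidates, correct in zip(ranked_lists, correct_sets):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in correct:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

For example, if the first correct alias appears at rank 1 for one name, at rank 3 for a second, and not at all for a third, the MRR is (1 + 1/3 + 0) / 3.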


13 Automatic Enrichment of Semantic Relation Network and Its Application to Word Sense Disambiguation

The most fundamental step in semantic information processing (SIP) is to construct a knowledge base (KB) at the

human level, that is, at the level of the general understanding and conception of human knowledge. WordNet has been built to be

the most systematic and as close to the human level as possible, and is being applied actively in various works. In one of our

previous studies, we found that a semantic gap exists between concept pairs of WordNet and those of the real world.

This paper contains a study on the enrichment method to build a KB. We describe the methods and the results for the

automatic enrichment of the semantic relation network. A rule-based method using WordNet’s glossaries and an

inference method using axioms for WordNet relations are applied for the enrichment and an enriched WordNet (E-

WordNet) is built as the result. Our experimental results substantiate the usefulness of E-WordNet. An evaluation by

comparison with the human level is attempted. Moreover, WSD-SemNet, a new word sense disambiguation (WSD)

method in which E-WordNet is applied, is proposed and evaluated by comparing it with the state-of-the-art algorithm.

14 Branch-and-Bound for Model Selection and Its Computational Complexity

Branch-and-bound methods are used in various data analysis problems, such as clustering, seriation and feature

selection. Classical approaches of branch-and-bound based clustering search through combinations of various

partitioning possibilities to optimize a clustering cost. However, these approaches are not practically useful for

clustering of image data where the size of data is large. Additionally, the number of clusters is unknown in most of the

image data analysis problems. By taking advantage of the spatial coherency of clusters, we formulate an innovative

branch-and-bound approach, which solves the clustering problem as a model-selection problem. In this generalized

approach, cluster parameter candidates are first generated by spatially coherent sampling. A branch-and-bound search

is carried out through the candidates to select an optimal subset. This paper formulates this approach and investigates

its average computational complexity. Improved clustering quality and robustness to outliers compared to

the conventional iterative approach are demonstrated with experiments.

15 Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints

Most existing data stream classification techniques ignore one important aspect of stream data: arrival of a

novel class. We address this issue and propose a data stream classification technique that integrates a novel

class detection mechanism into traditional classifiers, enabling automatic detection of novel classes before the

true labels of the novel class instances arrive. The novel class detection problem becomes more challenging in the

presence of concept-drift, when the underlying data distributions evolve in streams. In order to determine

whether an instance belongs to a novel class, the classification model sometimes needs to wait for more test

instances to discover similarities among those instances. A maximum allowable wait time Tc is imposed as a

time constraint to classify a test instance. Furthermore, most existing stream classification approaches assume

that the true label of a data point can be accessed immediately after the data point is classified. In reality, a time

delay Tl is involved in obtaining the true label of a data point since manual labeling is time consuming. We show

how to make fast and correct classification decisions under these constraints and apply them to real benchmark

data. Comparison with state-of-the-art stream classification techniques proves the superiority of our approach.

16 Classification Using Streaming Random Forests

We consider the problem of data stream classification, where the data arrive in a conceptually infinite stream,

and the opportunity to examine each record is brief. We introduce a stream classification algorithm that is


online, running in amortized O(1) time, able to handle intermittent arrival of labeled records, and able to adjust

its parameters to respond to changing class boundaries (“concept drift”) in the data stream. In addition, when

blocks of labeled data are short, the algorithm is able to judge internally whether the quality of models updated

from them is good enough for deployment on unlabeled records, or whether further labeled records are required.

Unlike most proposed stream-classification algorithms, multiple target classes can be handled. Experimental

results on real and synthetic data show that accuracy is comparable to a conventional classification algorithm

that sees all of the data at once and is able to make multiple passes over it.

17 CoFiDS: A Belief-Theoretic Approach for Automated Collaborative Filtering

Automated Collaborative Filtering (ACF) refers to a group of algorithms used in recommender systems, a

research topic that has received considerable attention due to its e-commerce applications. However, existing

techniques are rarely capable of dealing with imperfections in user-supplied ratings. When such imperfections

(e.g., ambiguities) cannot be avoided, designers resort to simplifying assumptions that impair the system’s

performance and utility. We have developed a novel technique referred to as CoFiDS (Collaborative Filtering

based on the Dempster-Shafer belief-theoretic framework) that can represent a wide variety of data imperfections,

propagate them throughout the decision-making process without the need to make simplifying assumptions,

and exploit contextual information. With its DS-theoretic predictions, the domain expert can either obtain a

“hard” decision or can narrow the set of possible predictions to a smaller set. With its capability to handle data

imperfections, CoFiDS widens the applicability of ACF to such critical and sensitive domains as medical

decision support systems and defense-related applications. We describe the theoretical foundation of the

system and report experiments with a benchmark movie data set. We explore some essential aspects of CoFiDS’

behavior and show that its performance compares favorably with other ACF systems.

18 Collaborative Filtering with Personalized Skylines

Collaborative filtering (CF) systems exploit previous ratings and similarity in user behavior to recommend the

top-k objects/records that are potentially most interesting to the user, assuming a single score per object.

However, in various applications, a record (e.g., a hotel) may be rated on several attributes (value, service, etc.), in

which case simply returning the ones with the highest overall scores fails to capture the individual attribute

characteristics and to accommodate different selection criteria. In order to enhance the flexibility of CF, we

propose Collaborative Filtering Skyline (CFS), a general framework that combines the advantages of CF with

those of the skyline operator. CFS generates a personalized skyline for each user based on scores of other

users with similar behavior. The personalized skyline includes objects that are good on certain aspects, and

eliminates the ones that are not interesting on any attribute combination. Although the integration of skylines

and CF has several attractive properties, it also involves rather expensive computations. We address this challenge

through a comprehensive set of algorithms and optimizations that reduce the cost of generating personalized

skylines. In addition to exact skyline processing, we develop an approximate method that provides error

guarantees. Finally, we propose the top-k personalized skyline, where the user specifies the required output

cardinality.

19 Comprehensive Citation Index for Research Networks

The existing Science Citation Index only counts direct citations, whereas PageRank disregards the number of

direct citations. We propose a new Comprehensive Citation Index (CCI) that evaluates both direct and indirect

intellectual influence of research papers, and show that CCI is more reliable in discovering research papers with

far-reaching influence.


20 Constrained Skyline Query Processing against Distributed Data Sites

The skyline of a multidimensional point set is a subset of interesting points that are not dominated by others. In

this paper, we investigate constrained skyline queries in a large-scale unstructured distributed environment, where

relevant data are distributed among geographically scattered sites. We first propose a partition algorithm that

divides all data sites into incomparable groups such that the skyline computations in all groups can be parallelized

without changing the final result. We then develop a novel algorithm framework called PaDSkyline for parallel

skyline query processing among partitioned site groups. We also employ intragroup optimization and a multifiltering

technique to improve the skyline query processes within each group. In particular, multiple (local) skyline points

are sent together with the query as filtering points, which help identify unqualified local skyline points early on a

data site. In this way, the amount of data to be transmitted via network connections is reduced, and thus, the

overall query response time is shortened further. Cost models and heuristics are proposed to guide the selection of

a given number of filtering points from a superset. A cost-efficient model is developed to determine how many

filtering points to use for a particular data site. The results of an extensive experimental study demonstrate that our

proposals are effective and efficient.
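The dominance relation behind the skyline can be illustrated with a short Python sketch. It assumes that smaller values are better in every dimension and uses a quadratic scan. The names `dominates`, `skyline`, and `constrained_skyline` are hypothetical, the range-constraint form is an assumption, and this is not the PaDSkyline framework itself.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Points that are not dominated by any other point (quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def constrained_skyline(points, lower, upper):
    """Skyline restricted to points inside the per-dimension range [lower, upper]."""
    inside = [
        p for p in points
        if all(lo <= v <= hi for v, lo, hi in zip(p, lower, upper))
    ]
    return skyline(inside)
```

Distributed processing relies on a basic property: a point that is dominated within its own site can never belong to the global skyline, so computing the skyline of the union of per-site skylines gives the same result as computing it over all the data at once.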

21 Continuous Monitoring of Distance-Based Range Queries

Given a positive value r, a distance-based range query returns the objects that lie within the distance r of the query

location. In this paper, we focus on the distance-based range queries that continuously change their locations in a

Euclidean space. We present an efficient and effective monitoring technique based on the concept of a safe zone.

The safe zone of a query is the area with a property that while the query remains inside it, the results of the query

remain unchanged. Hence, the query does not need to be reevaluated unless it leaves the safe zone. Our

contributions are as follows: 1) We propose a technique based on powerful pruning rules and a unique access

order which efficiently computes the safe zone and minimizes the I/O cost. 2) We theoretically determine and

experimentally verify the expected distance a query moves before leaving the safe zone and, for the majority of

queries, the expected number of guard objects. 3) Our experiments demonstrate that the proposed approach is

close to optimal and is an order of magnitude faster than a naive algorithm. 4) We also extend our technique to

monitor the queries in a road network. Our algorithm is up to two orders of magnitude faster than a naive algorithm.
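The safe-zone idea can be illustrated with a deliberately conservative Python sketch. By the triangle inequality, a query that moves by less than the smallest gap between r and any object's distance cannot change its result, so a circle of that radius around the query is always contained in the true safe zone. This is a simplification for illustration, not the paper's pruning-based computation, and the function names are hypothetical.

```python
import math


def range_query(objects, q, r):
    """Names of the objects within Euclidean distance r of query location q."""
    return {name for name, pos in objects.items() if math.dist(pos, q) <= r}


def safe_radius(objects, q, r):
    """Conservative bound: moving q by less than this keeps the result unchanged."""
    return min(
        (abs(math.dist(pos, q) - r) for pos in objects.values()),
        default=math.inf,
    )
```

The true safe zone is generally larger than this circle, which is why a more careful computation pays off in fewer re-evaluations.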

22 Cosdes: A Collaborative Spam Detection System with a Novel E-Mail Abstraction Scheme

E-mail communication is indispensable nowadays, but the e-mail spam problem continues to grow drastically. In

recent years, the notion of collaborative spam filtering with near-duplicate similarity matching scheme has been

widely discussed. The primary idea of the similarity matching scheme for spam detection is to maintain a known

spam database, formed by user feedback, to block subsequent near-duplicate spams. To achieve

efficient similarity matching and reduce storage utilization, prior works mainly represent each e-mail by a

succinct abstraction derived from e-mail content text. However, these abstractions of e-mails cannot fully catch the

evolving nature of spams, and are thus not effective enough in near-duplicate detection. In this paper, we propose

a novel e-mail abstraction scheme, which considers e-mail layout structure to represent e-mails. We present a

procedure to generate the e-mail abstraction using HTML content in e-mail, and this newly devised abstraction can

more effectively capture the near-duplicate phenomenon of spams. Moreover, we design a complete spam

detection system Cosdes (standing for COllaborative Spam DEtection System), which possesses an efficient near-

duplicate matching scheme and a progressive update scheme. The progressive update scheme enables system

Cosdes to keep the most up-to-date information for near-duplicate detection. We evaluate Cosdes on a live data set

collected from a real e-mail server and show that our system outperforms the prior approaches in detection results

and is applicable to the real world.
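A layout-based abstraction can be sketched in Python by reducing an HTML e-mail to its sequence of tags and ignoring the text between them, so that two messages with different wording but the same layout map to the same abstraction. This is an assumed simplification for illustration only. The abstract does not spell out the actual abstraction procedure, and the names `TagSequence` and `layout_abstraction` are hypothetical.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects opening and closing tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.tags.append("</" + tag + ">")


def layout_abstraction(html):
    """Tuple of tags describing the e-mail layout (attributes and text dropped)."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return tuple(parser.tags)
```

Two messages can then be compared by their abstractions rather than by their text, for example by exact equality or by a similarity measure over the tag sequences.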


23 Coupling Logical Analysis of Data and Shadow Clustering for Partially Defined Positive Boolean Function Reconstruction

The problem of reconstructing the AND-OR expression of a partially defined positive Boolean function (pdpBf) is

solved by adopting a novel algorithm, denoted by LSC, which combines the advantages of two efficient techniques,

Logical Analysis of Data (LAD) and Shadow Clustering (SC). The kernel of the approach followed by LAD consists

in a breadth-first enumeration of all the prime implicants whose degree is not greater than a fixed maximum d. In

contrast, SC adopts an effective heuristic procedure for retrieving the most promising logical products to be

included in the resulting AND-OR expression. Since the computational cost required by LAD prevents its

application even for relatively small dimensions of the input domain, LSC employs a depth-first approach, with

asymptotically linear memory occupation, to analyze the prime implicants having degree not greater than d. In

addition, the theoretical analysis proves that LSC presents almost the same asymptotic time complexity as LAD.

Extensive simulations on artificial benchmarks validate the good behavior of the computational cost exhibited by

LSC, in agreement with the theoretical analysis. Furthermore, the pdpBf retrieved by LSC always shows a better

performance, in terms of complexity and accuracy, with respect to those obtained by LAD.

24 Data Leakage Detection

We study the following problem: A data distributor has given sensitive data to a set of supposedly trusted agents (third

parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop). The

distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been

independently gathered by other means. We propose data allocation strategies (across the agents) that improve the

probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In

some cases, we can also inject “realistic but fake” data records to further improve our chances of detecting leakage

and identifying the guilty party.

25 Decision Trees for Uncertain Data

Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to

handle data with uncertain information. Value uncertainty arises in many applications during the data collection

process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple

repeated measurements. With uncertainty, the value of a data item is often represented not by one single value, but by

multiple values forming a probability distribution. Rather than abstracting uncertain data by statistical derivatives

(such as mean and median), we discover that the accuracy of a decision tree classifier can be much improved if the

“complete information” of a data item (taking into account the probability density function (pdf)) is utilized. We extend

classical decision tree building algorithms to handle data tuples with uncertain values. Extensive experiments have

been conducted which show that the resulting classifiers are more accurate than those using value averages. Since

processing pdfs is computationally more costly than processing single values (e.g., averages), decision tree

construction on uncertain data is more CPU demanding than that for certain data. To tackle this problem, we propose a

series of pruning techniques that can greatly improve construction efficiency.
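As a rough illustration of using the "complete information" of an uncertain value, the sketch below scores one candidate split when each attribute value is a pdf rather than a point. It assumes each pdf is approximated by a list of (value, probability) sample points; that representation and the function names are hypothetical, and the pruning techniques mentioned in the abstract are not shown.

```python
import math
from collections import defaultdict

def entropy(class_weights):
    """Shannon entropy (in bits) of a dict mapping class label -> fractional weight."""
    total = sum(class_weights.values())
    if total == 0:
        return 0.0
    h = 0.0
    for w in class_weights.values():
        if w > 0:
            p = w / total
            h -= p * math.log2(p)
    return h

def split_entropy(tuples, z):
    """Weighted entropy of the two branches after splitting at threshold z.

    Each tuple is (samples, label), where samples is a list of
    (value, probability) pairs approximating the tuple's pdf (an assumption).
    A tuple sends probability mass P(value <= z) to the left branch
    and the remaining mass to the right branch.
    """
    left, right = defaultdict(float), defaultdict(float)
    for samples, label in tuples:
        p_left = sum(p for v, p in samples if v <= z)
        left[label] += p_left
        right[label] += 1.0 - p_left
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    if n == 0:
        return 0.0
    return (n_left / n) * entropy(left) + (n_right / n) * entropy(right)
```

With point-valued data (a single sample of probability 1 per tuple) this reduces to the usual entropy-based split score, so the classical case is a special case of the sketch.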

26 Design and Implementation of an Intrusion Response System for Relational Databases

The intrusion response component of an overall intrusion detection system is responsible for issuing a suitable

response to an anomalous request. We propose the notion of database response policies to support our intrusion

response system tailored for a DBMS. Our interactive response policy language makes it very easy for the database



administrators to specify appropriate response actions for different circumstances depending upon the nature of the

anomalous request. The two main issues that we address in context of such response policies are that of policy

matching, and policy administration. For the policy matching problem, we propose two algorithms that efficiently

search the policy database for policies that match an anomalous request. We also extend the PostgreSQL DBMS with

our policy matching mechanism, and report experimental results. The experimental evaluation shows that our

techniques are very efficient. The other issue that we address is that of administration of response policies to prevent

malicious modifications to policy objects from legitimate users. We propose a novel Joint Threshold Administration

Model (JTAM) that is based on the principle of separation of duty. The key idea in JTAM is that a policy object is jointly

administered by at least k database administrators (DBAs), that is, any modification made to a policy object will be

invalid unless it has been authorized by at least k DBAs. We present design details of JTAM which is based on a

cryptographic threshold signature scheme, and show how JTAM prevents malicious modifications to policy objects

from authorized users. We also implement JTAM in the PostgreSQL DBMS, and report experimental results on the

efficiency of our techniques.
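The abstract does not detail the threshold signature scheme behind JTAM. As a stand-in for the "at least k of n administrators" idea only, the sketch below uses Shamir secret sharing, which is a different (and simpler) threshold primitive: any k shares recover a secret, while fewer do not determine it. The prime, function names, and parameters are illustrative assumptions, not part of JTAM.

```python
import random

# A Mersenne prime used as the field modulus (demo-sized assumption).
PRIME = 2**127 - 1

def make_shares(secret, k, n, rng=None):
    """Split `secret` into n shares such that any k of them recover it."""
    rng = rng or random.SystemRandom()
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For example, with k = 3 and n = 5, any three DBAs holding shares could jointly reconstruct an authorization key, while two could not.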

27 Differential Privacy via Wavelet Transforms

Privacy-preserving data publishing has attracted considerable research interest in recent years. Among the existing

solutions, ε-differential privacy provides the strongest privacy guarantee. Existing data publishing methods that

achieve ε-differential privacy, however, offer little data utility. In particular, if the output data set is used to answer

count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders

the results useless. In this paper, we develop a data publishing technique that ensures ε-differential privacy while

providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a

range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it.

We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical

analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we

show the effectiveness and efficiency of our solution.
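The "transform, then add noise" pipeline can be sketched for a one-dimensional histogram of counts as follows. This is a minimal sketch, assuming an unnormalized Haar transform and a single Laplace noise scale for every coefficient; the paper calibrates the noise per coefficient, which is not reproduced here, so the sketch by itself makes no ε-differential privacy claim. All function names are hypothetical.

```python
import random

def haar_forward(values):
    """Unnormalized Haar transform of a list whose length is a power of two.

    Returns [overall average, detail coefficients from coarse to fine].
    """
    data = list(values)
    details = []
    while len(data) > 1:
        avgs = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        diffs = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = diffs + details  # coarser levels end up in front
        data = avgs
    return data + details

def haar_inverse(coeffs):
    """Invert haar_forward."""
    data = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(data)]
        pos += len(data)
        data = [v for a, d in zip(data, diffs) for v in (a + d, a - d)]
    return data

def laplace(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_histogram(counts, scale, rng=None):
    """Add noise in the wavelet domain, then transform back.

    `scale` is a placeholder; a real mechanism derives it from epsilon
    and the sensitivity of each coefficient.
    """
    rng = rng or random.Random()
    noisy = [c + laplace(scale, rng) for c in haar_forward(counts)]
    return haar_inverse(noisy)

def range_count(hist, lo, hi):
    """Range-count query over bins lo..hi inclusive."""
    return sum(hist[lo:hi + 1])
```

A range-count query is then answered from the published histogram, for example `range_count(noisy_histogram(counts, 1.0), 2, 5)`.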

28 Discovering Activities to Recognize and Track in a Smart Environment

The machine learning and pervasive sensing technologies found in smart homes offer unprecedented opportunities for

providing health monitoring and assistance to individuals experiencing difficulties living independently at home. In

order to monitor the functional health of smart home residents, we need to design technologies that recognize and

track activities that people normally perform as part of their daily routines. Although approaches do exist for

recognizing activities, the approaches are applied to activities that have been preselected and for which labeled

training data are available. In contrast, we introduce an automated approach to activity tracking that identifies frequent

activities that naturally occur in an individual’s routine. With this capability, we can then track the occurrence of

regular activities to monitor functional health and to detect changes in an individual’s patterns and lifestyle. In this

paper, we describe our activity mining and tracking approach, and validate our algorithms on data collected in physical

smart environments.

29 Discovering Conditional Functional Dependencies

This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent

extension of functional dependencies (FDs) by supporting patterns of semantically related constants, and can

be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that

involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for

discovering CFDs from relations. Already hard for traditional FDs, the discovery problem is more difficult for

CFDs. Indeed, mining patterns in CFDs introduces new challenges. We provide three methods for CFD

discovery. The first, referred to as CFDMiner, is based on techniques for mining closed itemsets, and is used to



discover constant CFDs, namely, CFDs with constant patterns only. Constant CFDs are particularly important for

object identification, which is essential to data cleaning and data integration. The other two algorithms are

developed for discovering general CFDs. One algorithm, referred to as CTANE, is a levelwise algorithm that

extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first

approach used in FastFD, a method for discovering FDs. It leverages closed-itemset mining to reduce the

search space. As verified by our experimental study, CFDMiner can be multiple orders of magnitude faster than

CTANE and FastCFD for constant CFD discovery. CTANE works well when a given relation is large, but it does

not scale well with the arity of the relation. FastCFD is far more efficient than CTANE when the arity of the

relation is large; better still, leveraging optimization based on closed-itemset mining, FastCFD also scales well

with the size of the relation. These algorithms provide a set of cleaning-rule discovery tools for users to choose

for different applications.
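To make the notion of a constant CFD concrete, the sketch below checks one such rule against a relation and returns the violating tuples, which is how a discovered CFD would be used as a cleaning rule. It does not perform discovery. The relation, attribute names, and the rule itself are made-up illustrations.

```python
def violations_of_constant_cfd(rows, lhs, rhs):
    """Return rows that match every LHS constant but disagree with the RHS constant.

    rows: list of dicts (one per tuple)
    lhs:  dict mapping attribute -> required constant
    rhs:  (attribute, constant) pair implied by the LHS pattern
    """
    attr, const = rhs
    return [
        r for r in rows
        if all(r.get(a) == v for a, v in lhs.items()) and r.get(attr) != const
    ]

# Hypothetical rule: country code 44 and area code 131 imply city "Edinburgh".
rows = [
    {"cc": "44", "ac": "131", "city": "Edinburgh"},
    {"cc": "44", "ac": "131", "city": "London"},     # violates the rule
    {"cc": "01", "ac": "908", "city": "Murray Hill"},  # rule does not apply
]
dirty = violations_of_constant_cfd(rows, {"cc": "44", "ac": "131"}, ("city", "Edinburgh"))
```

Here `dirty` contains only the second row; tuples that do not match the pattern on the left-hand side are never flagged.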

30 Effective Navigation of Query Results Based on Concept Hierarchies

Search queries on biomedical databases, such as PubMed, often return a large number of results, only a small

subset of which is relevant to the user. Ranking and categorization, which can also be combined, have been

proposed to alleviate this information overload problem. Results categorization for biomedical databases is the

focus of this work. A natural way to organize biomedical citations is according to their MeSH annotations. MeSH

is a comprehensive concept hierarchy used by PubMed. In this paper, we present the BioNav system, a novel

search interface that enables the user to navigate a large number of query results by organizing them using the

MeSH concept hierarchy. First, the query results are organized into a navigation tree. At each node expansion

step, BioNav reveals only a small subset of the concept nodes, selected such that the expected user navigation

cost is minimized. In contrast, previous works expand the hierarchy in a predefined static manner, without

navigation cost modeling. We show that the problem of selecting the best concepts to reveal at each node

expansion is NP-complete and propose an efficient heuristic as well as a feasible optimal algorithm for relatively

small trees. We show experimentally that BioNav outperforms state-of-the-art categorization systems by up to an

order of magnitude, with respect to the user navigation cost.

31 Efficient and Accurate Discovery of Patterns in Sequence Data Sets

Existing sequence mining algorithms mostly focus on mining for subsequences. However, a large class of

applications, such as biological DNA and protein motif mining, require efficient mining of “approximate”

patterns that are contiguous. The few existing algorithms that can be applied to such contiguous

approximate pattern mining have drawbacks like poor scalability, lack of guarantees in finding the pattern, and

difficulty in adapting to other applications. In this paper, we present a new algorithm called FLexible and

Accurate Motif DEtector (FLAME). FLAME is a flexible suffix-tree-based algorithm that can be used to find

frequent patterns with a variety of definitions of motif (pattern) models. It is also accurate, as it always finds the

pattern if it exists. Using both real and synthetic data sets, we demonstrate that FLAME is fast, scalable, and

outperforms existing algorithms on a variety of performance metrics. In addition, based on FLAME, we also

address a more general problem, named extended structured motif extraction, which allows mining frequent

combinations of motifs under relaxed constraints.

32 Efficient Periodicity Mining in Time Series Databases Using Suffix Trees

Periodic pattern mining or periodicity detection has a number of applications, such as prediction, forecasting,

detection of unusual activities, etc. The problem is not trivial because the data to be analyzed are mostly noisy

and different periodicity types (namely symbol, sequence, and segment) are to be investigated. Accordingly, we

argue that there is a need for a comprehensive approach that can analyze the whole time series or a

subsection of it, effectively handle different types of noise (to a certain degree), and at the same time



detect different types of periodic patterns; combining these under one umbrella is by itself a challenge. In this

paper, we present an algorithm which can detect symbol, sequence (partial), and segment (full cycle) periodicity

in time series. The algorithm uses suffix tree as the underlying data structure; this allows us to design the

algorithm such that its worst-case complexity is O(k·n^2), where k is the maximum length of periodic pattern and

n is the length of the analyzed portion (whole or subsection) of the time series. The algorithm is noise resilient; it

has been successfully demonstrated to work with replacement, insertion, deletion, or a mixture of these types of

noise. We have tested the proposed algorithm on both synthetic and real data from different domains, including

protein sequences. The conducted comparative study demonstrates the applicability and effectiveness of the

proposed algorithm; it is generally more time-efficient and noise-resilient than existing algorithms.
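To illustrate what symbol periodicity means, the sketch below measures how often a symbol recurs at a fixed period from a given start position. It is a naive direct scan, not the suffix-tree algorithm of the paper, and the confidence definition (observed hits divided by expected positions) is an assumption borrowed from common usage; the function name is hypothetical.

```python
def symbol_periodicity_confidence(series, symbol, period, start):
    """Fraction of positions start, start+period, ... that hold `symbol`."""
    positions = range(start, len(series), period)
    if len(positions) == 0:
        return 0.0
    hits = sum(1 for i in positions if series[i] == symbol)
    return hits / len(positions)

# 'a' is expected at positions 0, 3, 6, 9; the last occurrence is replaced by noise.
conf_a = symbol_periodicity_confidence("abcabbabcb", "a", 3, 0)
```

A confidence threshold below 1.0 is what lets such a check tolerate replacement noise: here three of the four expected positions still hold the symbol.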

33 Efficient Relevance Feedback for Content-Based Image Retrieval by Mining User Navigation Patterns

Nowadays, content-based image retrieval (CBIR) is the mainstay of image retrieval systems. To be more

profitable, relevance feedback techniques were incorporated into CBIR such that more precise results can be

obtained by taking user feedback into account. However, existing relevance feedback-based CBIR methods

usually request a number of feedback iterations to produce refined search results, especially in a large-scale

image database. This is impractical and inefficient in real applications. In this paper, we propose a novel method,

Navigation-Pattern-based Relevance Feedback (NPRF), to achieve the high efficiency and effectiveness of CBIR

in coping with the large-scale image data. In terms of efficiency, the iterations of feedback are reduced

substantially by using the navigation patterns discovered from the user query log. In terms of effectiveness, our

proposed search algorithm NPRFSearch makes use of the discovered navigation patterns and three kinds of

query refinement strategies, Query Point Movement (QPM), Query Reweighting (QR), and Query Expansion

(QEX), to converge the search space toward the user’s intention effectively. By using the NPRF method, high-quality

image retrieval with relevance feedback can be achieved in a small number of feedback rounds. The experimental results reveal that

NPRF outperforms other existing methods significantly in terms of precision, coverage, and number of

feedbacks.

34 Efficient Techniques for Online Record Linkage

The need to consolidate the information contained in heterogeneous data sources has been widely documented in

recent years. In order to accomplish this goal, an organization must resolve several types of heterogeneity

problems, especially the entity heterogeneity problem that arises when the same real-world entity type is

represented using different identifiers in different data sources. Statistical record linkage techniques could be used

for resolving this problem. However, the use of such techniques for online record linkage could pose a tremendous

communication bottleneck in a distributed environment (where entity heterogeneity problems are often

encountered). In order to resolve this issue, we develop a matching tree, similar to a decision tree, and use it to

propose techniques that reduce the communication overhead significantly, while providing matching decisions

that are guaranteed to be the same as those obtained using the conventional linkage technique. These techniques

have been implemented, and experiments with real-world and synthetic databases show significant reduction in

communication overhead.

35 Efficient Top-k Approximate Subtree Matching in Small Memory

We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small

query tree within a large document tree using the canonical tree edit distance as a similarity measure between

subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic

runtime and quadratic space complexity, and, thus, do not scale. Our solution is TASM-postorder, a memory-efficient

and scalable TASM algorithm. We prove an upper bound for the maximum subtree size for which the tree



edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document

size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop

an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single

postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space

complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear

in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our

analytic results.

36 Energy Time Series Forecasting Based on Pattern Sequence Similarity

This paper presents a new approach to forecast the behavior of time series based on similarity of pattern

sequences. First, clustering techniques are used with the aim of grouping and labeling the samples from a data set.

Thus, the prediction of a data point is provided as follows: first, the pattern sequence prior to the day to be

predicted is extracted. Then, this sequence is searched in the historical data and the prediction is calculated by

averaging all the samples immediately after the matched sequence. The main novelty is that only the labels

associated with each pattern are considered to forecast the future behavior of the time series, avoiding the use of

real values of the time series until the last step of the prediction process. Results from several energy time series

are reported and the performance of the proposed method is compared to that of recently published techniques

showing a remarkable improvement in the prediction.
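The prediction step described above can be sketched as follows, assuming the clustering has already produced one label per day. For brevity each day is a single number here rather than a vector of samples, and the function name and window handling are illustrative assumptions.

```python
def psf_forecast(values, labels, window):
    """Forecast the next sample from label-sequence matches.

    values[i] is the real sample for day i, labels[i] its cluster label.
    The last `window` labels form the pattern; every earlier occurrence of
    that pattern contributes the real value that immediately followed it.
    Returns None when the pattern never occurred before.
    """
    if window < 1 or window >= len(labels):
        raise ValueError("window must be between 1 and len(labels) - 1")
    pattern = labels[-window:]
    followers = []
    for i in range(len(labels) - window):
        if labels[i:i + window] == pattern:
            followers.append(values[i + window])
    if not followers:
        return None
    # Real values enter only here, in the final averaging step.
    return sum(followers) / len(followers)

labels = [0, 1, 0, 1, 0, 1]
values = [10, 20, 12, 22, 14, 24]
prediction = psf_forecast(values, labels, window=2)
```

In this toy series the pattern [0, 1] was followed by 12 and then by 14, so the forecast is their average. When no match exists, a shorter window could be tried; that fallback is left out of the sketch.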

37 Estimating and Enhancing Real-Time Data Service Delays: Control-Theoretic Approaches

It is essential to process real-time data service requests such as stock quotes and trade transactions in a timely

manner using fresh data, which represent the current real-world phenomena such as the stock market status.

Users may simply leave when the database service delay is excessive. Also, temporally inconsistent data may give

an outdated view of the real-world status. However, supporting the desired timeliness and freshness is challenging

due to dynamic workloads. To address the problem, we present new approaches for 1) database backlog

estimation, 2) fine-grained closed-loop admission control based on the backlog model, and 3) incoming load

smoothing. Our backlog estimation and control-theoretic approaches aim to support the desired service delay

bound without degrading the data freshness, critical for real-time data services. Specifically, we design, implement,

and evaluate two feedback controllers based on linear control theory and fuzzy logic control theory, to meet the

desired service delay. Workload smoothing, under overload, helps the database admit and process more

transactions in a timely fashion by probabilistically reducing the burstiness of incoming data service requests. In

terms of the data service delay and throughput, our closed-loop admission control and probabilistic load

smoothing schemes considerably outperform several baselines in the experiments undertaken in a stock trading

database test bed.

38 Experience Transfer for the Configuration Tuning in Large-Scale Computing Systems

This paper proposes a new strategy, the experience transfer, to facilitate the management of large-scale computing

systems. It deals with the utilization of management experiences in one system (or previous systems) to benefit the

same management task in other systems (or current systems). We use the system configuration tuning as a case

application to demonstrate all procedures involved in the experience transfer including the experience representation,

experience extraction, and experience embedding. The dependencies between system configuration parameters are

treated as transferable experiences in the configuration tuning for two reasons: 1) because such knowledge is helpful

to the efficiency of the optimal configuration search, and 2) because the parameter dependencies are typically

unchanged between two similar systems. We use the Bayesian network to model configuration dependencies and

present a configuration tuning algorithm based on the Bayesian network construction and sampling. As a result, after

the configuration tuning is completed in the original system, we can obtain a Bayesian network as the by-product



which records the dependencies between system configuration parameters. Such a network is then embedded into the

tuning process in other similar systems as transferred experiences to improve the configuration search efficiency.

Experimental results in a web-based system show that with the help of transferred experiences, the configuration

tuning process can be significantly accelerated.

39 Exploring Application-Level Semantics for Data Compression

Natural phenomena show that many creatures form large social groups and move in regular patterns. However,

previous works focus on finding the movement patterns of each single object or all objects. In this paper, we first

propose an efficient distributed mining algorithm to jointly identify a group of moving objects and discover their

movement patterns in wireless sensor networks. Afterward, we propose a compression algorithm, called 2P2D, which

exploits the obtained group movement patterns to reduce the amount of delivered data. The compression algorithm

includes a sequence merge phase and an entropy reduction phase. In the sequence merge phase, we propose a Merge

algorithm to merge and compress the location data of a group of moving objects. In the entropy reduction phase, we

formulate a Hit Item Replacement (HIR) problem and propose a Replace algorithm that obtains the optimal solution.

Moreover, we devise three replacement rules and derive the maximum compression ratio. The experimental results

show that the proposed compression algorithm leverages the group movement patterns to reduce the amount of

delivered data effectively and efficiently.

40 Extended XML Tree Pattern Matching: Theories and Algorithms

As business and enterprises generate and exchange XML data more often, there is an increasing need for efficient

processing of queries on XML data. Searching for the occurrences of a tree pattern query in an XML database is a core

operation in XML query processing. Prior works demonstrate that the holistic twig pattern matching algorithm is an

efficient technique to answer an XML tree pattern with parent-child (P-C) and ancestor-descendant (A-D) relationships,

as it can effectively control the size of intermediate results during query processing. However, XML query languages

(e.g., XPath and XQuery) define more axes and functions such as negation function, order-based axis, and wildcards.

In this paper, we research a large set of XML tree patterns, called extended XML tree patterns, which may include P-C and

A-D relationships, negation functions, wildcards, and order restriction. We establish a theoretical framework about

“matching cross” which demonstrates the intrinsic reason in the proof of optimality on holistic algorithms. Based on

our theorems, we propose a set of novel algorithms to efficiently process three categories of extended XML tree

patterns. A set of experimental results on both real-life and synthetic data sets demonstrate the effectiveness and

efficiency of our proposed theories and algorithms.

41 Finding Correlated Biclusters from Gene Expression Data

Extracting biologically relevant information from DNA microarrays is a very important task for drug development and

test, function annotation, and cancer diagnosis. Various clustering methods have been proposed for the analysis of

gene expression data, but when analyzing the large and heterogeneous collections of gene expression data,

conventional clustering algorithms often cannot produce a satisfactory solution. Biclustering algorithms have been

presented as an alternative approach to standard clustering techniques to identify local structures from gene

expression data sets. These patterns may provide clues about the main biological processes associated with different

physiological states. In this paper, different from existing bicluster patterns, we first introduce a more general pattern:

correlated bicluster, which has intuitive biological interpretation. Then, we propose a novel transform technique based

on singular value decomposition so that the problem of identifying correlated biclusters from a gene expression matrix is

transformed into two global clustering problems. The Mixed-Clustering algorithm and the Lift algorithm are devised to

efficiently produce corBiclusters. The biclusters obtained using our method from gene expression data sets of multiple

human organs and the yeast Saccharomyces cerevisiae demonstrate clear biological meanings.



42 Frequent Item Computation on a Chip

Computing frequent items is an important problem by itself and as a subroutine in several data mining algorithms. In

this paper, we explore how to accelerate the computation of frequent items using field-programmable gate arrays

(FPGAs) with a threefold goal: increase performance over existing solutions, reduce energy consumption over CPU-

based systems, and explore the design space in detail as the constraints on FPGAs are very different from those of

traditional software-based systems. We discuss three design alternatives, each one of them exploiting different FPGA

features and each one providing different performance/scalability trade-offs. An important result of the paper is to

demonstrate how the inherent massive parallelism of FPGAs can improve performance of existing algorithms but only

after a fundamental redesign of the algorithms. Our experimental results show that, e.g., the pipelined solution we

introduce can reach more than 100 million tuples per second of sustained throughput (four times the best available

results to date) by making use of techniques that are not available to CPU-based solutions. Moreover, and unlike in

software approaches, the high throughput is independent of the skew of the Zipf distribution of the input and at a far

lower energy cost.

43 Inconsistency-Tolerant Integrity Checking

All methods for efficient integrity checking require all integrity constraints to be totally satisfied, before any

update is executed. However, a certain amount of inconsistency is the rule, rather than the exception in

databases. In this paper, we close the gap between theory and practice of integrity checking, i.e., between the

unrealistic theoretical requirement of total integrity and the practical need for inconsistency tolerance, which we

define for integrity checking methods. We show that most of them can still be used to check whether updates

preserve integrity, even if the current state is inconsistent. Inconsistency-tolerant integrity checking proves

beneficial both for integrity preservation and query answering. Also, we show that it is useful for view updating,

repairs, schema evolution, and other applications.

44 Initialization and Restart in Stochastic Local Search: Computing a Most Probable Explanation in Bayesian Networks

For hard computational problems, stochastic local search has proven to be a competitive approach to finding

optimal or approximately optimal problem solutions. Two key research questions for stochastic local search

algorithms are: Which algorithms are effective for initialization? When should the search process be restarted?

In the present work, we investigate these research questions in the context of approximate computation of most

probable explanations (MPEs) in Bayesian networks (BNs). We introduce a novel approach, based on the Viterbi

algorithm, to explanation initialization in BNs. While the Viterbi algorithm works on sequences and trees, our

approach works on BNs with arbitrary topologies. We also give a novel formalization of stochastic local search,

with focus on initialization and restart, using probability theory and mixture models. Experimentally, we apply

our methods to the problem of MPE computation, using a stochastic local search algorithm known as Stochastic

Greedy Search. By carefully optimizing both initialization and restart, we reduce the MPE search time for

application BNs by several orders of magnitude compared to using uniform at random initialization without

restart. On several BNs from applications, the performance of Stochastic Greedy Search is competitive with

clique tree clustering, a state-of-the-art exact algorithm used for MPE computation in BNs.



45 Integration of the HL7 Standard in a Multiagent System to Support Personalized Access to e-Health Services

In this paper, we present a multiagent system to support patients in search of healthcare services in

an e-health scenario. The proposed system is HL7-aware in that it represents both patient and service

information according to the directives of HL7, the information management standard adopted in medical

context. Our system builds a profile for each patient and uses it to detect Healthcare Service Providers

delivering e-health services potentially capable of satisfying his needs. In order to handle this search it can

exploit three different algorithms: the first, called PPB, uses only information stored in the patient profile; the

second, called DSPPB, considers both information stored in the patient profile and similarities among the

e-health services delivered by the involved providers; the third, called AB, relies on A*, a popular search

algorithm in Artificial Intelligence. Our system builds also a social network of patients; once a patient submits a

query and retrieves a set of services relevant to him, our system applies a spreading activation technique on this

social network to find other patients who may benefit from these services.

46 Intertemporal Discount Factors as a Measure of Trustworthiness in Electronic Commerce

In multiagent interactions, such as e-commerce and file sharing, being able to accurately assess the

trustworthiness of others is important for agents to protect themselves from losing utility. Focusing on rational

agents in e-commerce, we prove that an agent’s discount factor (time preference of utility) is a direct measure of

the agent’s trustworthiness for a set of reasonably general assumptions and definitions. We propose a general

list of desiderata for trust systems and discuss how discount factors as trustworthiness meet these desiderata.

We discuss how discount factors are a robust measure when entering commitments that exhibit moral hazards.

Using an online market as a motivating example, we derive some analytical methods both for measuring

discount factors and for aggregating the measurements.

47 IR-Tree: An Efficient Index for Geographic Document Search

Abstract—Given a geographic query that is composed of query keywords and a location, a geographic search

engine retrieves documents that are the most textually and spatially relevant to the query keywords and the

location, respectively, and ranks the retrieved documents according to their joint textual and spatial relevances

to the query. The lack of an efficient index that can simultaneously handle both the textual and spatial aspects of

the documents makes existing geographic search engines inefficient in answering geographic queries. In this

paper, we propose an efficient index, called IR-tree, that together with a top-k document search algorithm

facilitates four major tasks in document searches, namely, 1) spatial filtering, 2) textual filtering, 3) relevance

computation, and 4) document ranking in a fully integrated manner. In addition, IR-tree allows searches to adopt

different weights on textual and spatial relevance of documents at the runtime and thus caters for a wide variety

of applications. A set of comprehensive experiments over a wide range of scenarios has been conducted and

the experimental results demonstrate that IR-tree outperforms the state-of-the-art approaches for geographic

document searches.
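
To make the idea of runtime weights on textual and spatial relevance concrete, here is a brute-force Python sketch. It does not implement the IR-tree index itself; the scoring functions, the `alpha` weight, and the `max_dist` normalization are assumptions chosen only for illustration.

```python
import math


def rank_documents(docs, query_terms, query_loc, alpha=0.5, k=2, max_dist=10.0):
    """Return the k best (score, doc_id) pairs by scanning every document.

    alpha weights textual relevance; (1 - alpha) weights spatial relevance.
    """
    terms = set(query_terms)
    scored = []
    for doc_id, (words, loc) in docs.items():
        # Textual relevance: fraction of query terms present in the document.
        textual = len(terms & set(words)) / len(terms)
        # Spatial relevance: 1 at the query location, falling to 0 at max_dist.
        spatial = max(0.0, 1.0 - math.dist(loc, query_loc) / max_dist)
        scored.append((alpha * textual + (1 - alpha) * spatial, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Hypothetical documents: (keywords, (x, y) location).
docs = {
    "d1": (["pizza", "italian"], (1.0, 1.0)),
    "d2": (["pizza", "bar"], (8.0, 6.0)),
    "d3": (["sushi"], (0.0, 1.0)),
}
text_heavy = rank_documents(docs, ["pizza", "bar"], (0.0, 0.0), alpha=0.9)
space_heavy = rank_documents(docs, ["pizza", "bar"], (0.0, 0.0), alpha=0.1)
print([d for _, d in text_heavy])   # ['d2', 'd1']
print([d for _, d in space_heavy])  # ['d1', 'd3']
```

The same query returns different top-2 lists as the weight shifts from text to distance. An index such as the one the abstract describes would aim to produce this ranking without scanning every document.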

48 Knowledge Discovery in Services (KDS): Aggregating Software Services to Discover Enterprise Mashups

Abstract—Service mashup is the act of integrating the resulting data of two complementary software services into

a common picture. Such an approach is promising with respect to the discovery of new types of knowledge.

However, before service mashup routines can be executed, it is necessary to predict which services (of an open

repository) are viable candidates. Similar to Knowledge Discovery in Databases (KDD), we introduce the

Knowledge Discovery in Services (KDS) process that identifies mashup candidates. In this work, the KDS process

is specialized to address a repository of open services that do not contain semantic annotations. In these

situations, specialized techniques are required to determine equivalences among open services with reasonable

precision. This paper introduces a bottom-up process for KDS that adapts to the environment of services for which

it operates. Detailed experiments are discussed that evaluate KDS techniques on an open repository of services

from the Internet and on a repository of services created in a controlled environment.

49 Learning Semi-Riemannian Metrics for Semisupervised Feature Extraction

Abstract—Discriminant feature extraction plays a central role in pattern recognition and classification. Linear

Discriminant Analysis (LDA) is a traditional algorithm for supervised feature extraction. Recently, unlabeled data

have been utilized to improve LDA. However, the intrinsic problems of LDA still exist and only the similarity among

the unlabeled data is utilized. In this paper, we propose a novel algorithm, called Semisupervised Semi-Riemannian

Metric Map (S3RMM), following the geometric framework of semi-Riemannian manifolds. S3RMM maximizes the

discrepancy of the separability and similarity measures of scatters formulated by using semi-Riemannian metric

tensors. The metric tensor of each sample is learned via semisupervised regression. Our method can also be a

general framework for proposing new semisupervised algorithms, utilizing the existing discrepancy-criterion-

based algorithms. The experiments demonstrated on faces and handwritten digits show that S3RMM is promising

for semisupervised feature extraction.

50 Load Shedding in Mobile Systems with MobiQual

Abstract—In location-based, mobile continual query (CQ) systems, two key measures of quality-of-service (QoS)

are: freshness and accuracy. To achieve freshness, the CQ server must perform frequent query reevaluations. To

attain accuracy, the CQ server must receive and process frequent position updates from the mobile nodes.

However, it is often difficult to obtain fresh and accurate CQ results simultaneously, due to 1) limited resources in

computing and communication and 2) fast-changing load conditions caused by continuous mobile node

movement. Hence, a key challenge for a mobile CQ system is: How do we achieve the highest possible quality of

the CQ results, in both freshness and accuracy, with currently available resources? In this paper, we formulate this

problem as a load shedding one, and develop MobiQual—a QoS-aware approach to performing both update load

shedding and query load shedding. The design of MobiQual highlights three important features. 1) Differentiated

load shedding: We apply different amounts of query load shedding and update load shedding to different groups of

queries and mobile nodes, respectively. 2) Per-query QoS specification: Individualized QoS specifications are used

to maximize the overall freshness and accuracy of the query results. 3) Low-cost adaptation: MobiQual dynamically

adapts, with a minimal overhead, to changing load conditions and available resources. We conduct a set of

comprehensive experiments to evaluate the effectiveness of MobiQual. The results show that, through a careful

combination of update and query load shedding, the MobiQual approach leads to much higher freshness and

accuracy in the query results in all cases, compared to existing approaches that lack the QoS-awareness

properties of MobiQual, as well as the solutions that perform query-only or update-only load shedding.

51 Locally Consistent Concept Factorization for Document Clustering

Abstract—Previous studies have demonstrated that document clustering performance can be improved

significantly in lower dimensional linear subspaces. Recently, matrix factorization-based techniques, such as

Nonnegative Matrix Factorization (NMF) and Concept Factorization (CF), have yielded impressive results. However,

both of them effectively see only the global Euclidean geometry, whereas the local manifold geometry is not fully

considered. In this paper, we propose a new approach to extract the document concepts which are consistent with

the manifold geometry such that each concept corresponds to a connected component. Central to our approach is

a graph model which captures the local geometry of the document submanifold. Thus, we call it Locally Consistent

Concept Factorization (LCCF). By using the graph Laplacian to smooth the document-to-concept mapping, LCCF

can extract concepts with respect to the intrinsic manifold structure and thus documents associated with the same

concept can be well clustered. The experimental results on TDT2 and Reuters-21578 have shown that the proposed

approach provides a better representation and achieves better clustering results in terms of accuracy and mutual

information.

52 Making Aggregation Work in Uncertain and Probabilistic Databases

We describe how aggregation is handled in the Trio system for uncertain and probabilistic data. Because “exact”

aggregation in uncertain databases can produce exponentially sized results, we provide three alternatives: a low

bound on the aggregate value, a high bound on the value, and the expected value. These variants return a single result

instead of a set of possible results, and they are generally efficient to compute for both full-table and grouped

aggregation queries. We provide formal definitions and semantics and a description of our open source

implementation for single-table aggregation queries. We study the performance and scalability of our algorithms

through experiments over a large synthetic data set. We also provide some preliminary results on aggregations over

joins.
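
The three alternatives (low bound, high bound, expected value) can be sketched for a SUM aggregate in Python. This is only an illustration under assumptions: each uncertain tuple is modeled as a list of mutually exclusive `(value, probability)` alternatives, tuples are assumed independent, and a tuple whose probabilities sum to less than 1 may be absent. It is not the Trio implementation.

```python
def sum_bounds(xtuples, eps=1e-9):
    """Return (low, high, expected) for SUM over uncertain tuples.

    Each element of `xtuples` is a list of (value, probability) alternatives.
    """
    low = high = expected = 0.0
    for alternatives in xtuples:
        values = [v for v, _ in alternatives]
        total_p = sum(p for _, p in alternatives)
        if total_p < 1.0 - eps:
            # The tuple may be absent, in which case it contributes 0 to SUM.
            values.append(0.0)
        low += min(values)
        high += max(values)
        # Linearity of expectation: no need to enumerate possible worlds.
        expected += sum(v * p for v, p in alternatives)
    return low, high, expected


# Hypothetical data: one certain tuple, one with two alternatives,
# and one that exists with probability 0.25.
xtuples = [
    [(10, 1.0)],
    [(5, 0.5), (7, 0.5)],
    [(4, 0.25)],
]
result = sum_bounds(xtuples)
print(result)  # (15.0, 21.0, 17.0)
```

Each answer is a single number computed in one pass, whereas listing every possible sum would require enumerating all combinations of alternatives.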

53 Mining Cluster-Based Temporal Mobile Sequential Patterns in Location-Based Service Environments

Abstract—Research on Location-Based Services (LBS) has been emerging in recent years due to a wide range of

potential applications. One of the active topics is the mining and prediction of mobile movements and associated

transactions. Most existing studies focus on discovering mobile patterns from the whole logs. However, patterns of this

kind may not be precise enough for predictions since the differentiated mobile behaviors among users and

temporal periods are not considered. In this paper, we propose a novel algorithm, namely, Cluster-based Temporal

Mobile Sequential Pattern Mine (CTMSP-Mine), to discover the Cluster-based Temporal Mobile Sequential Patterns

(CTMSPs). Moreover, a prediction strategy is proposed to predict the subsequent mobile behaviors. In CTMSP-Mine,

user clusters are constructed by a novel algorithm named Cluster-Object-based Smart Cluster Affinity Search

Technique (CO-Smart-CAST) and similarities between users are evaluated by the proposed measure, Location-Based

Service Alignment (LBS-Alignment). Meanwhile, a time segmentation approach is presented to find segmenting time

intervals where similar mobile characteristics exist. To the best of our knowledge, this is the first work on mining and

prediction of mobile behaviors with considerations of user relations and temporal property simultaneously. Through

experimental evaluation under various simulated conditions, the proposed methods are shown to deliver excellent

performance.

54 Mining Discriminative Patterns for Classifying Trajectories on Road Networks

Classification has been used for modeling many kinds of data sets, including sets of items, text documents, graphs,

and networks. However, there is a lack of study on a new kind of data, trajectories on road networks. Modeling such

data is useful with the emerging GPS and RFID technologies and is important for effective transportation and traffic

planning. In this work, we study methods for classifying trajectories on road networks. By analyzing the behavior of

trajectories on road networks, we observe that, in addition to the locations where vehicles have visited, the order of

these visited locations is crucial for improving classification accuracy. Based on our analysis, we contend that

(frequent) sequential patterns are good feature candidates since they preserve this order information. Furthermore,

when mining sequential patterns, we propose to confine the length of sequential patterns to ensure high efficiency.

Compared with closed sequential patterns, these partial (i.e., length-confined) sequential patterns allow us to

significantly improve efficiency almost without losing accuracy. In this paper, we present a framework for frequent

pattern-based classification for trajectories on road networks. Our comparative study over a broad range of

classification approaches demonstrates that our method significantly improves accuracy over other methods in some

synthetic and real trajectory data.

55 Mining Group Movement Patterns for Tracking Moving Objects Efficiently

Existing object tracking applications focus on finding the moving patterns of a single object or all objects. In contrast,

we propose a distributed mining algorithm that identifies a group of objects with similar movement patterns. This

information is important in some biological research domains, such as the study of animals’ social behavior and

wildlife migration. The proposed algorithm comprises a local mining phase and a cluster ensembling phase. In the

local mining phase, the algorithm finds movement patterns based on local trajectories. Then, based on the derived

patterns, we propose a new similarity measure to compute the similarity of moving objects and identify the local group

relationships. To address the energy conservation issue in resource-constrained environments, the algorithm only

transmits the local grouping results to the sink node for further ensembling. In the cluster ensembling phase, our

algorithm combines the local grouping results to derive the group relationships from a global view. We further

leverage the mining results to track moving objects efficiently. The results of experiments show that the proposed

mining algorithm achieves good grouping quality, and the mining technique helps reduce the energy consumption by

reducing the amount of data to be transmitted.

56 Mining Iterative Generators and Representative Rules for Software Specification Discovery

Abstract—Billions of dollars are spent annually on software-related cost. It is estimated that up to 45 percent of

software cost is due to the difficulty in understanding existing systems when performing maintenance tasks (i.e.,

adding features, removing bugs, etc.). One of the root causes is that software products often come with poor,

incomplete, or even without any documented specifications. In an effort to improve program understanding, Lo et al.

have proposed iterative pattern mining which outputs patterns that are repeated frequently within a program trace, or

across multiple traces, or both. Frequent iterative patterns reflect frequent program behaviors that likely correspond to

software specifications. To reduce the number of patterns and improve the efficiency of the algorithm, Lo et al. have

also introduced mining closed iterative patterns, i.e., maximal patterns without any superpattern having the same

support. In this paper, to technically deepen research on iterative pattern mining, we introduce mining iterative

generators, i.e., minimal patterns without any subpattern having the same support. Iterative generators can be paired

with closed patterns to produce a set of rules expressing forward, backward, and in-between temporal constraints

among events in one general representation. We refer to these rules as representative rules. A comprehensive

performance study shows the efficiency of our approach. A case study on traces of an industrial system shows how

iterative generators and closed iterative patterns can be merged to form useful rules shedding light on software

design.

57 Missing Value Estimation for Mixed-Attribute Data Sets

Abstract—Missing data imputation is a key issue in learning from incomplete data. Various techniques have

been developed with great success in dealing with missing values in data sets with homogeneous attributes

(their independent attributes are all either continuous or discrete). This paper studies a new setting of missing

data imputation, i.e., imputing missing data in data sets with heterogeneous attributes (their independent

attributes are of different types), referred to as imputing mixed-attribute data sets. Although many real

applications are in this setting, there is no estimator designed for imputing mixed-attribute data sets. This paper

first proposes two consistent estimators for discrete and continuous missing target values, respectively.

Then, a mixture-kernel-based iterative estimator is advocated to impute mixed-attribute data sets. The proposed

method is evaluated with extensive experiments compared with some typical algorithms, and the result

demonstrates that the proposed approach is better than these existing imputation methods in terms of

classification accuracy and root mean square error (RMSE) at different missing ratios.

58 Monochromatic and Bichromatic Reverse Top-k Queries

Nowadays, most applications return to the user a limited set of ranked results based on the individual user’s

preferences, which are commonly expressed through top-k queries. From the perspective of a manufacturer, it is

imperative that her products appear in the highest ranked positions for many different user preferences,

otherwise the product is not visible to potential customers. In this paper, we define a novel query type, namely

the reverse top-k query, that covers this requirement: “Given a potential product, which are the user preferences

that make this product belong to the top-k query result set?” Reverse top-k queries are essential for

manufacturers to assess the impact of their products in the market based on the competition. We formally define

reverse top-k queries and introduce two versions of the query, monochromatic and bichromatic. First, we

provide a geometric interpretation of the monochromatic reverse top-k query to acquire an intuition of the

solution space. Then, we study in detail the case of bichromatic reverse top-k query, and we propose two

techniques for query processing, namely an efficient threshold-based algorithm and an algorithm based on

materialized reverse top-k views. Our experimental evaluation demonstrates the efficiency of our techniques.
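
The bichromatic reverse top-k query can be illustrated with a brute-force Python sketch. It is not the threshold-based or view-based algorithm of the paper; the linear scoring function, the convention that smaller scores are better, and the sample data are assumptions made for the example.

```python
def score(weights, point):
    """Weighted sum of attribute values; smaller is assumed to be better."""
    return sum(w * x for w, x in zip(weights, point))


def reverse_top_k(products, user_weights, query, k):
    """Return the user preferences for which `query` ranks among the top k."""
    result = []
    for w in user_weights:
        q_score = score(w, query)
        # Count the products that strictly beat the query under this preference.
        better = sum(1 for p in products if score(w, p) < q_score)
        if better < k:
            result.append(w)
    return result


# Hypothetical data: products as (price, distance), users as weight vectors.
products = [(1, 9), (5, 5), (9, 1)]
users = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
query = (2, 8)
matches = reverse_top_k(products, users, query, k=2)
print(matches)  # [(1.0, 0.0), (0.5, 0.5)]
```

This scan evaluates every product for every user, which is the cost that more efficient query processing techniques try to avoid.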

59 On Computing Farthest Dominated Locations

Abstract—In reality, spatial objects (e.g., hotels) not only have spatial locations but also have quality attributes

(e.g., price, star). An object p is said to dominate another object p′ if p is no worse than p′ with respect to every

quality attribute and p is better on at least one quality attribute. Traditional spatial queries (e.g., nearest

neighbor, closest pair) ignore quality attributes, whereas conventional dominance-based queries (e.g., skyline)

neglect spatial locations. Motivated by these observations, we propose a novel query by combining spatial and

quality attributes together meaningfully. Given a set of (competitors’) spatial objects P, a set of (candidate)

locations L, and a quality vector T as design competence (for L), the farthest dominated location (FDL) query

retrieves the location s ∈ L such that the distance to its nearest dominating object in P is maximized. FDL

queries are suitable for various spatial decision support applications such as business planning, wild animal

protection, and digital battlefield systems. As FDL queries cannot be readily solved by existing techniques, we

develop several efficient R-tree-based algorithms for processing FDL queries, which offer users a range of

selections in terms of different indexes available on the data. We also generalize our methods to support the

generic distance metric and other interesting query types. The experimental results on both real and synthetic

data sets disclose the performance of those algorithms, and reveal the most efficient and scalable one among

them.
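
The FDL definition above can be written down directly as a brute-force Python sketch. It is not one of the R-tree-based algorithms; it assumes Euclidean distance, quality attributes where smaller values are better, and hypothetical sample data.

```python
import math


def dominates(a, b):
    """True if a is no worse than b on every attribute and better on one.

    Smaller attribute values are assumed to be better.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def farthest_dominated_location(objects, candidates, design):
    """Pick the candidate whose nearest dominating competitor is farthest away.

    objects: list of (location, quality_vector); design: quality vector T.
    """
    dominators = [loc for loc, quality in objects if dominates(quality, design)]

    def nearest_dominator_distance(s):
        # If nothing dominates the design, every location is equally safe.
        return min((math.dist(s, loc) for loc in dominators), default=math.inf)

    return max(candidates, key=nearest_dominator_distance)


# Hypothetical competitors: (location, (price, rank)).
objects = [((0, 0), (50, 2)), ((10, 0), (80, 1)), ((5, 5), (200, 5))]
candidates = [(1, 1), (9, 1), (5, 9)]
best = farthest_dominated_location(objects, candidates, design=(100, 3))
print(best)  # (5, 9)
```

The third competitor does not dominate the design, so it is ignored even though it is the closest object to the chosen location.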

60 Optimizing Resource Conflicts in Workflow Management Systems

Abstract—Resource allocation and scheduling are fundamental issues in a Workflow Management System

(WfMS). Effective resource management in WfMS should examine resource allocation together with task

scheduling since these problems impose mutual constraints. Optimization of one factor is subject to the

constraints of the other and vice versa. Thus, an ideal algorithm should take into account not only performance

metrics of the infrastructure, such as the number of resources and their utilization, but also quality criteria such

as the percentage of tasks that violate their temporal restrictions. In this paper, we propose an

innovative algorithm which jointly optimizes the two aforementioned contradictory criteria. The algorithm, called

Resource Conflicts Joint Optimization (Re.Co.Jo.Op.), minimizes resource conflicts subject to temporal

constraints and simultaneously optimizes throughput or utilization subject to resource constraints. To achieve

the optimization, the two factors are formulated in a matrix form and the optimal solution is found by applying

concepts of the generalized eigenvalue analysis. A rough outline of an agent-based architecture is proposed to

achieve runtime integration of our algorithm into a functional WfMS, while experimental results under different

load environments and task assumptions reveal the superiority of the proposed strategy over other

conventional approaches.

61 Pareto-Based Dominant Graph: An Efficient Indexing Structure to Answer Top-K Queries

Given a record set D and a query score function F, a top-k query returns k records from D, whose values of

function F on their attributes are the highest. In this paper, we investigate the intrinsic connection between top-k

queries and dominant relationships between records and, based on this connection, propose an efficient layer-based

indexing structure, Pareto-Based Dominant Graph (DG), to improve the query efficiency. Specifically, DG is built

offline to express the dominant relationship between records and top-k query is implemented as a graph

traversal problem, i.e., Traveler algorithm. We prove theoretically that the size of search space (that is the

number of retrieved records from the record set to answer top-k query) in our algorithm is directly related to the

cardinality of skyline points in the record set (see Theorem 3). Considering I/O cost, we propose cluster-based

storage schema to reduce I/O cost in Traveler algorithm. We also propose the cost estimation methods in this

paper. Based on cost analysis, we propose an optimization technique, pseudo record, to further improve the

search efficiency. In order to handle the top-k query in the high-dimension record set, we also propose N-Way

Traveler algorithm. In order to handle DG maintenance efficiently, we propose “Insertion” and “Deletion”

algorithms for DG. Finally, extensive experiments demonstrate that our proposed methods offer significant

improvement over their counterparts, including both classical and state-of-the-art top-k algorithms.
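
The link between dominance and top-k queries can be sketched in Python. This is a simplified illustration, not the DG structure or the Traveler algorithm from the paper: it stores all dominators of each record rather than layered parent-child links, assumes larger attribute values are better, and assumes a linear score function with strictly positive weights.

```python
import heapq


def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def build_dominators(records):
    """Offline step: map each record index to the indices that dominate it."""
    return {
        i: {j for j, other in enumerate(records) if dominates(other, r)}
        for i, r in enumerate(records)
    }


def top_k(records, dominators, weights, k):
    """Answer a top-k query by traversal; also report how many records were scored."""
    def score(i):
        return sum(w * x for w, x in zip(weights, records[i]))

    remaining = {i: set(parents) for i, parents in dominators.items()}
    # Only undominated (skyline) records are candidates at the start.
    heap = [(-score(i), i) for i, parents in remaining.items() if not parents]
    heapq.heapify(heap)
    result, scored = [], len(heap)
    while heap and len(result) < k:
        _, best = heapq.heappop(heap)
        result.append(records[best])
        for i, parents in remaining.items():
            if best in parents:
                parents.discard(best)
                if not parents:  # every dominator has already been returned
                    heapq.heappush(heap, (-score(i), i))
                    scored += 1
    return result, scored


records = [(9, 9), (8, 3), (3, 8), (2, 2), (1, 1)]  # hypothetical data
dominators = build_dominators(records)
best_two, scored = top_k(records, dominators, weights=(0.7, 0.3), k=2)
print(best_two, scored)  # [(9, 9), (8, 3)] 3
```

Because a dominated record can never outscore its dominators under such a function, only three of the five records need to be scored here.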

62 Privacy-Preserving OLAP: An Information-Theoretic Approach

We address issues related to the protection of private information in Online Analytical Processing (OLAP) systems,

where a major privacy concern is the adversarial inference of private information from OLAP query answers. Most

previous work on privacy-preserving OLAP focuses on a single aggregate function and/or addresses only exact

disclosure, which eliminates from consideration an important class of privacy breaches where partial information,

but not exact values, of private data is disclosed (i.e., partial disclosure). We address privacy protection against

both exact and partial disclosure in OLAP systems with mixed aggregate functions. In particular, we propose an

information-theoretic inference control approach that supports a combination of common aggregate functions

(e.g., COUNT, SUM, MIN, MAX, and MEDIAN) and guarantees the level of privacy disclosure not to exceed

thresholds predetermined by the data owners. We demonstrate that our approach is efficient and can be

implemented in existing OLAP systems with little modification. It also satisfies the simulatable auditing model and

leaks no private information through query rejections. Through performance analysis, we show that compared with

previous approaches, our approach provides more effective privacy protection while maintaining a higher level of

query-answer availability.

63 Ranking Spatial Data by Quality Preferences

Abstract—A spatial preference query ranks objects based on the qualities of features in their spatial neighborhood.

For example, using a real estate agency database of flats for lease, a customer may want to rank the flats with

respect to the appropriateness of their location, defined after aggregating the qualities of other features (e.g.,

restaurants, cafes, hospital, market, etc.) within their spatial neighborhood. Such a neighborhood concept can be

specified by the user via different functions. It can be an explicit circular region within a given distance from the

flat. Another intuitive definition is to assign higher weights to the features based on their proximity to the flat. In

this paper, we formally define spatial preference queries and propose appropriate indexing techniques and search

algorithms for them. Extensive evaluation of our methods on both real and synthetic data reveals that an optimized

branch-and-bound solution is efficient and robust with respect to different parameters.
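
The circular-region form of the neighborhood can be illustrated with a brute-force Python sketch. It is not the indexed or branch-and-bound method of the paper; the choice to take the best quality per feature type and sum across types, the radius, and the sample data are assumptions for illustration.

```python
import math


def range_score(flat, feature_sets, radius):
    """Sum, over feature types, of the best quality found within `radius`."""
    total = 0.0
    for features in feature_sets:
        in_range = [q for loc, q in features if math.dist(flat, loc) <= radius]
        total += max(in_range, default=0.0)
    return total


def rank_flats(flats, feature_sets, radius):
    """Order flats from the most to the least appropriate location."""
    return sorted(flats, key=lambda f: -range_score(f, feature_sets, radius))


# Hypothetical features: (location, quality in [0, 1]).
restaurants = [((1, 1), 0.9), ((6, 6), 0.5)]
cafes = [((2, 0), 0.4), ((5, 5), 0.6)]
flats = [(0, 0), (5, 6), (10, 10)]
ranking = rank_flats(flats, [restaurants, cafes], radius=2.0)
print(ranking)  # [(0, 0), (5, 6), (10, 10)]
```

Replacing `range_score` with a function that weights each feature by its proximity would give the second neighborhood definition mentioned in the abstract.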

64 RFID Data Processing in Supply Chain Management Using a Path Encoding Scheme

Abstract—RFID technology can be applied to a broad range of areas. In particular, RFID is very useful in the area of

business, such as supply chain management. However, the amount of RFID data in such an environment is huge.

Therefore, much time is needed to extract valuable information from RFID data for supply chain management. In

this paper, we present an efficient method to process a massive amount of RFID data for supply chain

management. We first define query templates to analyze the supply chain. We then propose an effective path

encoding scheme that encodes the flows of products. However, if the flows are long, the numbers in the path

encoding scheme that correspond to the flows will be very large. We solve this by providing a method that divides

flows. To retrieve the time information for products efficiently, we utilize a numbering scheme for the XML area.

Based on the path encoding scheme and the numbering scheme, we devise a storage scheme that can process

tracking queries and path oriented queries efficiently on an RDBMS. Finally, we propose a method that translates

the queries to SQL queries. Experimental results show that our approach can process the queries efficiently.

65 Seeking Quality of Web Service Composition in a Semantic Dimension

Abstract—Ranking and optimization of web service compositions represent challenging areas of research with

significant implications for the realization of the “Web of Services” vision. “Semantic web services” use formal

semantic descriptions of web service functionality and interface to enable automated reasoning over web service

compositions. To judge the quality of the overall composition, for example, we can start by calculating the

semantic similarities between outputs and inputs of connected constituent services, and aggregate these values

into a measure of semantic quality for the composition. This paper takes a specific interest in combining semantic

and nonfunctional criteria such as quality of service (QoS) to evaluate quality in web services composition. It

proposes a novel and extensible model balancing the new dimension of semantic quality (as a functional quality

metric) with a QoS metric, and using them together as ranking and optimization criteria. It also demonstrates the

utility of Genetic Algorithms to allow optimization within the context of a large number of services foreseen by the

“Web of Services” vision. We test the performance of the overall approach using a set of simulation experiments,

and discuss its advantages and weaknesses.

66 Selecting Attributes for Sentiment Classification Using Feature Relation Networks

A major concern when incorporating large sets of diverse n-gram features for sentiment classification is the presence

of noisy, irrelevant, and redundant attributes. These concerns can often make it difficult to harness the augmented

discriminatory potential of extended feature sets. We propose a rule-based multivariate text feature selection method

called Feature Relation Network (FRN) that considers semantic information and also leverages the syntactic

relationships between n-gram features. FRN is intended to efficiently enable the inclusion of extended sets of

heterogeneous n-gram features for enhanced sentiment classification. Experiments were conducted on three online

review testbeds in comparison with methods used in prior sentiment classification research. FRN outperformed the

comparison univariate, multivariate, and hybrid feature selection methods; it was able to select attributes resulting in

significantly better classification accuracy irrespective of the feature subset sizes. Furthermore, by incorporating

syntactic information about n-gram relations, FRN is able to select features in a more computationally efficient manner

than many multivariate and hybrid techniques.

67 Semantic Knowledge-Based Framework to Improve the Situation Awareness of Autonomous Underwater Vehicles

This paper proposes a semantic world model framework for hierarchical distributed representation of knowledge in

autonomous underwater systems. This framework aims to provide a more capable and holistic system, involving

semantic interoperability among all involved information sources. This will enhance interoperability, independence of

operation, and situation awareness of the embedded service-oriented agents for autonomous platforms. The results

obtained specifically affect the mission flexibility, robustness, and autonomy. The presented framework makes use of

the idea that heterogeneous real-world data of very different types must be processed by (and run through) several

different layers, to be finally available in a suitable format and at the right place to be accessible by high-level decision-

making agents. In this sense, the presented approach shows how to abstract away from the raw real-world data step

by step by means of semantic technologies. The paper concludes by demonstrating the benefits of the framework in a

real scenario. A hardware fault is simulated in a REMUS 100 AUV while performing a mission. This triggers a

knowledge exchange between the status monitoring agent and the adaptive mission planner embedded agent. By

using the proposed framework, both services can interchange information while remaining domain independent during

their interaction with the platform. The results of this paper are readily applicable to land and air robotics.

68 Straggler Identification in Round-Trip Data Streams via Newton’s Identities and Invertible Bloom Filters

Abstract—In this paper, we study the straggler identification problem, in which an algorithm must determine the

identities of the remaining members of a set after it has had a large number of insertion and deletion operations

performed on it, and now has relatively few remaining members. The goal is to do this in oðnÞ space, where n is the

total number of identities. Straggler identification has applications, for example, in determining the unacknowledged

packets in a high-bandwidth multicast data stream. We provide a deterministic solution to the straggler identification

problem that uses only Oðd log nÞ bits, based on a novel application of Newton’s identities for symmetric

polynomials. This solution can identify any subset of d stragglers from a set of nOðlog nÞ-bit identifiers, assuming

that there are no false deletions of identities not already in the set. Indeed, we give a lower bound argument that shows

that any small-space deterministic solution to the straggler identification problem cannot be guaranteed to handle

false deletions. Nevertheless, we provide a simple randomized solution, using O(d log n log(1/ε)) bits that can

maintain a multiset and solve the straggler identification problem, tolerating false deletions, where ε > 0 is a

user-defined parameter bounding the probability of an incorrect response. This randomized solution is based on a new type

of Bloom filter, which we call the invertible Bloom filter.
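
As a rough illustration of the invertible Bloom filter idea, the following Python sketch keeps a count, an XOR of identifiers, and an XOR of check hashes in each cell, and recovers the remaining identifiers by repeatedly "peeling" cells that hold exactly one item. The class name, the table size, the use of SHA-256, and the sub-table layout are assumptions made for this example; it is not the construction or the parameter choice analyzed in the paper.

```python
import hashlib


class InvertibleBloomFilter:
    """Toy invertible Bloom filter over non-negative integer identifiers."""

    def __init__(self, num_cells=90, num_hashes=3):
        self.k = num_hashes
        self.size = num_cells // num_hashes   # cells per sub-table
        self.m = self.size * num_hashes
        self.count = [0] * self.m
        self.id_sum = [0] * self.m            # XOR of identifiers in the cell
        self.check_sum = [0] * self.m         # XOR of check hashes in the cell

    @staticmethod
    def _hash(x, salt):
        digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def _cells(self, x):
        # One cell per sub-table, so the k cells of an identifier are distinct.
        return [i * self.size + self._hash(x, i) % self.size
                for i in range(self.k)]

    def _update(self, x, delta):
        check = self._hash(x, "check")
        for c in self._cells(x):
            self.count[c] += delta
            self.id_sum[c] ^= x
            self.check_sum[c] ^= check

    def insert(self, x):
        self._update(x, +1)

    def delete(self, x):
        self._update(x, -1)

    def list_stragglers(self):
        """Return (sorted identifiers, True if the table was fully decoded)."""
        count = list(self.count)
        id_sum = list(self.id_sum)
        check_sum = list(self.check_sum)
        found = []
        progress = True
        while progress:
            progress = False
            for c in range(self.m):
                # A cell is "pure" if it holds one identifier whose check hash matches.
                if count[c] != 1 or check_sum[c] != self._hash(id_sum[c], "check"):
                    continue
                x = id_sum[c]
                cells = self._cells(x)
                if c not in cells:
                    continue  # guard against an (unlikely) false match
                check = self._hash(x, "check")
                found.append(x)
                for j in cells:
                    count[j] -= 1
                    id_sum[j] ^= x
                    check_sum[j] ^= check
                progress = True
        complete = not any(count) and not any(id_sum) and not any(check_sum)
        return sorted(found), complete
```

The table size is fixed in advance and does not grow with the number of insertions, which is the point of the structure: after many inserts and deletes, only the few stragglers still contribute to the cells. Decoding can fail if too many identifiers remain for the chosen table size, which is why `list_stragglers` also reports whether the table was fully decoded.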

69 SwiftRule: Mining Comprehensible Classification Rules for Time Series Analysis

Abstract—In this article, we provide a new technique for temporal data mining which is based on classification rules

that can easily be understood by human domain experts. Basically, time series are decomposed into short segments,

and short-term trends of the time series within the segments (e.g., average, slope, and curvature) are described by

means of polynomial models. Then, the classifiers assess short sequences of trends in subsequent segments with

their rule premises. The conclusions gradually assign an input to a class. As the classifier is a generative model of the

processes from which the time series are assumed to originate, anomalies can be detected, too. Segmentation and

piecewise polynomial modeling are done extremely fast in only one pass over the time series. Thus, the approach is

applicable to problems with harsh timing constraints. We lay the theoretical foundations for this classifier, including a

new distance measure for time series and a new technique to construct a dynamic classifier from a static one, and

demonstrate its properties by means of various benchmark time series, for example, Lorenz attractor time series,

energy consumption in a building, or ECG data.
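
The segment-wise trend description mentioned in this abstract (average and slope per segment) can be illustrated with a small Python sketch. It assumes fixed-length segments and an ordinary least-squares line per segment; the function name and the fixed segment length are choices made for this example, and the paper's own one-pass segmentation and polynomial basis are not reproduced here.

```python
def segment_trends(series, seg_len):
    """Return one (average, slope) pair per full segment of length seg_len."""
    features = []
    for start in range(0, len(series) - seg_len + 1, seg_len):
        seg = series[start:start + seg_len]
        n = len(seg)
        t_mean = (n - 1) / 2                  # mean of time indices 0..n-1
        y_mean = sum(seg) / n                 # average level of the segment
        num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(seg))
        den = sum((t - t_mean) ** 2 for t in range(n))
        slope = num / den if den else 0.0     # least-squares slope
        features.append((y_mean, slope))
    return features
```

A rule premise could then test short sequences of such pairs, for example "slope clearly positive, followed by slope near zero".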

70 Temporal Data Clustering via Weighted Clustering Ensemble with Different Representations

Temporal data clustering provides underpinning techniques for discovering the intrinsic structure and condensing

information over temporal data. In this paper, we present a temporal data clustering framework via a weighted

clustering ensemble of multiple partitions produced by initial clustering analysis on different temporal data

representations. In our approach, we propose a novel weighted consensus function guided by clustering validation

criteria to reconcile initial partitions to candidate consensus partitions from different perspectives, and then, introduce

an agreement function to further reconcile those candidate consensus partitions to a final partition. As a result, the

proposed weighted clustering ensemble algorithm provides an effective enabling technique for the joint use of

different representations, which cuts the information loss in a single representation and exploits various information

sources underlying temporal data. In addition, our approach tends to capture the intrinsic structure of a data set, e.g.,

the number of clusters. Our approach has been evaluated with benchmark time series, motion trajectory, and

time-series data stream clustering tasks. Simulation results demonstrate that our approach yields favorable results for a

variety of temporal data clustering tasks. As our weighted cluster ensemble algorithm can combine any input

partitions to generate a clustering ensemble, we also investigate its limitation by formal analysis and empirical studies.
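
One common way to combine several partitions into a consensus is a weighted co-association matrix; the Python sketch below illustrates that general idea under stated assumptions. The weights are supplied by the caller (in the paper they are derived from clustering validation criteria), and the final step, which links objects whose weighted agreement exceeds a threshold, is a simple stand-in rather than the paper's consensus and agreement functions.

```python
def weighted_coassociation(partitions, weights):
    """Entry [i][j] is the weighted share of partitions that co-cluster i and j."""
    n = len(partitions[0])
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(partitions, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w / total
    return matrix


def consensus_partition(matrix, threshold=0.5):
    """Group objects connected by co-association values above the threshold."""
    n = len(matrix)
    labels = [-1] * n
    next_label = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels
```

Each input partition could come from clustering the same temporal data under a different representation, so that a partition trusted more by the validation criteria contributes more to the consensus.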

71 TEXT: Automatic Template Extraction from Heterogeneous Web Pages

The World Wide Web is the most useful source of information. In order to achieve high productivity of publishing, the

webpages in many websites are automatically populated by using the common templates with contents. The

templates provide readers easy access to the contents guided by consistent structures. However, for machines,

the templates are considered harmful since they degrade the accuracy and performance of web applications due

to irrelevant terms in templates. Thus, template detection techniques have received a lot of attention recently to

improve the performance of search engines, clustering, and classification of web documents. In this paper, we

present novel algorithms for extracting templates from a large number of web documents which are generated

from heterogeneous templates. We cluster the web documents based on the similarity of underlying template

structures in the documents so that the template for each cluster is extracted simultaneously. We develop a

novel goodness measure with its fast approximation for clustering and provide comprehensive analysis of our

algorithm. Our experimental results with real-life data sets confirm the effectiveness and robustness of our

algorithm compared to the state of the art for template detection algorithms.
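
To make the idea of clustering pages by template structure and extracting one template per cluster more concrete, here is a deliberately simplified Python sketch. It assumes each page has already been reduced to a set of strings (tag paths and text tokens), uses Jaccard similarity with a fixed threshold, and takes the template of a cluster to be the items shared by all of its members. These are assumptions for illustration only; the goodness measure and its fast approximation described in the abstract are not implemented here.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_and_extract(docs, threshold=0.5):
    """Greedy clustering; docs maps a page name to its set of paths/tokens."""
    clusters = []  # each cluster: {"members": [names], "template": set}
    for name, paths in docs.items():
        best = max(clusters,
                   key=lambda c: jaccard(c["template"], paths),
                   default=None)
        if best is not None and jaccard(best["template"], paths) >= threshold:
            best["members"].append(name)
            best["template"] &= paths       # keep only what all members share
        else:
            clusters.append({"members": [name], "template": set(paths)})
    return clusters
```

Because the template is updated while members are being assigned, clustering and template extraction happen in the same pass, which loosely mirrors the "extracted simultaneously" wording above.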

72 Text Clustering with Seeds Affinity Propagation

Abstract—Based on an effective clustering algorithm—Affinity Propagation (AP)—we present in this paper a

novel semisupervised text clustering algorithm, called Seeds Affinity Propagation (SAP). There are two main

contributions in our approach: 1) a new similarity metric that captures the structural information of texts, and 2)

a novel seed construction method to improve the semisupervised clustering process. To study the performance

of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it to two state-of-the-

art clustering algorithms, namely, k-means algorithm and the original AP algorithm. Furthermore, we have

analyzed the individual impact of the two proposed contributions. Results show that the proposed similarity

metric is more effective in text clustering (F-measures ca. 21 percent higher than in the AP algorithm) and the

proposed semisupervised strategy achieves both better clustering results and faster convergence (using only 76

percent of the iterations of the original AP). The complete SAP algorithm obtains a higher F-measure (ca. 40 percent

improvement over k-means and AP) and lower entropy (ca. 28 percent decrease over k-means and AP), significantly

improves clustering execution time (20 times faster) with respect to k-means, and provides enhanced

robustness compared with all other methods.

73 The CoQUOS Approach to Continuous Queries in Unstructured Overlays

Abstract—The current peer-to-peer (P2P) content distribution systems are constricted by their simple on-

demand content discovery mechanism. The utility of these systems can be greatly enhanced by incorporating

two capabilities, namely a mechanism through which peers can register their long term interests with the

network so that they can be continuously notified of new data items, and a means for the peers to advertise their

contents. Although researchers have proposed a few unstructured overlay-based publish-subscribe systems that

provide the above capabilities, most of these systems require intricate indexing and routing schemes, which not

only make them highly complex but also render the overlay network less flexible toward transient peers. This

paper argues that for many P2P applications, implementing full-fledged publish-subscribe systems is overkill.

For these applications, we study the alternate continuous query paradigm, which is a best-effort service

providing the above two capabilities. We present a scalable and effective middleware, called CoQUOS, for

supporting continuous queries in unstructured overlay networks. Besides being independent of the overlay

topology, CoQUOS preserves the simplicity and flexibility of the unstructured P2P network. Our design of the

CoQUOS system is characterized by two novel techniques, namely, a cluster-resilient random walk algorithm for

propagating the queries to various regions of the network and a dynamic probability-based query registration

scheme to ensure that the registrations are well distributed in the overlay. Further, we develop effective and

efficient schemes for providing resilience to the churn of the P2P network and for ensuring a fair distribution of

the notification load among the peers. This paper studies the properties of our algorithms through theoretical

analysis. We also report a series of experiments evaluating the effectiveness and the costs of the proposed

schemes.
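
The two techniques named in this abstract can be sketched together in a few lines of Python. The sketch rests on assumptions that the abstract does not spell out: the walk is biased toward neighbors that share few neighbors with the current peer (one plausible reading of "cluster-resilient"), and the registration probability grows with each hop that passes without a registration and is reset after one. The function name and the parameters `base_p` and `step` are hypothetical.

```python
import random


def propagate_query(graph, start, ttl, rng, base_p=0.1, step=0.1):
    """Walk for ttl hops; return the peers at which the query was registered."""
    registered = []
    current = start
    p = base_p
    for _ in range(ttl):
        neighbors = graph[current]
        if not neighbors:
            break
        here = set(neighbors) | {current}
        # Prefer neighbors whose own neighborhoods overlap little with ours,
        # so the walk tends to leave densely connected clusters.
        weights = [len(set(graph[nb]) - here) + 1 for nb in neighbors]
        current = rng.choices(neighbors, weights=weights, k=1)[0]
        if rng.random() < p:
            registered.append(current)
            p = base_p                  # reset after a registration
        else:
            p = min(1.0, p + step)      # more likely to register next hop
    return registered
```

Here `graph` is an adjacency dictionary mapping each peer to a list of its neighbors, and `rng` is a `random.Random` instance so that runs can be reproduced with a fixed seed.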

74 The World in a Nutshell: Concise Range Queries

With the advance of wireless communication technology, it is quite common for people to view maps or get

related services from handheld devices, such as mobile phones and PDAs. Range queries, as one of the

most commonly used tools, are often posed by the users to retrieve needed information from a spatial database.

However, due to the limits of communication bandwidth and hardware power of handheld devices, displaying all

the results of a range query on a handheld device is neither communication-efficient nor informative to the users.

This is simply because there are often too many results returned from a range query. In view of this

problem, we present a novel idea that a concise representation of a specified size for the range query results,

while incurring minimal information loss, shall be computed and returned to the user. Such a concise range

query not only reduces communication costs, but also offers better usability to the users, providing an

opportunity for interactive exploration. The usefulness of the concise range queries is confirmed by comparing

it with other possible alternatives, such as sampling and clustering. Unfortunately, we prove that finding the

optimal representation with minimum information loss is an NP-hard problem. Therefore, we propose several

effective and nontrivial algorithms to find a good approximate result. Extensive experiments on real-world data

have demonstrated the effectiveness and efficiency of the proposed techniques.
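
As a toy illustration of returning a fixed-size summary instead of every result, the Python sketch below groups the points of a range query result into at most k bounding boxes with counts. It uses a naive equal-size split along the x-axis, which is an assumption made only for this example; it makes no attempt to minimize information loss and is not one of the approximation algorithms proposed in the paper.

```python
def concise_summary(points, k):
    """Summarize 2-D points as at most k (bounding_box, count) pairs."""
    if not points or k < 1:
        return []
    pts = sorted(points)                  # order by x, then y
    size = -(-len(pts) // k)              # ceiling division: points per group
    summary = []
    for i in range(0, len(pts), size):
        group = pts[i:i + size]
        xs = [p[0] for p in group]
        ys = [p[1] for p in group]
        box = (min(xs), min(ys), max(xs), max(ys))
        summary.append((box, len(group)))
    return summary
```

A client could draw each box with its count and then issue a narrower query inside a box of interest, which is the kind of interactive exploration the abstract refers to.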

75 Usher: Improving Data Quality with Dynamic Forms

Data quality is a critical problem in modern databases. Data-entry forms present the first and arguably best

opportunity for detecting and mitigating errors, but there has been little research into automatic methods for

improving data quality at entry time. In this paper, we propose USHER, an end-to-end system for form design,

entry, and data quality assurance. Using previous form submissions, USHER learns a probabilistic model over

the questions of the form. USHER then applies this model at every step of the data-entry process to improve

data quality. Before entry, it induces a form layout that captures the most important data values of a form

instance as quickly as possible and reduces the complexity of error-prone questions. During entry, it

dynamically adapts the form to the values being entered by providing real-time interface feedback, reasking

questions with dubious responses, and simplifying questions by reformulating them. After entry, it revisits

question responses that it deems likely to have been entered incorrectly by reasking the question or a

reformulation thereof. We evaluate these components of USHER using two real-world data sets. Our results

demonstrate that USHER can improve data quality considerably at a reduced cost when compared to current

practice.
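
The "before entry" step, choosing a question order that captures the most informative values first, can be approximated with a short Python sketch. It ranks each question by the entropy of its answers in previous submissions and asks high-entropy questions first. This is a simplification that treats questions as independent; the probabilistic model over questions described in the abstract is not built here, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def order_questions(submissions):
    """Return question names sorted from most to least uncertain answers."""
    fields = list(submissions[0].keys())
    scores = {f: entropy([s[f] for s in submissions]) for f in fields}
    return sorted(fields, key=lambda f: scores[f], reverse=True)
```

Questions whose answers are nearly always the same end up last, since asking them tells the system little about the form instance being entered.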
