TRANSCRIPT
Join Size Estimation Subject to Filter Conditions
Type: Research Paper
Authors: David Vengerov, Andre Cavalheiro Menck, Mohamed Zait, Sunil P. Chakkappen (Oracle)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
Query cost! Estimating the size of the resultant table if a join is performed on two or more tables, while accommodating the filter predicates.
Related Work
Related work of this paper focuses on:
Sampling techniques such as Bifocal Sampling and End-Biased Sampling
Sketch-based methods for join size estimation
Contribution
Correlated Sampling: a novel sketch-based join size estimation technique.
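The idea behind correlated sampling can be illustrated with a small sketch (my own simplified reconstruction, not the authors' code): both tables keep only tuples whose join key hashes below a threshold p, so matching keys survive together in both samples, and the size of the joined samples is scaled up by 1/p.

```python
import hashlib
from collections import Counter

def h(key):
    # Deterministic hash of the join key to a number in [0, 1).
    d = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def correlated_sample(rows, keyfn, p):
    # Keep a row iff its join key hashes below p. Because both tables
    # use the same hash function, matching keys are sampled together.
    return [r for r in rows if h(keyfn(r)) < p]

def estimate_join_size(sample_r, sample_s, keyfn_r, keyfn_s, p):
    # Join the two samples and scale: each join key survives into
    # both samples with probability p.
    counts = Counter(keyfn_s(s) for s in sample_s)
    joined = sum(counts[keyfn_r(r)] for r in sample_r)
    return joined / p
```

With p = 1.0 the estimate degenerates to the exact join size; smaller p trades accuracy for sample size.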
Reformulation based query answering in RDF: alternatives and performance
Type: Demonstration Paper
Authors: Damian Bursztyn (University of Paris-Sud), François Goasdoué (University of Rennes), Ioana Manolescu (University of Paris-Sud)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Introduction
RDF = Resource Description Framework You can query this data!
Processing queries and displaying results is called query answering!
Motivation
Two ways of query answering: Saturation based (SAT) and Reformulation based (REF) – PROBLEM!
Showcase a large set of REF techniques (along with one the authors presented in another paper).
Contribution
What the demo system allows you to do:
Pick an RDF graph and visualize its statistics
Choose a query and an answering method
Observe evaluation in real time
Modify the RDF data and reevaluate
Error Diagnosis and Data Profiling with Data X-Ray
Presented by: Omar Alqahtani
Fall 2015
Authors: Xiaolan Wang, Mary Feng, Yue Wang, Xin Luna Dong, Alexandra Meliou (University of Massachusetts, University of Iowa, Google Inc.)
Motivation
Retrieving high quality datasets from voluminous and diverse sources is crucial for many data-intensive applications.
However, the retrieved datasets often contain noise and other discrepancies.
Traditional data cleaning tools mostly try to answer: “Which data is incorrect?”
Contribution
Demonstration of DATAXRAY, a general-purpose, highly-scalable tool.
It explains why and how errors happen in a data generative process
It answers: Why are there errors in the data? or
How can I prevent further errors?
It finds groupings of errors that may be due to the same cause. How? It identifies these groups based on their common characteristics (features).
The Core of Data X-Ray
Features are organized in a hierarchical structure based on simple containment relationships.
DATAXRAY uses a top-down algorithm to explore the feature hierarchy, to identify the set of features that best summarize all erroneous data elements.
It uses a cost function based on Bayesian analysis to derive the set of features with the highest probability of being associated with the causes of the mistakes in a dataset.
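A much-simplified sketch of such a top-down exploration (the cost function here is a flat per-feature penalty plus the clean items a feature sweeps in, standing in for the paper's Bayesian cost; the feature paths and data are invented for illustration):

```python
from collections import defaultdict

ALPHA = 1.0  # fixed cost for reporting a feature (illustrative stand-in
             # for Data X-Ray's Bayesian cost function)

def diagnose(items, depth=0, prefix=()):
    """items: list of (feature_path, is_error), all paths same length.
    Returns (feature prefixes summarizing the errors, total cost)."""
    errors = [i for i in items if i[1]]
    if not errors:
        return [], 0.0                    # nothing to explain here
    clean = len(items) - len(errors)
    claim_cost = ALPHA + clean            # pay for the feature plus every
                                          # clean item it wrongly covers
    if depth == len(items[0][0]):         # leaf: no children to try
        return [prefix], claim_cost
    # Partition by the next path component and recurse top-down.
    groups = defaultdict(list)
    for path, err in items:
        groups[path[depth]].append((path, err))
    child_feats, child_cost = [], 0.0
    for part, group in groups.items():
        f, c = diagnose(group, depth + 1, prefix + (part,))
        child_feats += f
        child_cost += c
    # Keep whichever summary is cheaper: this node or its children.
    if claim_cost <= child_cost:
        return [prefix], claim_cost
    return child_feats, child_cost
```

If every error comes from one source, the algorithm reports that single source feature instead of listing each erroneous element.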
WADaR: Joint Wrapper and Data Repair
Authors: Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche
Department of Computer Science, Oxford University, United Kingdom; Dipartimento di Matematica, Informatica ed Economia, Università della Basilicata, Italy
[email protected]
Paper Type: Demo
Presented by:
Ranjan_KY
Fall 2015
Motivation
Web scraping (or wrapping) is a popular means for acquiring data from the web.
The current generation of tools has made scalable wrapper generation possible and enabled data acquisition processes involving thousands of sources.
However, no scalable tools exist that support these tasks.
Problem
Modern wrapper-generation systems leverage a number of features, ranging from HTML and visual structures to knowledge bases and micro-data.
Nevertheless, automatically-generated wrappers often suffer from errors resulting in under/over segmented data, together with missing or spurious content.
Under and over segmentation of attributes are commonly caused by irregular HTML markups or by multiple attributes occurring within the same DOM node.
Incorrect column types are instead associated with the lack of domain knowledge, supervision, or micro-data during wrapper generation.
The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper, so that future wrapper executions can produce cleaner data.
Demonstration
WADaR takes as input a (possibly incorrect) wrapper and a target relation schema, and iteratively repairs both the generated relations and the wrapper by observing the output of the wrapper execution.
A key observation is that errors in the extracted relations are likely to be systematic as wrappers are often generated from templated websites.
WADaR’s repair process:
(i) Annotating the extracted relations with standard entity recognizers,
(ii) Computing Markov chains describing the most likely segmentation of attribute values in the records, and
(iii) Inducing regular expressions which re-segment the input relation according to the given target schema and that can possibly be encoded back into the wrapper.
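Step (iii) can be illustrated on a toy under-segmented relation (the records and the regular expression are invented; WADaR induces such expressions automatically from the annotations and segmentations of steps (i)-(ii)):

```python
import re

# Hypothetical under-segmented records: three attributes collapsed into a
# single extracted value by a faulty wrapper. Because wrappers run over
# templated websites, the error is systematic across records.
records = [
    "Margherita Pizza 12.50 GBP",
    "Penne Arrabbiata 9.00 EUR",
    "Caesar Salad 7.25 USD",
]

# An induced regular expression (written by hand for this sketch) matching
# the target schema: free text, then a PRICE, then a CURRENCY code.
SEGMENT = re.compile(
    r"^(?P<name>.+?)\s+(?P<price>\d+\.\d{2})\s+(?P<currency>[A-Z]{3})$")

def resegment(value):
    # Re-segment one record into the target schema, or None on no match.
    m = SEGMENT.match(value)
    return (m.group("name"), m.group("price"), m.group("currency")) if m else None
```

Because the same expression repairs every record, it can also be encoded back into the wrapper so future extractions are clean.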
Related work
In this paper, related work was not evaluated in detail
[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of partially overlapping web sources. PVLDB, 6(10):805–816, 2013.
[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB, 6(13):1486–1497, 2013.
[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. Tegra: Table extraction by global record alignment. In SIGMOD, pages 1713–1728. ACM, 2015.
ConfSeer: Leveraging Customer Support Knowledge Bases for Automated Misconfiguration Detection
Authors: Rahul Potharaju, Joseph Chan, Luhui Hu, Mingshi Wang, Liyuan Zhang (Microsoft), Cristina Nita-Rotaru (Purdue University), Navendu Jain (Microsoft Research)
Presented by: Zohreh Raghebi
Introduction
Configuration errors have a significant impact on system performance and availability.
For instance, a misconfiguration in the user-authentication system caused login problems for several Google services including Gmail and Drive
A software misconfiguration in Windows Azure caused a 2.5 hour outage in 2012
Many configuration errors are due to faulty patches, e.g., changed file paths causing incompatibility with other applications, empty fields in the Registry, or failed uninstallations.
Introduction
Unfortunately, troubleshooting misconfigurations is time consuming, hard and expensive.
First, today’s software configurations are becoming increasingly complex and large comprising hundreds of parameters and their settings
Second, many of these errors manifest as silent failures leaving users clueless:
They either search online or contact customer service and support (CSS)
loss of productivity, time and effort
Related work
Several research efforts have proposed techniques to identify, diagnose and fix configuration problems.
Some commercial tools are also available to manage system configurations or to automate certain configuration tasks
Many of these approaches either assume the presence of a large set of configurations to apply statistical testing (e.g., PeerPressure),
periodically checkpoint disk state (e.g., Chronus), risking high overheads, or
use data flow analysis (e.g., ConfAid) for error tracing.
Main idea
This paper presents the design, implementation and evaluation of ConfSeer, a system that aims to proactively find misconfigurations on user machines using a knowledge base (KB) of technical solutions.
ConfSeer focuses on addressing parameter-related misconfigurations, as they account for a majority of user configuration errors.
The key idea behind ConfSeer: to enable configuration-diagnosis-as-a-service by automatically matching configuration problems to their solutions described in free-form text.
Main idea
First, ConfSeer takes snapshots of configuration files from a user machine as input. These are typically uploaded by agents running on these machines.
Second, it extracts the configuration parameter names and value settings from the snapshots and matches them against a large set of KB articles, which are published and actively maintained by many vendors.
Third, after a match is found, ConfSeer automatically pinpoints the configuration error with its matching KB article, so users can apply the suggested fix.
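The matching step can be caricatured as follows (the article ids, parameter names, and predicates are all invented; ConfSeer's real matching over free-form KB text is far richer, using IR and NLP techniques):

```python
# Toy knowledge base: each "article" names a configuration parameter and a
# predicate describing the misconfigured setting it discusses.
KB = [
    ("KB-101", "MaxConnections", lambda v: int(v) < 10),
    ("KB-205", "EnableTLS",      lambda v: v.lower() == "false"),
]

def find_misconfigurations(snapshot):
    """snapshot: dict of parameter name -> value string, as extracted from
    an uploaded configuration file. Returns ids of matching KB articles."""
    hits = []
    for article_id, param, is_bad in KB:
        if param in snapshot and is_bad(snapshot[param]):
            hits.append(article_id)
    return hits
```

Each hit pinpoints a configuration error together with the article describing its fix, which is the service ConfSeer exposes.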
Conclusion
ConfSeer is the first approach that combines traditional IR and NLP techniques (e.g., indexing, synonyms)
with new domain-specific techniques (e.g., constraint evaluation, synonym expansion with named-entity resolution)
to build an end-to-end practical system to detect misconfigurations. It is part of a larger system-building effort to automatically detect software errors and misconfigurations by leveraging a broad range of data sources such as knowledge bases, technical help articles, and
question-and-answer forums, which contain valuable yet unstructured information to perform diagnosis.
Live Programming in the LogicBlox System: A MetaLogiQL Approach
Author: Dan Olteanu (LogicBlox, Inc.), [email protected]
Presented by: Zohreh Raghebi
Introduction
An increasing number of self-service enterprise applications require live programming in the database:
the traditional edit-compile-run cycle is abandoned in favor of a more interactive user experience with live feedback on a program's runtime behavior.
In retail-planning spreadsheets backed by scalable full-fledged database systems:
users can define and change schemas of pivot tables and formulas over these schemas on the fly
These changes trigger updates to the application code on the database server
the challenge is to quickly update the user spreadsheets in response to these changes
Main idea
To achieve interactive response times in real world
changes to application code must be quickly compiled and hot-swapped into the running program
the effects of those changes must be efficiently computed in an incremental fashion
In this paper, we discuss the technical challenges in supporting live programming in the database.
Main idea
The workhorse architectural component is a “meta-engine” that:
Incrementally maintains metadata representing application code
guides its compilation into an internal representation in the database kernel
orchestrates maintenance of materialized results of the application code based on those changes
In contrast, the engine proper works on application data and can incrementally maintain materialized results in the face of data updates.
The meta-engine instructs the engine which materialized results need to be (partially or completely) recomputed.
Without the meta-engine, the engine would unnecessarily recompute all materialized results from scratch
every time the application code changes,
rendering the system unusable for live programming.
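The meta-engine's bookkeeping can be sketched as simple dependency tracking (the rule and view names are invented; the real meta-engine maintains this metadata incrementally with MetaLogiQL rules):

```python
# Metadata: which materialized view was compiled from which LogiQL rules.
# Names are hypothetical, for illustration only.
deps = {
    "sales_by_region": {"rule_sales", "rule_regions"},
    "top_products":    {"rule_sales", "rule_products"},
    "store_list":      {"rule_stores"},
}

def views_to_recompute(changed_rules):
    """What a meta-engine supplies: given the edited application code,
    return only the materialized results that depend on it, instead of
    recomputing everything from scratch."""
    return sorted(v for v, rules in deps.items() if rules & changed_rules)
```

Editing `rule_stores` would thus trigger recomputation of `store_list` only, leaving the other views untouched.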
Main idea
Present the meta-engine solution implemented in the LogicBlox commercial system.
LogicBlox offers a unified runtime for the enterprise software stack
LogicBlox applications are written in an extension of Datalog called LogiQL
LogiQL acts as a declarative programming model unifying OLTP, OLAP, and prescriptive and predictive analytics
It offers rich language constructs for expressing derivation rules
the meta-engine uses rules expressed in a Datalog-like language called MetaLogiQL
these operate on metadata representing LogiQL rules
Outside of the database context:
our design may even provide a novel means of building incremental compilers
for general-purpose programming languages
Annotating Database Schemas to Help Enterprise Search
Presented by: Dardan Xhymshiti
Fall 2015
Authors: Eli Cortez, Philip A. Bernstein, Yeye He, Lev Novik (Microsoft Corporation)
Conference: VLDB
Type: Demonstration
Major problem
Data discovery of relevant information in relational databases; the problem of generating reports.
To find relevant information, users have to find the database tables that are relevant to the task, and for each of them understand its content to determine whether it is truly relevant, etc.
The schema’s table and column names are often not very descriptive of the content.
Example: In 412 data columns in 639 tables from 29 databases used by Microsoft’s IT organization, 28% of all columns had very generic names such as:
name, id, description, field, code
A typical corporate database table with generic column names
Such non-descriptive column names make it difficult to search and understand the table.
One solution: using data stewards to enrich the database tables and columns with textual descriptions. Drawbacks: time consuming, and databases that are used less frequently get ignored.
Major Contribution
Barcelos: automatically annotates columns of database tables, even those tables that are not frequently used. How does it work?
It works by mining spreadsheets, many of which are generated by queries. It uses the spreadsheets’ column names as candidate annotations for the corresponding database columns. For the table above, Barcelos produces the annotations:
TeamID and Team for the first column, Delivery Team and Team for the second column, Line of Business and Business for the third.
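A toy version of this mining step (the matching rule, overlap threshold, and data are invented for illustration; Barcelos ranks candidates with a more principled scoring):

```python
from collections import Counter

def candidate_annotations(db_column_values, spreadsheets):
    """spreadsheets: list of (header, values) columns mined from a corpus.
    A spreadsheet column whose values overlap the database column votes
    for its header as a candidate annotation (toy matching rule)."""
    db_vals = set(db_column_values)
    votes = Counter()
    for header, values in spreadsheets:
        overlap = len(db_vals & set(values)) / max(len(values), 1)
        if overlap >= 0.5:          # illustrative threshold
            votes[header] += 1
    # Candidate annotations, ranked by how many spreadsheets vote for them.
    return [h for h, _ in votes.most_common()]
```

A generically named database column thus inherits the more descriptive headers that analysts gave the same data in their spreadsheets.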
Conclusion
The authors have provided a method for extracting relevant tables from an enterprise database.
A method for identifying and ranking relevant column annotations.
An implementation of Barcelos and an experimental evaluation that shows its efficiency and effectiveness.
Query Optimization in 12c Database In-Memory
Presented by: Dardan Xhymshiti
Fall 2015
Authors: Dinesh Das, Jiaqi Yan, Mohamed Zait, Satyanarayana R Valluri, Nirav Vyas, Ramarajan Krishnamachari, Prashant Gaharwar, Jesse Kamp, Niloy Mukherjee
Conference: VLDB
Type: Industry paper
Problem and Motivation
Database In-Memory (column-oriented) vs. Database On-Disk (row-oriented).
Oracle Database 12c In-Memory: the industry’s first dual-format database (In-Memory & On-Disk).
Problem: optimization of query processing. Optimizations for On-Disk query processing are not efficient for In-Memory query processing.
Motivation: modify the query optimizer to generate execution plans optimized for the specific format – row-major or columnar – that will be scanned during query execution.
Contribution
Various vendors have taken different approaches to generating execution plans for in-memory columnar tables:
Make no change to the query optimizer, expecting that queries on the different data format would perform better anyway.
Use heuristic methods to allow the optimizer to generate different plans.
Limit optimizer enhancements to specific workloads like star queries.
The authors provide a comprehensive optimizer redesign to handle a variety of workloads on databases with varied schemas and different data formats.
Related Work
Column-major tables date back to the 1980s: Sybase IQ.
MonetDB and C-Store around the 2000s.
AIDE: An Automatic User Navigation System for
Interactive Data Exploration
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Publication: VLDB 2015
Type: Demonstration Paper
Introduction Data analysts often engage in data exploration tasks to discover interesting data
patterns, without knowing exactly what they are looking for (exploratory analysis).
Users try to make sense of the underlying data space by navigating through it. The process includes a great deal of experimentation with queries, backtracking on the basis of query results, and revision of results at various points in the process.
When the data size is huge, finding the relevant sub-space and relevant results can take very long.
AIDE
AIDE is an automated data exploration system that:
Steers the user towards interesting data areas based on her relevance feedback on database samples.
Aims to achieve the goal of identifying all database objects that match the user interest with high efficiency.
It relies on a combination of machine learning techniques and sample selection algorithms to provide effective data exploration results as well as high interactive performance over databases of large sizes.
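One round of such feedback-driven steering might look like this sketch (a nearest-centroid model stands in for the decision-tree and SVM classifiers AIDE actually uses; the data and the feedback oracle are invented):

```python
def centroid(points):
    # Component-wise mean of a non-empty list of equal-length tuples.
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist2(a, b):
    # Squared Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def explore(database, oracle, seed_samples):
    """One round of AIDE-style steering (simplified): fit a model to the
    user's relevance feedback on database samples, then predict which
    database objects match the user interest. `oracle` stands in for the
    user labeling each sample as relevant or not."""
    relevant = [s for s in seed_samples if oracle(s)]
    irrelevant = [s for s in seed_samples if not oracle(s)]
    c_rel, c_irr = centroid(relevant), centroid(irrelevant)
    # Retrieve objects closer to the "relevant" region of the data space.
    return [x for x in database if dist2(x, c_rel) < dist2(x, c_irr)]
```

In the real system this loop iterates: the predicted-relevant region steers which samples are shown to the user next.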
Experimental Results
Datasets:
AuctionMark: information on auction items and their bids. 1.77GB.
Sloan Digital Sky Survey: This is a scientific data set generated by digital surveys of stars and galaxies. Large data size and complex schema. 1GB-100GB.
US housing and used cars: available through the DAIDEM Lab
System Implementation:
Java: ML, clustering and classification algorithms, such as SVM, k-means, decision trees
PostgreSQL
Towards Scalable Real-time Analytics: An Architecture for Scale-out of OLxP Workloads
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Publication: VLDB 2015
Type: Industry Paper
Introduction
This paper presents the work on the SAP HANA Scale-out Extension: a novel distributed database architecture designed to support large-scale analytics over real-time data.
High performance OLAP with massive scale-out capabilities.
Concurrently allowing OLTP workloads.
New design of core database components such as query processing, concurrency control, and persistence, using high-throughput low-latency networks and storage devices.
This enables analytics over real-time changing data and allows fine-grained user-specified service level agreements (SLAs) on data freshness.
Motivation/Challenges
There are two fundamental paradigm shifts happening in enterprise data management:
Dramatic increase in the amount of data being produced and persisted by enterprises.
Need for businesses to have analytical access to up-to-date data in order to make critical business decisions.
Enterprises want real-time insights from their data in order to make critical time sensitive business decisions -> the ETL pipelines for offline analytical processing of day, week, or even month old data do not work.
On one hand, a system must provide on-line transaction processing (OLTP) support to have real-time changes to data reflected in queries.
On the other, systems need to scale to very large data sizes and provide on-line analytical processing (OLAP) over these large and changing data sets.
Some System Features
Mixed transactional and analytical workloads.
Ability to take advantage of emerging hardware: High core count processors, SIMD instructions, large processor caches, and large memory
capacities.
Storage class memories and high-bandwidth low-latency network interconnects.
Supporting cloud data storage.
Contributions
Heterogeneous scale-out of OLTP and OLAP workloads.
Decoupling query processing from transaction management.
The ability to improve performance by scheduling snapshots for read-only OLAP transactions according to fine-grained SLAs.
A scalable distributed log providing durability, fault tolerance, and asynchronous update dissemination to compute engines.
Support for different compute engines: e.g., SQL engines, R, Spark, graph, and text.
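The SLA-based snapshot scheduling above can be sketched as follows (a deliberately simplified model; the timestamps, units, and policy are invented for illustration):

```python
def choose_snapshot(snapshots, now, freshness_sla):
    """snapshots: commit timestamps (seconds) of available read-only
    snapshots. Reuse the newest snapshot that still satisfies the query's
    data-freshness SLA; return None to signal that a fresh snapshot must
    be taken before the OLAP transaction can run."""
    ok = [t for t in snapshots if now - t <= freshness_sla]
    return max(ok) if ok else None
```

Queries with loose SLAs can share older snapshots cheaply, while tight SLAs force the system to materialize fresher state.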
Related Work
Mixed OLTP/OLAP: HyPer, ConuxDB.
Scale-out OLTP Systems: Calvin, H-Store.
Shared Log: CORFU, Kafka, BookKeeper
Sharing and Reproducing Database Applications
Presented by: Ashkan Malekloo
Fall 2015
Type: Demonstration paper
Authors: Quan Pham, Severin Thaler, Tanu Malik, Ian Foster, Boris Glavic
VLDB 15
Introduction
Recently, application virtualization (AV) has emerged as a light-weight alternative for sharing and efficient repeatability.
AV approaches: Linux Containers
CDE (Using System Call Interposition to Automatically Create Portable Software Packages)
Introduction
Generally, application virtualization techniques can also be applied to DB applications
These techniques treat a database system as a black-box application process
Oblivious to the query statements or database model supported by the database system.
Light-weight Database Virtualization (LDV)
Tool for creating packages of DB applications.
LDV package encapsulates: Application
Relevant dependencies
Relevant data
LDV relies on data provenance.
Contribution
Its ability to create self-contained packages of a DB application that can be shared and run on different machine configurations without the need to install a database system and set up a database.
Extracting a slice of the database accessed by an application
How LDV's execution traces can be used to understand how the files, processes, SQL operations, and database content of an application are related to each other.
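A heavily simplified model of such database slicing (SELECT-only, single-table predicates, with sqlite3 standing in for the application's database; LDV itself derives the slice from execution traces and provenance rather than from a query list):

```python
import sqlite3

def package_slice(src_conn, queries):
    """Build a self-contained 'slice' database holding only the rows the
    application's queries actually touch. queries: list of
    (table, where_clause) pairs observed during the application run."""
    pkg = sqlite3.connect(":memory:")
    for table, where in queries:
        # Recreate the table schema (untyped columns, for simplicity).
        cols = [r[1] for r in src_conn.execute(f"PRAGMA table_info({table})")]
        pkg.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
        # Copy only the rows the application accessed.
        rows = src_conn.execute(f"SELECT * FROM {table} WHERE {where}").fetchall()
        pkg.executemany(
            f"INSERT INTO {table} VALUES ({', '.join('?' * len(cols))})", rows)
    return pkg
```

The resulting package database contains just the slice needed to re-run the application, which is what makes LDV packages light-weight compared to shipping the full database.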