TRANSCRIPT
Join Size Estimation Subject to Filter Conditions
Type: Research Paper
Authors: David Vengerov, Andre Cavalheiro Menck, Mohamed Zait, Sunil P. Chakkappen (Oracle)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Motivation
Query cost! Estimating the size of the resultant table if a join is performed on two or more tables, while accommodating the filter predicates.
Related Work
Related work of this paper focuses on:
Sampling techniques such as Bifocal Sampling and End-Biased Sampling
Sketch-based methods for join size estimation
Contribution
Correlated Sampling: a novel sketch-based join size estimation technique.
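The idea behind correlated sampling can be illustrated with a small sketch (my own simplified reconstruction, not the authors' code): both tables keep only tuples whose join key hashes below a threshold p, so matching keys survive together in both samples, and the size of the joined samples is scaled up by 1/p.

```python
import hashlib
from collections import Counter

def h(key):
    # Deterministic hash of the join key to a number in [0, 1).
    d = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def correlated_sample(rows, keyfn, p):
    # Keep a row iff its join key hashes below p. Because both tables
    # use the same hash function, matching keys are sampled together.
    return [r for r in rows if h(keyfn(r)) < p]

def estimate_join_size(sample_r, sample_s, keyfn_r, keyfn_s, p):
    # Join the two samples and scale: each join key survives into
    # both samples with probability p.
    counts = Counter(keyfn_s(s) for s in sample_s)
    joined = sum(counts[keyfn_r(r)] for r in sample_r)
    return joined / p
```

With p = 1.0 the estimate degenerates to the exact join size; smaller p trades accuracy for sample size.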
Reformulation based query answering in RDF: alternatives and performance
Type: Demonstration Paper
Authors: Damian Bursztyn (University of Paris-Sud), François Goasdoué (University of Rennes), Ioana Manolescu (University of Paris-Sud)
Presented by: Siddhant Kulkarni
Term: Fall 2015
Introduction
RDF = Resource Description Framework You can query this data!
Processing queries and displaying results is called query answering!
Motivation
Two ways of query answering: Saturation based (SAT) and Reformulation based (REF) – PROBLEM!
Showcase a large set of REF techniques (along with one the authors presented in another paper).
Contribution
What the demo system allows you to do:
Pick an RDF graph and visualize its statistics
Choose a query and an answering method
Observe evaluation in real time
Modify the RDF data and reevaluate
Error Diagnosis and Data Profiling with Data X-Ray
Presented by: Omar Alqahtani
Fall 2015
Authors: Xiaolan Wang, Mary Feng, Yue Wang, Xin Luna Dong, Alexandra Meliou (University of Massachusetts, University of Iowa, Google Inc.)
Motivation
Retrieving high quality datasets from voluminous and diverse sources is crucial for many data-intensive applications.
However, the retrieved datasets often contain noise and other discrepancies.
Traditional data cleaning tools mostly try to answer: “Which data is incorrect?”
Contribution
Demonstration of DATAXRAY, a general-purpose, highly-scalable tool.
It explains why and how errors happen in a data generative process
It answers: Why are there errors in the data? or
How can I prevent further errors?
It finds groupings of errors that may be due to the same cause. How? It identifies these groups based on their common characteristics (features).
The Core of Data X-Ray
Features are organized in a hierarchical structure based on simple containment relationships.
DATAXRAY uses a top-down algorithm to explore the feature hierarchy, to identify the set of features that best summarize all erroneous data elements.
It uses a cost function based on Bayesian analysis to derive the set of features with the highest probability of being associated with the causes of the mistakes in a dataset.
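A much-simplified sketch of such a top-down exploration (the cost function here is a flat per-feature penalty plus the clean items a feature sweeps in, standing in for the paper's Bayesian cost; the feature paths and data are invented for illustration):

```python
from collections import defaultdict

ALPHA = 1.0  # fixed cost for reporting a feature (illustrative stand-in
             # for Data X-Ray's Bayesian cost function)

def diagnose(items, depth=0, prefix=()):
    """items: list of (feature_path, is_error), all paths same length.
    Returns (feature prefixes summarizing the errors, total cost)."""
    errors = [i for i in items if i[1]]
    if not errors:
        return [], 0.0                    # nothing to explain here
    clean = len(items) - len(errors)
    claim_cost = ALPHA + clean            # pay for the feature plus every
                                          # clean item it wrongly covers
    if depth == len(items[0][0]):         # leaf: no children to try
        return [prefix], claim_cost
    # Partition by the next path component and recurse top-down.
    groups = defaultdict(list)
    for path, err in items:
        groups[path[depth]].append((path, err))
    child_feats, child_cost = [], 0.0
    for part, group in groups.items():
        f, c = diagnose(group, depth + 1, prefix + (part,))
        child_feats += f
        child_cost += c
    # Keep whichever summary is cheaper: this node or its children.
    if claim_cost <= child_cost:
        return [prefix], claim_cost
    return child_feats, child_cost
```

If every error comes from one source, the algorithm reports that single source feature instead of listing each erroneous element.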
WADaR: Joint Wrapper and Data Repair
Authors: Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche
Department of Computer Science, Oxford University, United Kingdom; Dipartimento di Matematica, Informatica ed Economia, Università della Basilicata, Italy
[email protected]
Paper Type: Demo
Presented by:
Ranjan_KY
Fall 2015
Motivation
Web scraping (or wrapping) is a popular means for acquiring data from the web.
The current generation of tools has made scalable wrapper generation possible and enabled data acquisition processes involving thousands of sources.
However, no scalable tools exist that support these tasks.
Problem
Modern wrapper-generation systems leverage a number of features, ranging from HTML and visual structures to knowledge bases and micro-data.
Nevertheless, automatically-generated wrappers often suffer from errors resulting in under/over segmented data, together with missing or spurious content.
Under and over segmentation of attributes are commonly caused by irregular HTML markups or by multiple attributes occurring within the same DOM node.
Incorrect column types are instead associated with the lack of domain knowledge, supervision, or micro-data during wrapper generation.
The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper, so that future wrapper executions can produce cleaner data.
Demonstration
WADaR takes as input a (possibly incorrect) wrapper and a target relation schema, and iteratively repairs both the generated relations and the wrapper by observing the output of the wrapper execution.
A key observation is that errors in the extracted relations are likely to be systematic as wrappers are often generated from templated websites.
WADaR’s repair process:
(i) Annotating the extracted relations with standard entity recognizers,
(ii) Computing Markov chains describing the most likely segmentation of attribute values in the records, and
(iii) Inducing regular expressions which re-segment the input relation according to the given target schema and that can possibly be encoded back into the wrapper.
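Step (iii) can be illustrated on a toy under-segmented relation (the records and the regular expression are invented; WADaR induces such expressions automatically from the annotations and segmentations of steps (i)-(ii)):

```python
import re

# Hypothetical under-segmented records: three attributes collapsed into a
# single extracted value by a faulty wrapper. Because wrappers run over
# templated websites, the error is systematic across records.
records = [
    "Margherita Pizza 12.50 GBP",
    "Penne Arrabbiata 9.00 EUR",
    "Caesar Salad 7.25 USD",
]

# An induced regular expression (written by hand for this sketch) matching
# the target schema: free text, then a PRICE, then a CURRENCY code.
SEGMENT = re.compile(
    r"^(?P<name>.+?)\s+(?P<price>\d+\.\d{2})\s+(?P<currency>[A-Z]{3})$")

def resegment(value):
    # Re-segment one record into the target schema, or None on no match.
    m = SEGMENT.match(value)
    return (m.group("name"), m.group("price"), m.group("currency")) if m else None
```

Because the same expression repairs every record, it can also be encoded back into the wrapper so future extractions are clean.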
Related work
In this paper, related work was not evaluated in detail
[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of partially overlapping web sources. PVLDB, 6(10):805–816, 2013.
[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB, 6(13):1486–1497, 2013.
[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. Tegra: Table extraction by global record alignment. In SIGMOD, pages 1713–1728. ACM, 2015.
ConfSeer: Leveraging Customer Support Knowledge Bases for Automated Misconfiguration Detection
Authors: Rahul Potharaju, Joseph Chan, Luhui Hu, Mingshi Wang, Liyuan Zhang (Microsoft), Cristina Nita-Rotaru (Purdue University), Navendu Jain (Microsoft Research)
Presented by: Zohreh Raghebi
Introduction
Configuration errors have a significant impact on system performance and availability.
For instance, a misconfiguration in the user-authentication system caused login problems for several Google services including Gmail and Drive
A software misconfiguration in Windows Azure caused a 2.5 hour outage in 2012
Many configuration errors are due to faulty patches, e.g., changed file paths causing incompatibility with other applications, empty fields in the Registry, or failed uninstallations.
Introduction
Unfortunately, troubleshooting misconfigurations is time consuming, hard and expensive.
First, today’s software configurations are becoming increasingly complex and large comprising hundreds of parameters and their settings
Second, many of these errors manifest as silent failures leaving users clueless:
They either search online or contact customer service and support (CSS)
loss of productivity, time and effort
Related work
Several research efforts have proposed techniques to identify, diagnose and fix configuration problems.
Some commercial tools are also available to manage system configurations or to automate certain configuration tasks
Many of these approaches either assume the presence of a large set of configurations to apply statistical testing (e.g., PeerPressure),
periodically checkpoint disk state (e.g., Chronus), risking high overheads, or
use data flow analysis (e.g., ConfAid) for error tracing.
Main idea
This paper presents the design, implementation and evaluation of ConfSeer, a system that aims to proactively find misconfigurations on user machines using a knowledge base (KB) of technical solutions.
ConfSeer focuses on addressing parameter-related misconfigurations, as they account for a majority of user configuration errors.
The key idea behind ConfSeer: to enable configuration-diagnosis-as-a-service by automatically matching configuration problems to their solutions described in free-form text.
Main idea
First, ConfSeer takes snapshots of configuration files from a user machine as input. These are typically uploaded by agents running on these machines.
Second, it extracts the configuration parameter names and value settings from the snapshots and matches them against a large set of KB articles, which are published and actively maintained by many vendors.
Third, after a match is found, ConfSeer automatically pinpoints the configuration error with its matching KB article, so users can apply the suggested fix.
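The matching step can be caricatured as follows (the article ids, parameter names, and predicates are all invented; ConfSeer's real matching over free-form KB text is far richer, using IR and NLP techniques):

```python
# Toy knowledge base: each "article" names a configuration parameter and a
# predicate describing the misconfigured setting it discusses.
KB = [
    ("KB-101", "MaxConnections", lambda v: int(v) < 10),
    ("KB-205", "EnableTLS",      lambda v: v.lower() == "false"),
]

def find_misconfigurations(snapshot):
    """snapshot: dict of parameter name -> value string, as extracted from
    an uploaded configuration file. Returns ids of matching KB articles."""
    hits = []
    for article_id, param, is_bad in KB:
        if param in snapshot and is_bad(snapshot[param]):
            hits.append(article_id)
    return hits
```

Each hit pinpoints a configuration error together with the article describing its fix, which is the service ConfSeer exposes.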
Conclusion
ConfSeer is the first approach that combines traditional IR and NLP techniques (e.g., indexing, synonyms)
with new domain-specific techniques (e.g., constraint evaluation, synonym expansion with named-entity resolution)
to build an end-to-end practical system to detect misconfigurations. It is part of a larger system-building effort to automatically detect software errors and misconfigurations by leveraging a broad range of data sources such as knowledge bases, technical help articles, and
question-and-answer forums, which contain valuable yet unstructured information to perform diagnosis.
Live Programming in the LogicBlox System: A MetaLogiQL Approach
Author: Dan Olteanu (LogicBlox, Inc.), [email protected]
Presented by: Zohreh Raghebi
Introduction
An increasing number of self-service enterprise applications require live programming in the database:
the traditional edit-compile-run cycle is abandoned in favor of a more interactive user experience with live feedback on a program's runtime behavior.
In retail-planning spreadsheets backed by scalable full-fledged database systems:
users can define and change schemas of pivot tables and formulas over these schemas on the fly
These changes trigger updates to the application code on the database server
the challenge is to quickly update the user spreadsheets in response to these changes
Main idea
To achieve interactive response times in real world
changes to application code must be quickly compiled and hot-swapped into the running program
the effects of those changes must be efficiently computed in an incremental fashion
In this paper, we discuss the technical challenges in supporting live programming in the database.
Main idea
The workhorse architectural component is a “meta-engine” that:
Incrementally maintains metadata representing application code
guides its compilation into an internal representation in the database kernel
orchestrates maintenance of materialized results of the application code based on those changes
In contrast, the engine proper works on application data and can incrementally maintain materialized results in the face of data updates.
The meta-engine instructs the engine which materialized results need to be (partially or completely) recomputed.
Without the meta-engine, the engine would unnecessarily recompute all materialized results from scratch
every time the application code changes,
rendering the system unusable for live programming.
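The meta-engine's bookkeeping can be sketched as simple dependency tracking (the rule and view names are invented; the real meta-engine maintains this metadata incrementally with MetaLogiQL rules):

```python
# Metadata: which materialized view was compiled from which LogiQL rules.
# Names are hypothetical, for illustration only.
deps = {
    "sales_by_region": {"rule_sales", "rule_regions"},
    "top_products":    {"rule_sales", "rule_products"},
    "store_list":      {"rule_stores"},
}

def views_to_recompute(changed_rules):
    """What a meta-engine supplies: given the edited application code,
    return only the materialized results that depend on it, instead of
    recomputing everything from scratch."""
    return sorted(v for v, rules in deps.items() if rules & changed_rules)
```

Editing `rule_stores` would thus trigger recomputation of `store_list` only, leaving the other views untouched.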
Main idea
Present the meta-engine solution implemented in the LogicBlox commercial system.
LogicBlox offers a unified runtime for the enterprise software stack
LogicBlox applications are written in an extension of Datalog called LogiQL
LogiQL acts as a declarative programming model unifying OLTP, OLAP, and prescriptive and predictive analytics
It offers rich language constructs for expressing derivation rules
the meta-engine uses rules expressed in a Datalog-like language called MetaLogiQL
these operate on metadata representing LogiQL rules
Outside of the database context:
our design may even provide a novel means of building incremental compilers
for general-purpose programming languages
Annotating Database Schemas to Help Enterprise Search
Presented by: Dardan Xhymshiti
Fall 2015
Authors: Eli Cortez, Philip A. Bernstein, Yeye He, Lev Novik (Microsoft Corporation)
Conference: VLDB
Type: Demonstration
Major problem
Data discovery of relevant information in relational databases; the problem of generating reports.
To find relevant information, users have to find the database tables that are relevant to the task, and for each of them understand its content to determine whether it is truly relevant, etc.
The schema’s table and column names are often not very descriptive of the content.
Example: In 412 data columns in 639 tables from 29 databases used by Microsoft’s IT organization, 28% of all columns had very generic names such as:
name, id, description, field, code
A typical corporate database table with generic column names
Such non-descriptive column names make it difficult to search and understand the table.
One solution: using data stewards to enrich the database tables and columns with textual descriptions. Drawbacks: time consuming, and databases that are used less frequently get ignored.
Major Contribution
Barcelos: automatically annotates columns of database tables, even those tables that are not frequently used. How does it work?
It works by mining spreadsheets, many of which are generated by queries. It uses the spreadsheets’ column names as candidate annotations for the corresponding database columns. For the table above, Barcelos produces the annotations:
TeamID and Team for the first column, Delivery Team and Team for the second column, Line of Business and Business for the third.
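A toy version of this mining step (the matching rule, overlap threshold, and data are invented for illustration; Barcelos ranks candidates with a more principled scoring):

```python
from collections import Counter

def candidate_annotations(db_column_values, spreadsheets):
    """spreadsheets: list of (header, values) columns mined from a corpus.
    A spreadsheet column whose values overlap the database column votes
    for its header as a candidate annotation (toy matching rule)."""
    db_vals = set(db_column_values)
    votes = Counter()
    for header, values in spreadsheets:
        overlap = len(db_vals & set(values)) / max(len(values), 1)
        if overlap >= 0.5:          # illustrative threshold
            votes[header] += 1
    # Candidate annotations, ranked by how many spreadsheets vote for them.
    return [h for h, _ in votes.most_common()]
```

A generically named database column thus inherits the more descriptive headers that analysts gave the same data in their spreadsheets.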
Conclusion
The authors have provided a method for extracting relevant tables from an enterprise database.
A method for identifying and ranking relevant column annotations.
An implementation of Barcelos and an experimental evaluation that shows its efficiency and effectiveness.
Query Optimization in 12c Database In-Memory
Presented by: Dardan Xhymshiti
Fall 2015
Authors: Dinesh Das, Jiaqi Yan, Mohamed Zait, Satyanarayana R Valluri, Nirav Vyas, Ramarajan Krishnamachari, Prashant Gaharwar, Jesse Kamp, Niloy Mukherjee
Conference: VLDB
Type: Industry paper
Problem and Motivation
Database In-Memory (column-oriented) vs. Database On-Disk (row-oriented).
Oracle Database 12c In-Memory: the industry’s first dual-format database (In-Memory & On-Disk).
Problem: optimization of query processing. Optimizations for On-Disk query processing are not efficient for In-Memory query processing.
Motivation: modify the query optimizer to generate execution plans optimized for the specific format – row-major or columnar – that will be scanned during query execution.
Contribution
Various vendors have taken different approaches to generating execution plans for in-memory columnar tables:
Make no change to the query optimizer, expecting that queries on the different data format would perform better anyway.
Use heuristic methods to allow the optimizer to generate different plans.
Limit optimizer enhancements to specific workloads like star queries.
The authors provide a comprehensive optimizer redesign to handle a variety of workloads on databases with varied schemas and different data formats.
Related Work
Column-major tables date back to the 1980s: Sybase IQ.
MonetDB and C-Store around the 2000s.
AIDE: An Automatic User Navigation System for
Interactive Data Exploration
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Publication: VLDB 2015
Type: Demonstration Paper
Introduction Data analysts often engage in data exploration tasks to discover interesting data
patterns, without knowing exactly what they are looking for (exploratory analysis).
Users try to make sense of the underlying data space by navigating through it. The process includes a great deal of experimentation with queries, backtracking on the basis of query results, and revision of results at various points in the process.
When the data size is huge, finding the relevant sub-space and relevant results can take very long.
AIDE
AIDE is an automated data exploration system that:
Steers the user towards interesting data areas based on her relevance feedback on database samples.
Aims to achieve the goal of identifying all database objects that match the user interest with high efficiency.
It relies on a combination of machine learning techniques and sample selection algorithms to provide effective data exploration results as well as high interactive performance over databases of large sizes.
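One round of such feedback-driven steering might look like this sketch (a nearest-centroid model stands in for the decision-tree and SVM classifiers AIDE actually uses; the data and the feedback oracle are invented):

```python
def centroid(points):
    # Component-wise mean of a non-empty list of equal-length tuples.
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist2(a, b):
    # Squared Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def explore(database, oracle, seed_samples):
    """One round of AIDE-style steering (simplified): fit a model to the
    user's relevance feedback on database samples, then predict which
    database objects match the user interest. `oracle` stands in for the
    user labeling each sample as relevant or not."""
    relevant = [s for s in seed_samples if oracle(s)]
    irrelevant = [s for s in seed_samples if not oracle(s)]
    c_rel, c_irr = centroid(relevant), centroid(irrelevant)
    # Retrieve objects closer to the "relevant" region of the data space.
    return [x for x in database if dist2(x, c_rel) < dist2(x, c_irr)]
```

In the real system this loop iterates: the predicted-relevant region steers which samples are shown to the user next.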
Experimental Results
Datasets:
AuctionMark: information on auction items and their bids. 1.77GB.
Sloan Digital Sky Survey: This is a scientific data set generated by digital surveys of stars and galaxies. Large data size and complex schema. 1GB-100GB.
US housing and used cars: available through the DAIDEM Lab
System Implementation:
Java: ML, clustering and classification algorithms, such as SVM, k-means, decision trees
PostgreSQL
Towards Scalable Real-time Analytics: An Architecture for Scale-out of OLxP Workloads
Presented by: Shahab Helmi
Fall 2015
Paper Info
Authors:
Publication: VLDB 2015
Type: Industry Paper
Introduction
This paper presents the work on the SAP HANA Scale-out Extension: a novel distributed database architecture designed to support large-scale analytics over real-time data.
High performance OLAP with massive scale-out capabilities.
Concurrently allowing OLTP workloads.
New design of core database components such as query processing, concurrency control, and persistence, using high-throughput low-latency networks and storage devices.
This enables analytics over real-time changing data and allows fine-grained user-specified service level agreements (SLAs) on data freshness.
Motivation/Challenges
There are two fundamental paradigm shifts happening in enterprise data management:
Dramatic increase in the amount of data being produced and persisted by enterprises.
Need for businesses to have analytical access to up-to-date data in order to make critical business decisions.
Enterprises want real-time insights from their data in order to make critical time sensitive business decisions -> the ETL pipelines for offline analytical processing of day, week, or even month old data do not work.
On one hand, a system must provide on-line transaction processing (OLTP) support to have real-time changes to data reflected in queries.
On the other, systems need to scale to very large data sizes and provide on-line analytical processing (OLAP) over these large and changing data sets.
Some System Features
Mixed transactional and analytical workloads.
Ability to take advantage of emerging hardware: High core count processors, SIMD instructions, large processor caches, and large memory
capacities.
Storage class memories and high-bandwidth low-latency network interconnects.
Supporting cloud data storage.
Contributions
Heterogeneous scale-out of OLTP and OLAP workloads.
Decoupling query processing from transaction management.
The ability to improve performance by scheduling snapshots for read-only OLAP transactions according to fine-grained SLAs.
A scalable distributed log providing durability, fault tolerance, and asynchronous update dissemination to compute engines.
Support for different compute engines: e.g., SQL engines, R, Spark, graph, and text.
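The SLA-based snapshot scheduling above can be sketched as follows (a deliberately simplified model; the timestamps, units, and policy are invented for illustration):

```python
def choose_snapshot(snapshots, now, freshness_sla):
    """snapshots: commit timestamps (seconds) of available read-only
    snapshots. Reuse the newest snapshot that still satisfies the query's
    data-freshness SLA; return None to signal that a fresh snapshot must
    be taken before the OLAP transaction can run."""
    ok = [t for t in snapshots if now - t <= freshness_sla]
    return max(ok) if ok else None
```

Queries with loose SLAs can share older snapshots cheaply, while tight SLAs force the system to materialize fresher state.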
Related Work
Mixed OLTP/OLAP: HyPer, ConuxDB.
Scale-out OLTP Systems: Calvin, H-Store.
Shared Log: CORFU, Kafka, BookKeeper
Sharing and Reproducing Database Applications
Presented by: Ashkan Malekloo
Fall 2015
Type: Demonstration paper
Authors: Quan Pham, Severin Thaler, Tanu Malik, Ian Foster, Boris Glavic
VLDB 15
Introduction
Recently, application virtualization (AV) has emerged as a light-weight alternative for sharing and efficient repeatability.
AV approaches: Linux Containers
CDE (Using System Call Interposition to Automatically Create Portable Software Packages)
Introduction
Generally, application virtualization techniques can also be applied to DB applications
These techniques treat a database system as a black-box application process
Oblivious to the query statements or database model supported by the database system.
Light-weight Database Virtualization (LDV)
Tool for creating packages of DB applications.
LDV package encapsulates: Application
Relevant dependencies
Relevant data
LDV relies on data provenance.
Contribution
Its ability to create self-contained packages of a DB application that can be shared and run on different machine configurations without the need to install a database system and set up a database.
Extracting a slice of the database accessed by an application
How LDV's execution traces can be used to understand how the files, processes, SQL operations, and database content of an application are related to each other.
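A heavily simplified model of such database slicing (SELECT-only, single-table predicates, with sqlite3 standing in for the application's database; LDV itself derives the slice from execution traces and provenance rather than from a query list):

```python
import sqlite3

def package_slice(src_conn, queries):
    """Build a self-contained 'slice' database holding only the rows the
    application's queries actually touch. queries: list of
    (table, where_clause) pairs observed during the application run."""
    pkg = sqlite3.connect(":memory:")
    for table, where in queries:
        # Recreate the table schema (untyped columns, for simplicity).
        cols = [r[1] for r in src_conn.execute(f"PRAGMA table_info({table})")]
        pkg.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
        # Copy only the rows the application accessed.
        rows = src_conn.execute(f"SELECT * FROM {table} WHERE {where}").fetchall()
        pkg.executemany(
            f"INSERT INTO {table} VALUES ({', '.join('?' * len(cols))})", rows)
    return pkg
```

The resulting package database contains just the slice needed to re-run the application, which is what makes LDV packages light-weight compared to shipping the full database.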