presented by: zohreh fall 2015. nandish jayaram university of texas at arlington sidharth goyal ...

VIIQ: Auto-Suggestion Enabled Visual Interface for Interactive Graph Query Formulation

Presented by: Zohreh

Fall 2015

Authors

Nandish Jayaram

University of Texas at Arlington

Sidharth Goyal


Chengkai Li


Introduction

An unprecedented proliferation of heterogeneous graphs

with thousands of node/edge types

Complex relationships in schema-less data

Query graphs are used to:

specify the query intent for such graphs

Formulating these query graphs is a daunting task

It requires users to know a vocabulary comprising many labels and types

Introduction

Graph query systems allow users to construct query graphs

through a visual interface

The focus of these systems is query processing

Their query formulation components are limited to a graphical platform

To add nodes and edges with ease using mouse and keyboard actions

Little help is offered for choosing the labels

of the various components in a query graph

Introduction

Every time a new query component is added

Users are inundated with possibly hundreds of options

For the new component’s label, sorted alphabetically.

It is a daunting task to browse through all the options

To select the appropriate label to add

Related works

There are other querying paradigms that help users query graph data

Declarative languages like SPARQL are used to exactly specify query intent

But present a usability barrier

Other paradigms simplify query formulation

Keyword search, approximate graph query and query-by-example

But they cannot be used to specify users’ exact query intent

Existing systems help users specify queries either easily or exactly,

But not both

VIIQ (Visual Interface for Interactive graph Query formulation)

helps users easily construct various query graph components

VIIQ automatically suggests new edges and nodes to add

To a partially constructed query graph

Users can also add nodes or edges manually,

Whose candidate labels are ranked

based on how likely they are to be of interest to the user

VIIQ is the first visual query formulation system

that actively makes ranked suggestions

Contribution

VIIQ supports two modes of operation, passive and active

By default VIIQ operates in passive mode

Based on the partially constructed query graph

the system automatically recommends top-k new edges

relevant to the user’s query intent

Fig. 3 shows the snapshot of a partially constructed query graph, with nodes and edges suggested in passive mode.

The nodes in grey and the edges incident on them are the new automatic suggestions made by the system.

Contribution

The active mode is triggered

when the user adds new nodes or edges to the partial query graph

For a newly added node, the suggested labels are displayed hierarchically

In a pop-up box

For a newly added edge, the suggested edge labels are ranked

based on the likelihood of their relevance to the user’s query intent

On Uncertain Graphs Modeling and Queries

Presented by: Zohreh

Fall 2015

Authors

Arijit Khan (Systems Group, ETH Zurich, Switzerland)

Lei Chen (The Hong Kong University of Science and Technology)

Introduction

Availability of network data has increased dramatically

Uncertainty is evident in graph data due to a variety of reasons

such as noisy measurements, inconsistent, incorrect, and possibly ambiguous information sources

In these cases, data is represented as an uncertain graph

Nodes, edges, and attributes are accompanied with a probability of existence

MODELING OF UNCERTAIN GRAPHS

Uncertainty Models

Independent Probabilities:

Components in the graph are independent from one another

Uncertain graphs are interpreted according to the well-known possible-world semantics

For example, an uncertain graph with m edges yields 2^m possible deterministic graphs
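The possible-world semantics can be illustrated with a minimal sketch. The three-edge graph and its probabilities are made-up values, and `possible_worlds` is a hypothetical helper name; the only assumption taken from the slide is that edges exist independently of one another.

```python
from itertools import product

# Hypothetical uncertain graph: each edge maps to its existence probability.
edges = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}


def possible_worlds(edges):
    """Yield (edge_set, probability) for every deterministic graph.

    Assumes edges exist independently, so the probability of a world is the
    product of p for present edges and (1 - p) for absent ones.
    """
    items = list(edges.items())
    for mask in product([False, True], repeat=len(items)):
        prob = 1.0
        world = set()
        for (edge, p), present in zip(items, mask):
            prob *= p if present else 1.0 - p
            if present:
                world.add(edge)
        yield frozenset(world), prob


worlds = list(possible_worlds(edges))
print(len(worlds))                # m = 3 edges give 2 ** 3 worlds
print(sum(p for _, p in worlds))  # world probabilities sum to 1 (up to rounding)
```

Enumerating all worlds is only feasible for very small m, which is why the exponential blow-up matters for query processing.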

Correlated Probabilities

The independent model ignores the correlations among various graph components

For example, in a traffic network:

If a road is crowded at a certain point of time, nearby roads are likely to be crowded as well

There are a few works that model such correlations with conditional probabilities

Challenges: semantics and computation

From the perspective of the semantics:

There is no uniform model of uncertain graphs;

The assignment and interpretation of the probabilities are

application specific.

How should the shortest path between two nodes in an uncertain graph be defined?

The definition could depend on the application and the specific uncertainty semantics

Challenges : computation perspective

While many graph algorithms such as subgraph isomorphism are intrinsically hard problems,

Even the simplest graph algorithms such as reachability and shortest path queries become #P-complete;

More expensive over uncertain graphs

Therefore, exact computation is almost infeasible

with today’s large-scale graph data

The focus nowadays is on designing approximation algorithms

With efficient sampling, indexing, and filtering strategies
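As a rough illustration of the sampling idea, the sketch below estimates the probability that one node reaches another by drawing random possible worlds. It is a plain Monte Carlo estimator under the independent-edge assumption, not an algorithm from the paper; the graph, the function names, and the sample count are all hypothetical.

```python
import random


def reachable(adj, source, target):
    """Depth-first search in one sampled deterministic graph."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def estimate_reachability(edges, source, target, samples=10_000, seed=0):
    """Fraction of sampled worlds in which target is reachable from source."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        adj = {}
        for (u, v), p in edges.items():
            if rng.random() < p:  # keep each directed edge independently
                adj.setdefault(u, []).append(v)
        hits += reachable(adj, source, target)
    return hits / samples


# Hypothetical directed uncertain graph.
edges = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
estimate = estimate_reachability(edges, "a", "c")
print(estimate)
```

For this toy graph the exact answer is 1 - (1 - 0.2)(1 - 0.9 * 0.5) = 0.56, so the estimate can be checked by hand. On large graphs, no such closed form is practical and the number of samples controls the accuracy.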

MAJOR OPEN PROBLEMS

Since exact computation is infeasible

over large-scale uncertain graphs,

it is important to identify the application areas and the trade-offs that matter there

e.g., efficiency vs. effectiveness

The semantics of many classical graph operations also need to be re-defined

e.g., centrality measures and graph partitioning

A Demonstration of TripleProv: Tracking and Querying Provenance over Web Data

Presented by: Ashkan Malekloo

Fall 2015

Sharing and Reproducing Database Applications

Type: Demonstration paper

Authors: Marcin Wylot, Philippe Cudré-Mauroux, Paul Groth

VLDB 15

Introduction

Heterogeneity of RDF data

Ease of integration

Examples:

one may want to analyze which sources were instrumental in providing results

How data sources were combined

Filtering the result

Introduction

Find me all the titles of articles about “Obama” but derive the answer only from sources attributed to “US News”.

Introduction

No current triple store is able to automatically derive provenance data for the results it produces or to tailor queries with provenance data.

Storing Quadruples

Named Graphs
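To make the quadruple idea concrete, here is a toy sketch of the "Obama" / "US News" query over triples extended with a source field. It is plain Python over an in-memory list, not TripleProv's storage model or query language. The `Quad` type, the `titles_about` helper, and all data values are hypothetical.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # provenance: the source (named graph) the triple came from


# Hypothetical data.
quads = [
    Quad("article1", "title", "Obama visits Berlin", "US News"),
    Quad("article1", "about", "Obama", "US News"),
    Quad("article2", "title", "Obama on health care", "Other Source"),
    Quad("article2", "about", "Obama", "Other Source"),
    Quad("article3", "title", "Weather report", "US News"),
    Quad("article3", "about", "Weather", "US News"),
]


def titles_about(quads, topic, allowed_source):
    """Titles of articles about `topic`, derived only from `allowed_source`."""
    usable = [q for q in quads if q.source == allowed_source]
    subjects = {
        q.subject for q in usable if q.predicate == "about" and q.obj == topic
    }
    return sorted(
        q.obj for q in usable if q.predicate == "title" and q.subject in subjects
    )


print(titles_about(quads, "Obama", "US News"))
```

The provenance filter is applied before the pattern matching, so triples from other sources cannot contribute to the answer.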

TripleProv

A new RDF database system supporting the transparent and automatic derivation of detailed provenance information for arbitrary queries and the execution of queries incorporating provenance predicates

It is based on a native RDF store

It enables tracing provenance at two different granularity levels

SAASFEE: Scalable Scientific Workflow Execution Engine

Marc Bux, Jörgen Brandt, Carsten Lipka,

Kamal Hakimzadeh, Jim Dowling, Ulf Leser

Demonstration Paper

2015 VLDB

Presented by: Omar Alqahtani

Fall 2015

Motivation

Scientific data is analyzed by complex pipelines composed of highly specialized, domain-dependent tools.

SWfMSs facilitate the design, implementation, execution, optimization, monitoring, and exchange of such heterogeneous pipelines.

Existing SWfMSs

Roughly divided into three groups:

Taverna, Kepler, Galaxy

Askalon, Pegasus

YARN, MESOS

No platform is capable of:

Embracing the ever-evolving research tools

Scaling to very large data sets

Executing arbitrarily complex workflows.

SAASFEE

It is a SWfMS which runs arbitrarily complex workflows on Hadoop YARN.

SAASFEE workflows are specified in Cuneiform.

Cuneiform workflows are executed on Hi-WAY.

Capabilities

The ability to execute iterative workflows,

An adaptive task scheduler,

Re-executable provenance traces,

Compatibility with selected other workflow systems

Tutorial: SQL-on-Hadoop Systems

Presented by:

Ranjan

Fall 2015

Why SQL? Data exploration

Structured data

organization of the data in tables

optimized data access

Declarative data processing

No need to have developer skills

Portable – universal language

SQL drivers supported

No need of Hadoop client installation

Easier integration with the current systems

Hadoop overview

These factors complicate the query optimization further in the Hadoop system

First, in the world of Hadoop and HDFS data, complex data types, such as arrays, maps, structs, as well as JSON data are more prevalent.

Second, the users utilize UDFs (user-defined-functions) very widely to express their business logic, which is sometimes very awkward to express in SQL itself.

Third, often times there is little control over HDFS. Files can be added or modified outside the tight control of a query engine, making statistics maintenance a challenge.

Hive

The first SQL-on-Hadoop offering. It provided an SQL-like query language, called HiveQL, and used the MapReduce run-time to execute queries.

Hadapt

Hadapt, which spun out of the HadoopDB research project, was the first commercial SQL-on-Hadoop offering. Hadapt and HadoopDB replaced the file-oriented HDFS storage formats with DBMS-oriented storage, including column-store.

Spark

Spark is a fast, general-purpose cluster computing engine that is compatible with Hadoop data and tries to address the shortcomings of MapReduce. Systems that use Spark as their run-time for SQL processing:

Shark, Hive on Spark, and Spark SQL.

Cloudera Impala

A fully integrated MPP SQL query engine.

Impala reads at almost disk bandwidth and is typically able to saturate all available disks.

IBM Big SQL

It leverages IBM’s state-of-the-art relational database technology to process standard SQL queries over HDFS data, supporting all common Hadoop file formats without introducing any proprietary formats.

Apache Drill

Provides SQL-like declarative processing over self-describing semi-structured data.

Its focus is on analyzing data without imposing a fixed schema or creating tables in a catalog like Hive MetaStore.

Splice Machine

Splice Machine provides SQL support over HBase data using Apache Derby, targeting both operational as well as analytical workloads.

Phoenix

Phoenix provides SQL querying over HBase via an embeddable JDBC driver built for high performance and read/write operations.

Collaborative Data Analytics with DataHub

Authors: Anant Bhardwaj, Amol Deshpande, Aaron J. Elmore, David Karger, Sam Madden, Aditya Parameswaran, Harihar Subramanyam, Eugene Wu, Rebecca Zhang.

Type: Demonstration paper

Presented by: Dardan Xhymshiti

Fall 2015

Major problem

Organizations and companies collect data from various sources like:

Financial transactions,

Server logs,

Sensor data etc.

Teams and individuals inside the company want to extract knowledge from these datasets using their home-grown tools, company tools, and different programming languages. They modify the datasets (normalization, cleaning) and then exchange them back and forth.

Problem: collaborative data analysis. Heterogeneity of tools, diversity in the skill sets of individuals and teams, and difficulties in sorting, retrieving and versioning the exchanged datasets.

Major motivation

The authors motivate their work by providing two examples:

Example 1: Expert analysis:

Members of a web advertising team want to extract knowledge from unstructured ad-click data. They write a script for extracting the task-relevant information from the data, and store it as a separate dataset which will be shared across the team.

Problems:

Different team members may be more comfortable with a particular tool (R, Python, Awk) and use these tools to clean, normalize and summarize the dataset.

More proficient members use multiple languages for different purposes:

• Modeling in R.

• Visualization in JavaScript

• String extraction in Awk etc.

The team members manage the dataset versions by recording them in file names: table_v1, table_v1.1 ….

Versioning is difficult to manage when there are a hundred dataset versions. The final result…:

Example 2: Novice analysis:

The coach and players of a football team want to study, query and visualize their performance over the last season.

They will probably use a tool like Excel for storing their dataset, which has limited support for querying, cleaning, analysis or versioning.

Query example: The coach wants to find all the games where a star player was absent.

Most of the team players are not proficient with data analysis tools, such as SQL or scripting languages.

Solution of the problem: Point-and-click apps. These apps offer:

Easy loading, querying, visualizing and sharing of results with other users without much effort.

These teams are unable to perform collaborative data analysis because of the lack of:

1. Flexible data sharing and versioning support

2. Point-and-click apps to help novice users do collaborative data analysis

3. Support for a number of data analysis languages and tools.

A tool for collaborative analysis can be used, for example, by geneticists who want to share and collaborate on genome data with other research groups.

Major Contribution

To address these problems, the paper presents DataHub, a unified data management and collaboration platform for hosting, sharing, combining and collaboratively analyzing datasets.

DataHub has three main components:

1. Flexible data storage, sharing, and versioning capabilities.

a) Keeps track of all versions of a dataset.

b) Enables collaborative analysis, while at the same time allowing these datasets to be stored and retrieved at various stages of analysis.

2. App ecosystem for easy querying, cleaning, and visualization.

a) Distill: data cleaning by example tool.

b) DataQ: a query builder tool that allows users to build SQL queries by direct manipulation in a graphical user interface. The interface is suitable for non-technical users.

c) Dviz: Data visualization tool.

3. Language-agnostic hooks for external data analysis.

For team members who are proficient in different languages and libraries such as Python, R, Scala and Octave, DataHub enables collaborative data analysis by using Apache Thrift to translate between these languages and datasets in DataHub.

Gorilla: A Fast, Scalable, In-Memory Time Series Database

Authors: Tomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, Kaushik Veeraraghavan. (Facebook, Inc)

Presented by: Dardan Xhymshiti

Fall 2015

Major problem

Large-scale internet services (e.g., Facebook) must be highly available and responsive, even in the case of unexpected failures.

These large-scale services run on thousands of systems across many thousands of machines that are located in different geographical areas.

These services also have a global audience of users.

Problems arise if there are no good failure monitoring systems.

Major motivation

The authors are motivated by the problems above to come up with a solution that best ensures the availability and responsiveness of large-scale internet services.

Major contribution

The authors present an in-memory time series database (TSDB) called Gorilla, which every second receives measurement data points (CPU load, error rate, latency) from distributed machines, stores them, and answers queries over them.

Challenge: High data insertion rate, total data quantity, real-time aggregation and reliability.

Measurements are aggregated and then stored, rather than being stored as individual data points.

Gorilla TSDB constraints:

Writes dominate

Always be able to take in tens of millions of data points each second.

State transitions

We want to identify the issues that arise when new changes happen to the system:

New software release

A network cut

Side effect of a configuration change

High availability

If a failure causes disconnections between datacenters, systems operating at these datacenters must be able to write data to local TSDB machines.

Fault tolerance

The writes are replicated to multiple regions, so the data survive a datacenter failure.

Gorilla TSDB

Traditional ACID guarantees are not a core requirement for TSDB.

The writes must succeed at all times, even in the face of disasters.

Recent data points are of higher value than older data points (knowing if a particular system is broken right now is more valuable to an operations engineer than knowing if it was broken an hour ago).

Challenge: speed of query processing, writes and reads.

Solution: Replacing the disk-based database with an in-memory database.

Facts

In Spring 2015, Facebook’s monitoring system generated about 12 million data points per second.

12 million data points per second ≈ 1 trillion data points per day

Problem: 1 trillion data points * 16 bytes = 16 TB of RAM (too resource intensive).

Solution: Using XOR-based floating point compression, a data point was compressed from 16 bytes to an average of 1.37 bytes (a 12x reduction in size).
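The memory arithmetic and the intuition behind XOR-based compression can be checked with a short sketch. This is a simplified illustration, not Gorilla's actual encoder: it only shows that XORing consecutive floating point values tends to leave very few meaningful bits. The helper names and the sample values 12.0 and 24.0 are assumptions made for the example.

```python
import struct

POINTS_PER_DAY = 10**12       # about 1 trillion points per day (from the slide)
RAW_BYTES_PER_POINT = 16      # uncompressed size of one data point
COMPRESSED_BYTES_PER_POINT = 1.37

raw_tb = POINTS_PER_DAY * RAW_BYTES_PER_POINT / 10**12
compressed_tb = POINTS_PER_DAY * COMPRESSED_BYTES_PER_POINT / 10**12
ratio = RAW_BYTES_PER_POINT / COMPRESSED_BYTES_PER_POINT
print(raw_tb, compressed_tb, round(ratio, 1))  # 16 TB raw, 1.37 TB compressed


def float_bits(value):
    """Reinterpret a float as its 64-bit IEEE 754 integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def xor_with_previous(values):
    """XOR each value's bit pattern with that of the preceding value."""
    bits = [float_bits(v) for v in values]
    return [prev ^ cur for prev, cur in zip(bits, bits[1:])]


def meaningful_bits(x):
    """Number of bits between the first and last set bit, inclusive."""
    if x == 0:
        return 0
    trailing_zeros = (x & -x).bit_length() - 1
    return x.bit_length() - trailing_zeros


deltas = xor_with_previous([12.0, 12.0, 24.0])
for d in deltas:
    print(f"{d:064b}", meaningful_bits(d))
```

A repeated value XORs to zero, and a nearby value differs in only a few bits, so an encoder that stores just the meaningful bits needs far fewer than 64 bits per value on slowly changing series.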

Compressed Spatial Hierarchical Bitmap (cSHB) Indexes for Efficiently Processing Spatial Range Query Workloads

Presented by: Shahab Helmi

Fall 2015

Paper Info

Authors:

Publication: VLDB 2015

Type: Research Paper

Motivation: Bitmap-Based Indexing

Bitmap indexes have been shown to be highly effective in answering queries in data warehouses and column-oriented data stores. Why?

1. Efficient implementations of the bitwise logical operations: “AND”, “OR”, and “NOT”;

2. Provide significant opportunities for compression, enabling either reduced I/O or, even, complete in-memory maintenance of large index structures.

3. Query processors can operate directly on compressed bitmaps.
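A minimal sketch of the bitwise operations behind a bitmap index, using Python integers as uncompressed bitmaps. The columns and their values are made up, and compression is deliberately left out.

```python
# Hypothetical table with two columns over 8 rows.
colors = ["red", "blue", "red", "green", "blue", "red", "green", "red"]
sizes = ["S", "M", "M", "S", "L", "S", "M", "L"]


def build_bitmaps(column):
    """One bitmap per distinct value; bit i is set if row i has that value."""
    bitmaps = {}
    for row_id, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps


def matching_rows(bitmap):
    """Row ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


color = build_bitmaps(colors)
size = build_bitmaps(sizes)
all_rows = (1 << len(colors)) - 1  # mask needed to implement NOT

red_and_small = color["red"] & size["S"]   # AND
red_or_blue = color["red"] | color["blue"]  # OR
not_red = ~color["red"] & all_rows          # NOT, restricted to existing rows

print(matching_rows(red_and_small))
print(matching_rows(red_or_blue))
print(matching_rows(not_red))
```

Each predicate is answered with a single machine-level operation per word of the bitmap, which is the source of the efficiency mentioned in point 1.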

Motivation: Quad-Tree Indexing

A quad-tree is a data structure used to divide a 2D region into more manageable parts. It resembles a binary tree, but each internal node has four children instead of two.
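The following is a small sketch of a point quad-tree, assuming a square region and a leaf capacity of one point. The class name, the capacity parameter, and the sample points are illustrative choices, not taken from the paper.

```python
class QuadTree:
    """Point quad-tree over the square [x0, x0 + size) x [y0, y0 + size)."""

    def __init__(self, x0, y0, size, capacity=1):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.points = []
        self.children = None  # becomes a list of four sub-quadrants on split

    def insert(self, x, y):
        if self.children is not None:
            self._child_for(x, y).insert(x, y)
            return
        self.points.append((x, y))
        # The size guard stops endless splitting on duplicate points.
        if len(self.points) > self.capacity and self.size > 1e-9:
            self._split()

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x0 + dx * half, self.y0 + dy * half, half, self.capacity)
            for dy in (0, 1)
            for dx in (0, 1)
        ]
        points, self.points = self.points, []
        for x, y in points:
            self._child_for(x, y).insert(x, y)

    def _child_for(self, x, y):
        half = self.size / 2
        col = 1 if x >= self.x0 + half else 0
        row = 1 if y >= self.y0 + half else 0
        return self.children[2 * row + col]

    def query(self, xmin, ymin, xmax, ymax):
        """Points with xmin <= x <= xmax and ymin <= y <= ymax."""
        if (xmax < self.x0 or xmin >= self.x0 + self.size
                or ymax < self.y0 or ymin >= self.y0 + self.size):
            return []  # the query rectangle misses this quadrant entirely
        if self.children is None:
            return [(x, y) for x, y in self.points
                    if xmin <= x <= xmax and ymin <= y <= ymax]
        found = []
        for child in self.children:
            found.extend(child.query(xmin, ymin, xmax, ymax))
        return found

    def depth(self):
        if self.children is None:
            return 1
        return 1 + max(child.depth() for child in self.children)


tree = QuadTree(0, 0, 8)
sample = [(1, 1), (6, 1), (6, 6), (1, 6), (5, 5)]
for px, py in sample:
    tree.insert(px, py)
print(sorted(tree.query(4, 4, 8, 8)), tree.depth())
```

Dense regions are subdivided more deeply than sparse ones, and a range query skips every quadrant that does not intersect the query rectangle.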

Introduction (1)

The key principle behind most indexing mechanisms is to ensure that data objects closer to each other in the data space are also closer to each other on the storage medium.

Total order in 1D space: easy

Total order in nD: complicated!

Common solution? Partitioning the space hierarchically in such a way that (R/KD trees):

Nearby points fall into the same partition.

Point pairs that are far from each other fall into different partitions.

Alternative?

Mapping the multi-dimensional data to 1D and applying indexing and partitioning on the 1D data.

Introduction (2)

Alternative?

Mapping the multi-dimensional data to 1D and applying indexing and partitioning on the 1D data such that:

Data objects closer to each other in the original space are also closer to each other on the 1D space.

Data objects further away from each other in the original space are also further away from each other on the 1D space.

How? Fractal-based space-filling curves. In particular, the Peano-Hilbert curve and the Z-order curve have been shown to be very effective in helping cluster nearby objects in the space.

Contribution

It is shown that bitmap-based indexing is also an effective solution for managing spatial data sets.

The authors propose compressed spatial hierarchical bitmap (cSHB) indexes to support spatial range queries.

The given 2D space is converted into a 1D space using Z-order traversal.

For spatial query processing:

A cost model was developed.

The best nodes for query processing are chosen according to the cost model.

Contribution (2)

The example range query contains the following 1D ranges (000010, 000011, 001000, 001001, 001010, 001011)
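The six codes listed above can be reproduced with a short Z-order (bit interleaving) sketch. The grid size (3 bits per dimension), the bit order (x bit before y bit in each pair), and the query rectangle (x from 1 to 3, y from 0 to 1) are assumptions chosen because they yield exactly these codes. The original figure may use a different axis convention.

```python
def z_order(x, y, bits=3):
    """Interleave the bits of x and y into a single Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i + 1)  # x bit goes to the odd position
        code |= ((y >> i) & 1) << (2 * i)      # y bit goes to the even position
    return code


def cells_in_range(xmin, xmax, ymin, ymax, bits=3):
    """Sorted Z-order codes of all grid cells inside the rectangle."""
    return sorted(
        z_order(x, y, bits)
        for x in range(xmin, xmax + 1)
        for y in range(ymin, ymax + 1)
    )


def merge_runs(codes):
    """Collapse sorted codes into contiguous (start, end) 1D ranges."""
    runs = []
    for c in codes:
        if runs and c == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], c)
        else:
            runs.append((c, c))
    return runs


codes = cells_in_range(1, 3, 0, 1)
labels = [format(c, "06b") for c in codes]
print(labels)
print(merge_runs(codes))
```

Under these assumptions, one 2D rectangle turns into two contiguous 1D runs, which is what lets a range query be answered by combining a small number of 1D intervals.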

Related Work

Multi-Dimensional Space Partitioning

Quad-tree, BD-tree, G-Tree, and KD-tree.

R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree, and others).

Space Filling Curve based Indexing

Peano-Hilbert curve: better mapping but costly.

Z-order curve: efficient (used in this paper).

Bitmap Indexes

Experimental Results

Datasets: 100 million synthetically generated data points ranging from <−180,−90> to <180, 90>.

A clustered data set from Gowalla, which contains the locations of check-ins made by users.

A clustered data set from OpenStreetMap (OSM).

Experimental Results (2)
