
GraphGen: Exploring Interesting Graphs in Relational Data

Presented by: Zohreh Raghebi

Fall 2015

Authors

Konstantinos Xirogiannopoulos Udayan Khurana Amol Deshpande

University of Maryland, College Park

Introduction

Analyzing the interconnection structure among the underlying entities can provide significant insights and value in many application domains, such as social networks and communication networks.

There is increasing interest in executing a wide variety of graph analysis tasks and graph algorithms (e.g., community detection, influence propagation, network evolution, anomaly detection, centrality analysis).

Introduction

This has led to the development of many specialized graph databases (e.g., Neo4j, Titan, OrientDB)

and graph execution engines (e.g., Apache Giraph, GraphLab, Ligra, Galois, GraphX, X-Stream, to name a few).

Recently several researchers have also investigated the possibility of:

executing graph analysis tasks using a relational database system (e.g., Vertexica, GRAIL, Aster Graph Analytics)

Although such specialized graph data management systems have made significant advances in analyzing graph-structured data,

a large fraction of the data of interest resides in relational database systems

Main Idea

We are building a system, called GRAPHGEN, with the goal:

to make it easy for users to extract a variety of different types of graphs from a relational database

and execute graph analysis tasks over them in memory

GRAPHGEN supports an expressive Domain Specific Language (DSL), based on Datalog,

to specify the graph(s) to be extracted from the relational data.

GRAPHGEN has fundamentally different goals than the recent work on graph analytics using relational databases

(e.g., Vertexica, GRAIL, Aster Graph Analytics, SQL Server-based)
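The extract-then-analyze idea can be sketched in a few lines. This is a minimal illustration only, not GRAPHGEN's implementation: the `author_pub` table, its rows, and the function name are hypothetical, SQLite stands in for the relational engine, and the Datalog-style rule in the comment is illustrative syntax rather than GRAPHGEN's actual DSL. The self-join is pushed to the database, and the resulting graph is analyzed in memory.

```python
import sqlite3
from collections import defaultdict


def extract_coauthor_graph(conn):
    """Build an undirected co-authorship graph as an in-memory adjacency dict."""
    # Illustrative Datalog-style rule for the same graph (hypothetical syntax):
    #   Edges(A1, A2) :- AuthorPub(A1, P), AuthorPub(A2, P).
    rows = conn.execute(
        """
        SELECT DISTINCT a.author, b.author
        FROM author_pub AS a
        JOIN author_pub AS b ON a.pub = b.pub
        WHERE a.author < b.author
        """
    )
    graph = defaultdict(set)
    for u, v in rows:
        graph[u].add(v)
        graph[v].add(u)
    return graph


# Hypothetical relational data: which author wrote which publication.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author_pub (author TEXT, pub TEXT)")
conn.executemany(
    "INSERT INTO author_pub VALUES (?, ?)",
    [("ann", "p1"), ("bob", "p1"), ("bob", "p2"), ("cat", "p2"), ("dan", "p3")],
)

graph = extract_coauthor_graph(conn)

# A trivial in-memory analysis task: degree of every author with a co-author.
degree = {author: len(neighbors) for author, neighbors in graph.items()}
```

Authors without co-authors (here `dan`) do not appear in the edge-derived graph; a fuller sketch would load the node set separately.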

Related works

In Vertexica and GRAIL, a graph is normalized and stored in the relational database.

Those works do not consider the problem of extracting graphs from relational data,

and can only execute analysis tasks that can be written using the vertex-centric programming framework.

GRAPHGEN, on the other hand, pushes some computation to the relational engine,

but executes most of the complex graph algorithms in memory on a graph representation of the data.

This allows GRAPHGEN to execute more complex analysis tasks like:

community detection, dense subgraph detection, matching, etc., as long as the extracted graph fits in memory

Related works

Ringo has somewhat similar goals to GraphGen and provides operators:

to convert from an in-memory relational table representation to a graph representation

however it does not provide an expressive declarative DSL for graph extraction

and does not consider the optimizations to reduce the memory requirements

Ringo does support a large library of built-in graph algorithms,

and we plan to support Ringo as a front-end analytics engine for our system

TreeScope: Finding Structural Anomalies In Semi-Structured Data

Presented by: Zohreh Raghebi

Fall 2015

Authors

Shanshan Ying, Advanced Digital Sciences Center

Flip Korn, Google

Barna Saha, University of Massachusetts

Divesh Srivastava, AT&T Labs

Introduction

Semi-structured data are prevalent on the web and in NoSQL document databases,

with formats such as XML (eXtensible Markup Language) and JSON (JavaScript Object Notation).

Their popularity is due to their generality, flexibility and easy customization.

However, these benefits come at the cost of being prone to:

a range of data quality errors, from errors in content to errors in structure

Errors in content have been well studied in the literature,

but very little attention has been paid to errors in structure.

Motivation

This is based on the assumption:

once data are valid according to the specified schema (DTD or XSD for XML data)

there can be no errors in their structure

We have found this assumption to often be incorrect

We observe that DTD/XSD specifications for heterogeneous XML data sets tend to be quite liberal,

allowing semantically incorrect (though syntactically valid) data to creep into the data sets.

The existence of such errors can lead to incorrect results on queries

and, even worse, result in poor data-driven decisions.

Illustrative examples of such errors can be found in the well-known and widely used DBLP Computer Science Bibliography data set.

Main Idea

In this work, we present TREESCOPE, to analyze semi-structured data sets

with the goal of automatically identifying potential structural errors in the data

A key insight is that it is not necessary to learn a precise schema to identify structural errors

it is sufficient to learn robust structural models of subsets of the semi-structured data with high support

and identify structural anomalies as violations of the learned models

Main Idea

TREESCOPE learns robust structural models:

through a controlled exploration of the lattice structure of context path expressions that have high support,

computing frequency distributions of candidate target tags under each context,

to find those structural models that exhibit a significant skew in their frequency distributions
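A much-simplified sketch of this idea follows. It is not TREESCOPE's algorithm: it treats every root-to-parent tag path as a context, skips the lattice exploration entirely, and uses two hypothetical thresholds (`min_support`, `max_ratio`) as a crude stand-in for a skew test. The XML snippet and its tag names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def child_tag_distributions(root):
    """Map each context path (e.g. 'bib/article') to a Counter of child tags."""
    dist = defaultdict(Counter)
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        for child in node:
            dist[path][child.tag] += 1
            stack.append((child, path + "/" + child.tag))
    return dist


def find_anomalies(root, min_support=5, max_ratio=0.2):
    """Flag tags that are rare under a context that has enough support."""
    anomalies = []
    for path, counts in child_tag_distributions(root).items():
        total = sum(counts.values())
        if total < min_support:
            continue  # too little evidence to trust this context
        for tag, n in counts.items():
            if n / total <= max_ratio:
                anomalies.append((path, tag, n, total))
    return sorted(anomalies)


# Hypothetical data: five regular articles and one with a stray <chapter>.
doc = (
    "<bib>"
    + "<article><title/><author/></article>" * 5
    + "<article><title/><author/><chapter/></article>"
    + "</bib>"
)
root = ET.fromstring(doc)
anomalies = find_anomalies(root)
```

Here the only flagged pair is `chapter` under `bib/article`, which occurs once among 13 child elements of that context.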

A Time Machine for Information: Looking Back to Look Forward

Xin Luna Dong (Google Inc.), Wang-Chiew Tan (UC Santa Cruz)

Presented by: Omar Alqahtani

Fall 2015

Motivation

To develop a complete understanding of the history of an entity.

To depict trends over time.

Why is this difficult?

The lack of explicit temporal data.

The lack of tools for interpreting such data.

Many of the challenges that occur in data integration and knowledge curation.

The need to make every step of the data integration process time-aware.

Example

To illustrate the limitations of current knowledge management techniques:

Query on Google search: “Google CEO in 2015”. Google search returns a speech by the “ex-CEO”.

Another example: “Google’s CEO before Larry Page”. Google search returns articles about Larry Page.

Time Machine for Information

Anyone can easily and incrementally ingest temporal data in order to:

Form a more comprehensive understanding of entities over time,

Search and query facts for a particular time period,

Understand trending patterns over time, and

Perform analytics.
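The two kinds of queries from the example slide ("who held a role in a given year" and "who held it before someone else") become simple once facts carry explicit validity intervals. The sketch below is only an illustration of that point: the `TemporalFact` type, the helper functions, and the `ExampleCorp` data are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemporalFact:
    entity: str
    attribute: str
    value: str
    start: int           # first year in which the fact holds
    end: Optional[int]   # last year in which it holds; None means still valid


def value_at(facts, entity, attribute, year):
    """Return the value that held for (entity, attribute) in the given year."""
    for f in facts:
        if (f.entity, f.attribute) != (entity, attribute):
            continue
        if f.start <= year and (f.end is None or year <= f.end):
            return f.value
    return None


def value_before(facts, entity, attribute, value):
    """Return the value that held immediately before `value` took over."""
    history = sorted(
        (f for f in facts if (f.entity, f.attribute) == (entity, attribute)),
        key=lambda f: f.start,
    )
    for prev, cur in zip(history, history[1:]):
        if cur.value == value:
            return prev.value
    return None


# Hypothetical facts about a made-up company.
facts = [
    TemporalFact("ExampleCorp", "CEO", "Alice", 2001, 2010),
    TemporalFact("ExampleCorp", "CEO", "Bob", 2011, None),
]

ceo_in_2015 = value_at(facts, "ExampleCorp", "CEO", 2015)
ceo_before_bob = value_before(facts, "ExampleCorp", "CEO", "Bob")
```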

A Demonstration of the BigDAWG Polystore System

Presented by: Shahab Helmi

Fall 2015

Paper Info

Authors:

Publication: VLDB 2015

Type: Demonstration Paper

Introduction / Motivation

“One size does not fit all”:

MIMIC II is a publicly accessible dataset covering about 26,000 ICU admissions at Boston’s Beth Israel Deaconess Hospital. Its data is spread across several systems:

Waveform data (up to 125 Hz measurements from bedside devices):

SciDB: stored as time series (arrays).

S-Store: device stream information.

Patient metadata (name, age, …): PostgreSQL.

Doctors’ and nurses’ notes (text): Apache Accumulo, which stores the associated text data in a key-value store.

Lab results and prescriptions filled (both semi-structured data).

Historical data + real-time feeds from current patients.


It is hard for programmers to:

Use different databases in their applications.

Learn new query languages.

BigDAWG

BigDAWG is a reference implementation of a new architecture for “Big Data” applications, from the Intel Science and Technology Center (ISTC).

Its UI provides:

Data browsing.

Exploratory Analysis.

Complex Analytics (linear regression, fast Fourier transform, …).

Real-Time Monitoring.

BigDAWG Architecture

The goal is to enable users to enjoy the performance advantages of multiple vertically-integrated systems (such as column stores, NewSQL engines, and array stores) without sacrificing the expressiveness of their queries or burdening the user with learning multiple front-end languages.

Each island is a front-facing abstraction for the user, and it includes a query language, data model, and a set of connectors or shims for interacting with the underlying storage engines that it is federating.

Real-Time Analytical Processing with SQL Server

Presented by: Shahab Helmi

Fall 2015

Paper Info

Authors:

Publication: VLDB 2015

Type: Industry Paper


Introduction / Motivation

Transactional processing (OLTP) and analytical processing are traditionally separated and run on different systems.

This separation reduces the load on transactional systems, which makes it easier to ensure consistent throughput and response times for business-critical applications.

Users are increasingly demanding access to ever fresher data, including for analytical purposes. The freshest data resides on transactional systems, so the most up-to-date results are obtained by running analytical queries directly against the transactional database.

Introduction / Motivation (2)

Over the last two releases, SQL Server has added columnstore indexes (CSI) and batch mode (vectorized) processing to speed up analytical queries, and the Hekaton in-memory engine to speed up OLTP transactions.

Each feature works for a specific workload:

Columnstore indexes are optimized for large scans, but operations such as point lookups or small range scans also require a complete scan, which is prohibitively expensive.

Lookups are very fast in in-memory tables, but complete table scans are expensive because of the large numbers of cache and TLB misses and the high instruction and cycle count associated with row-at-a-time processing.

SQL Server 2016 will include several enhancements that are targeted primarily for such hybrid workloads.

SQL Server 2016 Enhancements on Hybrid Workloads

1. Columnstore indexes on in-memory tables. These greatly speed up queries that require complete table scans.

2. Updatable secondary columnstore indexes. Secondary CSIs on disk-based tables were introduced in SQL Server 2012. However, adding a CSI makes the table read-only. This limitation will be remedied in SQL Server 2016.

3. B-tree indexes on primary columnstore indexes. A CSI works well for data warehousing applications but not for point lookups and small range scans, which require a complete scan. To speed up such operations, users will be able to create normal B-tree indexes on primary columnstores.

4. Columnstore scan improvements. The new scan operator makes use of SIMD (single instruction, multiple data) instructions, and the handling of filters and aggregates has been extended and improved. Faster scans speed up many analytical queries considerably.
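The trade-off behind enhancements 1 and 3 can be illustrated with a toy model. This is a conceptual sketch in plain Python, not SQL Server code and not a performance benchmark: the table contents and all names are hypothetical, a dict stands in for an index, and lists stand in for column segments. It shows why a column layout needs a secondary index for point lookups, while aggregates only need to touch one column.

```python
# Hypothetical table contents.
rows = [
    {"id": 1, "region": "east", "amount": 10.0},
    {"id": 2, "region": "west", "amount": 20.0},
    {"id": 3, "region": "east", "amount": 30.0},
]

# Row layout with an index on id: point lookups go straight to the row.
row_index = {r["id"]: r for r in rows}

# Column layout: one list per column, aligned by position.
columns = {name: [r[name] for r in rows] for name in ("id", "region", "amount")}

# Secondary index on the column layout (analogous in spirit to a B-tree on a
# primary columnstore): maps a key to its position in every column.
position_index = {key: pos for pos, key in enumerate(columns["id"])}


def lookup_row_store(key):
    return row_index[key]


def lookup_column_store_scan(key):
    # Without an index, the whole id column has to be scanned.
    pos = columns["id"].index(key)
    return {name: col[pos] for name, col in columns.items()}


def lookup_column_store_indexed(key):
    pos = position_index[key]
    return {name: col[pos] for name, col in columns.items()}


def total_amount_row_store():
    # Row-at-a-time: every row is visited even though one field is needed.
    return sum(r["amount"] for r in rows)


def total_amount_column_store():
    # Column-at-a-time: only the amount column is read.
    return sum(columns["amount"])
```

All three lookup functions return the same row and both aggregates return the same total; only the access pattern differs.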

SQL Server 2016 Enhancements on Hybrid Workloads (2)

Experimental Results

Cost of inserts, updates, deletes and their effect on query time

Experimental Results (2)

Scan speedup using CSI


Query performance with and without scan enhancements (using 12 cores)


Comparing performance of basic operations with and without SIMD instructions

Efficient Evaluation of Object-Centric Exploration Queries for Visualization

Type: Research Paper

Authors: You Wu, Jun Yang (Duke University), Boulos Harb, Cong Yu (Google)

Presented by: Siddhant Kulkarni

Term: Fall 2015

Motivation

Effective way to explore data? Example:

Player LeBron James has scored 35 or more points in every one of the last k games

Identifying clusters and outliers
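To judge how remarkable such a claim is, the same condition can be evaluated for every object (here, every player) and the results compared. The sketch below does this exactly over hypothetical data; it does not implement the paper's sampling-based evaluation, and the player names, scores, and function name are invented for illustration.

```python
def recent_streak(points, threshold):
    """Count the most recent consecutive games with at least `threshold` points."""
    streak = 0
    for p in reversed(points):
        if p < threshold:
            break
        streak += 1
    return streak


# Hypothetical per-player point totals, oldest game first.
games = {
    "player_a": [28, 36, 40, 35],
    "player_b": [35, 20, 37],
    "player_c": [10, 12, 15],
}

threshold = 35
streaks = {player: recent_streak(pts, threshold) for player, pts in games.items()}

# Players for whom "35 or more points in each of the last k games" holds.
k = 3
supporters = sorted(p for p, s in streaks.items() if s >= k)
```

Plotting `streaks` for all players is one way to see whether the claimed object is an outlier or part of a cluster.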

Example

Contribution

Related work focuses on other visualization techniques.

Efficient evaluation through sampling:

Ssparse

SSketch

Proposed algorithm to consider the allocated budget

A Demonstration of AQWA: Adaptive Query-Workload-Aware Partitioning of Big Spatial Data

Type: Demonstration Paper

Authors: Ahmed M. Aly, Ahmed S. Abdelhamid, Ahmed R. Mahmood, Walid G. Aref, Mohamed S. Hassan (Purdue), Hazem Elmeleegy (Turn Inc.), Mourad Ouzzani (Qatar Computing Research Institute)

Presented by: Siddhant Kulkarni

Term: Fall 2015

Motivation

Too many location-aware devices!

Too much location data!

Other systems use static data partitioning.

The idea

Four steps:

1. Initialization

2. Query Execution

3. Data Acquisition

4. Repartitioning

Contribution

AQWA – adapts to query workload and data distribution and incrementally updates the partitioning!

Deployed on Hadoop

Spark
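The intuition behind workload-aware repartitioning can be sketched in one dimension. This is a simplified illustration under stated assumptions, not AQWA's actual algorithm or cost model: data points and queries are intervals on a line, the cost of a query is assumed to be the number of points in every partition it overlaps, and the candidate cut positions are supplied by hand. All names and data are hypothetical.

```python
from bisect import bisect_left


def points_in(points, lo, hi):
    """Count points p with lo <= p < hi; `points` must be sorted."""
    return bisect_left(points, hi) - bisect_left(points, lo)


def workload_cost(points, partitions, queries):
    """Total points read, assuming a query scans every partition it overlaps."""
    cost = 0
    for q_lo, q_hi in queries:
        for p_lo, p_hi in partitions:
            if q_lo < p_hi and p_lo < q_hi:
                cost += points_in(points, p_lo, p_hi)
    return cost


def best_split(points, partition, queries, candidates):
    """Return (cost, partitions) for the cheapest option, including no split."""
    lo, hi = partition
    best = (workload_cost(points, [partition], queries), [partition])
    for cut in candidates:
        if lo < cut < hi:
            parts = [(lo, cut), (cut, hi)]
            cost = workload_cost(points, parts, queries)
            if cost < best[0]:
                best = (cost, parts)
    return best


# Hypothetical data: a few points near the origin, more points far away,
# and a query workload that only ever touches the region near the origin.
points = sorted([1, 2, 3, 4, 50, 60, 70, 80, 90, 95])
queries = [(0, 5), (1, 4), (2, 5)]

cost, parts = best_split(points, (0, 100), queries, candidates=[5, 25, 50, 75])
```

With this workload, splitting at 5 isolates the frequently queried region, so each query reads 4 points instead of all 10. Re-running `best_split` as new queries and data arrive mirrors the incremental flavor of the approach.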
