presented by: zohreh fall 2015. nandish jayaram university of texas at arlington sidharth goyal ...
![Page 1: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/1.jpg)
VIIQ: Auto-Suggestion Enabled Visual Interface for Interactive Graph Query Formulation
Presented by: Zohreh
Fall 2015
![Page 2: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/2.jpg)
Authors
Nandish Jayaram
University of Texas at Arlington
Sidharth Goyal
University of Texas at Arlington
Chengkai Li
University of Texas at Arlington
![Page 3: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/3.jpg)
Introduction
An unprecedented proliferation of heterogeneous graphs
with thousands of node/edge types
Complex relationships in schema-less data
Query graphs are used to:
specify the query intent for such graphs
Formulating these query graphs is a daunting task
It requires users to know a vocabulary comprising many labels and types
![Page 4: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/4.jpg)
Introduction
Graph query systems allow users to construct query graphs
through a visual interface
The focus of these systems is query processing
Their query formulation components are limited to a graphical platform
for adding nodes and edges with ease using mouse and keyboard actions
Little help is offered in choosing the labels
of the various components in a query graph
![Page 5: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/5.jpg)
Introduction
Every time a new query component is added
Users are inundated with possibly hundreds of options
For the new component’s label, sorted alphabetically.
It is a daunting task to browse through all the options
To select the appropriate label to add
![Page 6: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/6.jpg)
Related works
There are other querying paradigms that help users query graph data
Declarative languages like SPARQL can specify query intent exactly
But present a usability barrier
Other paradigms simplify query formulation
Keyword search, approximate graph query and query-by-example
But cannot be used to specify users’ exact query intent
Existing systems help users specify queries either easily or exactly,
But not both
![Page 7: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/7.jpg)
VIIQ (Visual Interface for Interactive graph Query formulation)
helps users easily construct various query graph components
VIIQ automatically suggests new edges and nodes to add
To a partially constructed query graph
Users can also add nodes or edges manually,
Whose labels are ranked
and presented based on how likely they are to be of interest to the user
VIIQ is the first visual query formulation system
that actively makes ranked suggestions
![Page 8: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/8.jpg)
Contribution
VIIQ supports two modes of operation, passive and active
By default VIIQ operates in passive mode
Based on the partially constructed query graph
the system automatically recommends top-k new edges
relevant to the user’s query intent
![Page 9: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/9.jpg)
Fig. 3 shows the snapshot of a partially constructed query graph, with nodes and edges suggested in passive mode.
The nodes in grey and the edges incident on them are the new automatic suggestions made by the system.
![Page 10: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/10.jpg)
Contribution
The active mode is triggered when
the user adds new nodes or edges to the partial query graph
For a newly added node, the suggested labels are displayed hierarchically
In a pop-up box
For a newly added edge, the suggested edge labels are ranked
based on the likelihood of their relevance to the user’s query intent
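The ranking idea behind active mode can be illustrated with a small sketch. The slides do not say how VIIQ estimates relevance, so this example assumes a hypothetical log of past queries and scores each candidate edge label by how often it co-occurred with labels already in the partial query. The function names, the log, and the labels are all illustrative.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(query_log):
    """Count how often two edge labels appear together in a past query."""
    counts = Counter()
    for labels in query_log:
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


def rank_edge_labels(partial_query, candidates, counts, k=3):
    """Return the top-k candidates by total co-occurrence with the partial query."""
    def score(label):
        return sum(counts[tuple(sorted((label, used)))] for used in partial_query)

    # Sort by descending score; ties are broken alphabetically for determinism.
    return sorted(candidates, key=lambda c: (-score(c), c))[:k]


# Hypothetical log: each entry is the set of edge labels used in one earlier query graph.
query_log = [
    {"founded", "headquartered_in", "nationality"},
    {"founded", "headquartered_in"},
    {"founded", "education"},
    {"starring", "directed_by"},
]
counts = build_cooccurrence(query_log)
suggestions = rank_edge_labels(
    {"founded"},
    ["directed_by", "education", "headquartered_in", "nationality"],
    counts,
)
print(suggestions)
```

A ranked list like this replaces the alphabetical list of hundreds of labels with a short list ordered by estimated relevance.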
![Page 11: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/11.jpg)
On Uncertain Graphs Modeling and Queries
Presented by: Zohreh
Fall 2015
![Page 12: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/12.jpg)
Authors
Arijit Khan∗, Lei Chen†
∗ Systems Group, ETH Zurich, Switzerland
† The Hong Kong University of Science and Technology
![Page 13: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/13.jpg)
Introduction
Availability of network data has increased dramatically
Uncertainty is evident in graph data due to a variety of reasons
such as noisy measurements and inconsistent, incorrect, or possibly ambiguous information sources
In these cases, data is represented as an uncertain graph
Nodes, edges, and attributes are accompanied by a probability of existence
![Page 14: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/14.jpg)
MODELING OF UNCERTAIN GRAPHS
Uncertainty model: independent probabilities
Components in the graph are assumed independent of one another
Interprets uncertain graphs according to the well-known possible-world semantics
For example, an uncertain graph with m edges yields 2^m possible deterministic graphs
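The possible-world semantics can be made concrete with a short sketch. It assumes independent edge probabilities, enumerates all 2^m deterministic graphs of a tiny hypothetical uncertain graph, and sums the probabilities of the worlds in which one node reaches another. The graph and its probabilities are made up for illustration.

```python
from itertools import product


def possible_worlds(edge_probs):
    """Yield (edges_present, probability) for every subset of edges, assuming independence."""
    edges = list(edge_probs)
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        world = []
        for edge, present in zip(edges, mask):
            p = edge_probs[edge]
            prob *= p if present else (1.0 - p)
            if present:
                world.append(edge)
        yield world, prob


def reachable(world, source, target):
    """Depth-first search over the directed edges of one deterministic world."""
    adjacency = {}
    for u, v in world:
        adjacency.setdefault(u, []).append(v)
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


# Hypothetical uncertain graph with m = 3 edges, each existing with probability 0.5.
edge_probs = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5}
worlds = list(possible_worlds(edge_probs))
reach_prob = sum(p for world, p in worlds if reachable(world, "a", "c"))
print(len(worlds), reach_prob)
```

The enumeration doubles with every added edge, which is why this exact approach only works on toy graphs.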
![Page 15: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/15.jpg)
Correlated Probabilities
The independent model ignores the correlations among various graph components
For example, in a traffic network:
If a road is crowded at a certain point in time
There are a few works that model such correlations with conditional probabilities
![Page 16: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/16.jpg)
Challenges: semantics and computation
From the perspective of semantics:
There is no uniform model of uncertain graphs;
Assignment and interpretation of the probabilities
are application specific.
How should the shortest path between two nodes in an uncertain graph be defined?
The definition could depend on the application and the specific uncertainty semantics
![Page 17: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/17.jpg)
Challenges: computation perspective
Many graph problems, such as subgraph isomorphism, are intrinsically hard
and become more expensive over uncertain graphs
Even the simplest graph queries, such as reachability and shortest path, become #P-complete
Therefore, exact computation is almost infeasible
with today’s large-scale graph data
The current focus is on designing approximation algorithms
With efficient sampling, indexing, and filtering strategies
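As an illustration of the sampling strategy mentioned above, the following sketch estimates a reachability probability by Monte Carlo sampling instead of enumerating every possible world. It assumes independent edge probabilities and uses a made-up three-edge graph. It is a plain sampler, not any specific algorithm from the paper.

```python
import random


def sample_world(edge_probs, rng):
    """Draw one deterministic graph: keep each edge independently with its probability."""
    return [edge for edge, p in edge_probs.items() if rng.random() < p]


def is_reachable(world, source, target):
    """Depth-first search over the directed edges of the sampled world."""
    adjacency = {}
    for u, v in world:
        adjacency.setdefault(u, []).append(v)
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def estimate_reachability(edge_probs, source, target, samples=20000, seed=0):
    """Fraction of sampled worlds in which target is reachable from source."""
    rng = random.Random(seed)
    hits = sum(
        is_reachable(sample_world(edge_probs, rng), source, target)
        for _ in range(samples)
    )
    return hits / samples


# Hypothetical graph; the exact answer is 1 - (1 - 0.5) * (1 - 0.25) = 0.625.
toy_graph = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5}
estimate = estimate_reachability(toy_graph, "a", "c")
print(estimate)
```

The cost grows with the number of samples rather than with 2^m, at the price of an approximate answer.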
![Page 18: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/18.jpg)
MAJOR OPEN PROBLEMS
Exact computation is infeasible
over large-scale uncertain graphs
It is important to identify the application areas
and their trade-offs, e.g., efficiency vs. effectiveness
The semantics of many classical graph operations need to be re-defined
e.g., centrality measures and graph partitioning
![Page 19: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/19.jpg)
A Demonstration of TripleProv: Tracking and Querying Provenance over Web Data
Presented by: Ashkan Malekloo
Fall 2015
![Page 20: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/20.jpg)
Sharing and Reproducing Database Applications
Type: Demonstration paper
Venue: VLDB 15
Authors: Marcin Wylot, Philippe Cudré-Mauroux, Paul Groth
![Page 21: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/21.jpg)
Introduction
Heterogeneity of RDF data
Ease of integration
Examples:
one may want to analyze which sources were instrumental in providing results
How data sources were combined
Filtering the result
![Page 22: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/22.jpg)
Introduction
Find me all the titles of articles about “Obama” but derive the answer only from sources attributed to “US News”.
![Page 23: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/23.jpg)
Introduction
No current triple store is able to automatically derive provenance data for the results it produces or to tailor queries with provenance data.
Storing Quadruples
Named Graphs
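The quadruple idea can be sketched in a few lines. Each triple carries a fourth field naming its source, and the query keeps only data from the allowed source. This is an illustration with made-up data and a hypothetical `titles_about` helper. It is not how TripleProv or any particular triple store implements provenance.

```python
from typing import NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # provenance: where the triple came from


# Hypothetical data set mixing two sources.
quads = [
    Quad("article1", "about", "Obama", "US News"),
    Quad("article1", "title", "Election recap", "US News"),
    Quad("article2", "about", "Obama", "Other Blog"),
    Quad("article2", "title", "Opinion piece", "Other Blog"),
    Quad("article3", "about", "Economy", "US News"),
    Quad("article3", "title", "Market update", "US News"),
]


def titles_about(topic, allowed_source, data):
    """Titles of articles about `topic`, derived only from quads of `allowed_source`."""
    trusted = [q for q in data if q.source == allowed_source]
    subjects = {q.subject for q in trusted if q.predicate == "about" and q.obj == topic}
    return sorted(
        q.obj for q in trusted if q.predicate == "title" and q.subject in subjects
    )


obama_titles = titles_about("Obama", "US News", quads)
print(obama_titles)
```

In this sketch the provenance filter is hand-written inside the query, so it shows only the filtering half of the problem and does not report which sources contributed to a result.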
![Page 24: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/24.jpg)
TripleProv
A new RDF database system supporting the transparent and automatic derivation of detailed provenance information for arbitrary queries and the execution of queries incorporating provenance predicates
It is based on a native RDF store
It enables tracing provenance at two different granularity levels
![Page 25: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/25.jpg)
SAASFEE: Scalable Scientific Workflow Execution Engine
Marc Bux, Jörgen Brandt, Carsten Lipka
Kamal Hakimzadeh, Jim Dowling, Ulf Leser
Demonstration Paper
2015 VLDB
Presented by: Omar Alqahtani
Fall 2015
![Page 26: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/26.jpg)
Motivation
Scientific data is analyzed by complex pipelines composed of highly specialized, domain-dependent tools.
Scientific workflow management systems (SWfMSs) facilitate the design, implementation, execution, optimization, monitoring, and exchange of such heterogeneous pipelines.
![Page 27: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/27.jpg)
Existing SWfMSs
Roughly divided into three groups:
Taverna, Kepler, Galaxy
Askalon, Pegasus
YARN, Mesos
No platform is capable of:
Embracing the ever-evolving research tools
Scaling to very large data sets
Executing arbitrarily complex workflows.
![Page 28: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/28.jpg)
SAASFEE
It is a SWfMS which runs arbitrarily complex workflows on Hadoop YARN.
SAASFEE workflows are specified in Cuneiform.
Cuneiform workflows are executed on Hi-WAY.
![Page 29: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/29.jpg)
Capabilities
The ability to execute iterative workflows,
An adaptive task scheduler,
Re-executable provenance traces,
Compatibility with selected other workflow systems
![Page 30: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/30.jpg)
Tutorial: SQL-on-Hadoop Systems
Presented by: Ranjan
Fall 2015
![Page 31: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/31.jpg)
Why SQL? Data exploration
Structured data
organization of the data in tables
optimized data access
Declarative data processing
No need to have developer skills
Portable – universal language
SQL drivers supported
No need of Hadoop client installation
Easier integration with the current systems
![Page 32: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/32.jpg)
Hadoop overview
![Page 33: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/33.jpg)
Three factors further complicate query optimization in Hadoop systems:
First, in the world of Hadoop and HDFS data, complex data types such as arrays, maps, and structs, as well as JSON data, are more prevalent.
Second, users rely heavily on UDFs (user-defined functions) to express business logic that is sometimes very awkward to express in SQL itself.
Third, there is often little control over HDFS. Files can be added or modified outside the tight control of a query engine, making statistics maintenance a challenge.
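The second point can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for a SQL-on-Hadoop engine and registers a hypothetical `loyalty_tier` function as a UDF. The table, the data, and the business rule are assumptions for illustration. The engine can call the function from SQL but cannot inspect the logic inside it.

```python
import sqlite3


def loyalty_tier(total_spent):
    """Hypothetical business rule, written in the host language rather than in SQL."""
    if total_spent >= 1000:
        return "gold"
    if total_spent >= 100:
        return "silver"
    return "basic"


conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("loyalty_tier", 1, loyalty_tier)
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 700.0), ("ann", 400.0), ("bob", 150.0), ("cy", 20.0)],
)
rows = conn.execute(
    "SELECT customer, loyalty_tier(SUM(amount)) AS tier "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
```

Because the function body is a black box to the engine, an optimizer has little information about its cost or its output, which is one reason heavy UDF use complicates optimization.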
![Page 34: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/34.jpg)
Hive
The first SQL-on-Hadoop offering that provided an SQL-like query language, called HiveQL, and used the MapReduce run-time to execute queries.
Hadapt
Hadapt, which spun out of the HadoopDB research project, was the first commercial SQL-on-Hadoop offering. Hadapt and HadoopDB replaced the file-oriented HDFS storage formats with DBMS-oriented storage, including column-store.
![Page 35: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/35.jpg)
Spark
Spark is a fast, general-purpose cluster computing engine that is compatible with Hadoop data and tries to address the shortcomings of MapReduce.
Systems that use Spark as their run-time for SQL processing: Shark, Hive on Spark, and Spark SQL.
Cloudera Impala
A fully integrated MPP SQL query engine.
Impala reads at almost disk bandwidth and is typically able to saturate all available disks.
![Page 36: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/36.jpg)
IBM Big SQL
It leverages IBM’s state-of-the-art relational database technology to process standard SQL queries over HDFS data, supporting all common Hadoop file formats, without introducing any proprietary formats.
Apache Drill
Provides SQL-like declarative processing over self-describing semi-structured data.
Its focus is on analyzing data without imposing a fixed schema or creating tables in a catalog like Hive MetaStore.
Splice Machine
Splice Machine provides SQL support over HBase data using Apache Derby, targeting both operational as well as analytical workloads.
![Page 37: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/37.jpg)
Phoenix
Phoenix provides SQL querying over HBase via an embeddable JDBC driver built for high performance and read/write operations
![Page 38: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/38.jpg)
Collaborative Data Analytics with DataHub
Authors: Anant Bhardwaj, Amol Deshpande, Aaron J. Elmore, David Karger, Sam Madden, Aditya Parameswaran, Harihar Subramanyam, Eugene Wu, Rebecca Zhang.
Type: Demonstration paper
Presented by: Dardan Xhymshiti
Fall 2015
![Page 39: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/39.jpg)
Major problem
Organizations and companies collect data from various sources like:
Financial transactions,
Server logs,
Sensor data etc.
Teams and individuals inside the company want to extract knowledge from these datasets using their home-grown tools, company tools, and different programming languages. They modify the datasets (normalization, cleaning) and then exchange them back and forth.
Problem: collaborative data analysis.
Heterogeneity of tools, diversity in the skill sets of individuals and teams, and difficulties in sorting, retrieving, and versioning the exchanged datasets.
![Page 40: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/40.jpg)
Major motivation
The authors motivate their work by providing two examples:
Example 1: Expert analysis:
Members of a web advertising team want to extract knowledge from unstructured ad-click data. They write a script for extracting the task-relevant information from the data, and store it as a separate dataset which will be shared across the team.
Problems:
Different team members may be more comfortable with a particular tool (R, Python, Awk) and use these tools to clean, normalize and summarize the dataset.
More proficient members use multiple languages for different purposes:
• Modeling in R.
• Visualization in JavaScript
• String extraction in Awk etc.
![Page 41: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/41.jpg)
The team members manage the dataset versions by recording them in file names such as table_v1, table_v1.1 ….
Versioning becomes difficult to manage when there are a hundred dataset versions. The final result…:
![Page 43: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/43.jpg)
Example 2: Novice analysis:
The coach and players of a football team want to study, query and visualize their performance over the last season.
They will probably use a tool like Excel for storing their dataset, which has limited support for querying, cleaning, analysis or versioning.
Query example: The coach wants to find all the games where a star player was absent.
Most of the team players are not proficient with data analysis tools, such as SQL or scripting languages.
Solution of the problem: Point-and-click apps. These apps make it easy to load, query, visualize and share results with other users without much effort.
![Page 44: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/44.jpg)
These teams are unable to perform collaborative data analysis because of the lack of:
1. Flexible data sharing and versioning support
2. Point-and-click apps to help novice users do collaborative data analysis
3. Support for a number of data analysis languages and tools.
A tool for collaborative analysis can be used, for example, by geneticists who want to share and collaborate on genome data with other research groups.
![Page 45: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/45.jpg)
Major Contribution
To address these problems the paper presents DataHub, a unified data management and collaboration platform for hosting, sharing, combining and collaboratively analyzing datasets.
DataHub has three main components:
1. Flexible data storage, sharing, and versioning capabilities.
a) Keeps track of all versions of a dataset.
b) Enables collaborative analysis, while at the same time allowing these datasets to be stored and retrieved at various stages of analysis.
2. App ecosystem for easy querying, cleaning, and visualization.
a) Distill: a data-cleaning-by-example tool.
b) DataQ: a query builder tool that allows users to build SQL queries by direct manipulation in a graphical user interface. The interface is suitable for non-technical users.
c) Dviz: a data visualization tool.
![Page 46: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/46.jpg)
3. Language-agnostic hooks for external data analysis.
For team members who are proficient in different languages and libraries such as Python, R, Scala and Octave, DataHub enables collaborative data analysis by using Apache Thrift to translate between these languages and datasets in DataHub.
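The dataset versioning capability listed as the first component can be pictured with a minimal sketch. It assumes each version stores a full copy of the rows plus a pointer to its parent, which is the simplest possible design and not necessarily what DataHub does. The `VersionStore` class and its methods are hypothetical.

```python
import hashlib


class VersionStore:
    """Toy store that keeps every committed version of a dataset."""

    def __init__(self):
        self.versions = {}  # version id -> (parent id, rows)

    def commit(self, rows, parent=None):
        """Save a snapshot of `rows` and return its content-derived id."""
        payload = repr((parent, rows)).encode("utf-8")
        version_id = hashlib.sha1(payload).hexdigest()[:8]
        self.versions[version_id] = (parent, list(rows))
        return version_id

    def checkout(self, version_id):
        """Return a copy of the rows stored under `version_id`."""
        return list(self.versions[version_id][1])

    def history(self, version_id):
        """Walk parent pointers from `version_id` back to the first version."""
        chain = []
        while version_id is not None:
            chain.append(version_id)
            version_id = self.versions[version_id][0]
        return chain


store = VersionStore()
raw = store.commit([("game1", " Smith "), ("game2", "jones")])
cleaned = store.commit([("game1", "Smith"), ("game2", "Jones")], parent=raw)
print(store.history(cleaned))
```

Compared with file names such as table_v1 and table_v1.1, explicit parent pointers make the lineage of every version recoverable.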
![Page 47: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/47.jpg)
Gorilla: A Fast, Scalable, In-Memory Time Series Database
Authors: Tomas Pelkonen, Scott Franklin, Justin Teller, Paul Cavallaro, Qi Huang, Justin Meza, Kaushik Veeraraghavan. (Facebook, Inc)
Presented by: Dardan Xhymshiti
Fall 2015
![Page 48: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/48.jpg)
Major problem
Large-scale internet services (e.g., Facebook) must be highly available and responsive even in case of unexpected failures.
These large-scale services run on thousands of systems across many thousands of machines located in different geographical areas.
These services also have a global audience of users.
Problems arise if no good failure monitoring system exists.
![Page 49: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/49.jpg)
Major motivation
The authors are motivated by the previous problems to come up with a solution that best ensures the availability and responsiveness of large-scale internet services.
![Page 50: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/50.jpg)
Major contribution
The authors present an in-memory time series database (TSDB) called Gorilla, which on a per-second basis receives measurement data points (CPU load, error rate, latency) from distributed machines, stores them, and answers queries over them.
Challenge: High data insertion rate, total data quantity, real-time aggregation and reliability.
Rather than being stored individually, measurement data points are aggregated and then stored.
![Page 51: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/51.jpg)
Gorilla TSDB constraints:
Writes dominate
The system must always be able to take tens of millions of data points each second.
State transitions
We want to identify the issues that arise when new changes happen to the system, for example:
New software release
A network cut
Side effect of a configuration change.
High availability
If a failure causes disconnections between datacenters, systems operating at these datacenters must be able to write data to local TSDB machines.
Fault tolerance
The writes are replicated to multiple regions so that the data survives a datacenter failure.
![Page 58: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/58.jpg)
Traditional ACID guarantees are not a core requirement for TSDB.
The writes must succeed at all times, even in the face of disasters.
Recent data points are of higher value than older data points (knowing if a particular system is broken right now is more valuable to an operations engineer than knowing if it was broken an hour ago).
Challenge: speed of query processing, writes and reads.
Solution: Replacing the disk-based database with an in-memory database.
![Page 59: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/59.jpg)
Facts
In Spring 2015, Facebook’s monitoring system generated about 12 million data points per second.
12 million data points per second ≈ 1 trillion data points per day
Problem: 1 trillion data points × 16 bytes = 16 TB of RAM (too resource intensive).
Solution: Using XOR-based floating point compression, each data point was compressed from 16 bytes to an average of 1.37 bytes (a 12x reduction in size).
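A rough check of these figures, together with a small illustration of why XOR helps, is sketched below. It assumes an ingest rate of about 12 million points per second (the rate that is consistent with roughly 1 trillion points per day) and counts 1 TB as 10^12 bytes. The helper names are hypothetical, and the XOR part only shows that consecutive values often differ in few bits; it is not Gorilla's actual encoding.

```python
import struct

SECONDS_PER_DAY = 24 * 60 * 60


def memory_tb(points_per_second: float, bytes_per_point: float) -> float:
    """RAM needed to hold one day of points, in terabytes (10**12 bytes)."""
    return points_per_second * SECONDS_PER_DAY * bytes_per_point / 1e12


def xor_of_doubles(a: float, b: float) -> int:
    """XOR the 64-bit IEEE 754 bit patterns of two doubles."""
    (ia,) = struct.unpack(">Q", struct.pack(">d", a))
    (ib,) = struct.unpack(">Q", struct.pack(">d", b))
    return ia ^ ib


def meaningful_bits(x: int) -> int:
    """Width of the span from the highest to the lowest set bit (0 if x == 0)."""
    if x == 0:
        return 0
    trailing_zeros = (x & -x).bit_length() - 1
    return x.bit_length() - trailing_zeros


if __name__ == "__main__":
    print(f"uncompressed: {memory_tb(12e6, 16):.1f} TB per day")
    print(f"compressed:   {memory_tb(12e6, 1.37):.1f} TB per day")
    # Consecutive readings of a metric are often equal or close to each other.
    for prev, cur in [(12.0, 12.0), (12.0, 24.0), (15.5, 14.0625)]:
        bits = meaningful_bits(xor_of_doubles(prev, cur))
        print(f"{prev} xor {cur}: {bits} meaningful bits")
```

Identical consecutive values XOR to zero, and 12.0 versus 24.0 differ in a single bit, which suggests why storing only the short non-zero span of the XOR can take far fewer than 64 bits per value.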
![Page 60: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/60.jpg)
Compressed Spatial Hierarchical Bitmap (cSHB) Indexes for Efficiently Processing Spatial Range Query Workloads
Presented by: Shahab Helmi
Fall 2015
![Page 61: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/61.jpg)
Paper Info
Authors:
Publication:
VLDB 2015
Type:
Research Paper
![Page 62: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/62.jpg)
Motivation: Bitmap-Based Indexing
Bitmap indexes have been shown to be highly effective in answering queries in data warehouses and column-oriented data stores. Why?
1. Efficient implementations of the bitwise logical operations “AND”, “OR”, and “NOT”.
2. Significant opportunities for compression, enabling either reduced I/O or even complete in-memory maintenance of large index structures.
3. Query processors can operate directly on compressed bitmaps.
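The first point can be illustrated with a minimal, uncompressed bitmap index. This is only a sketch: it uses Python integers as bitmaps, the column data and function names are made up for illustration, and the compression step mentioned in points 2 and 3 is left out.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set iff values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """Decode a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two hypothetical columns of a six-row table.
colors = ["red", "blue", "red", "green", "blue", "red"]
sizes = ["S", "M", "L", "S", "S", "M"]

color_idx = build_bitmap_index(colors)
size_idx = build_bitmap_index(sizes)
all_rows = (1 << len(colors)) - 1  # mask with one bit per row

red_and_small = color_idx["red"] & size_idx["S"]        # AND
blue_or_green = color_idx["blue"] | color_idx["green"]  # OR
not_red = all_rows & ~color_idx["red"]                  # NOT, masked to the table

if __name__ == "__main__":
    print(rows_of(red_and_small), rows_of(blue_or_green), rows_of(not_red))
```

Each predicate is answered with one bitwise operation per pair of bitmaps, regardless of how the rows are laid out on disk.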
![Page 63: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/63.jpg)
Motivation: Quad-Tree Indexing
A quad-tree is a data structure that recursively divides a 2D region into more manageable parts. It resembles a binary tree, except that each internal node has four children instead of two (one per quadrant).
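A minimal point quad-tree might look like the sketch below. The class name, the leaf capacity of 2, and the depth limit are assumptions for illustration and are not taken from the paper; the example uses a square whose side is a power of two so that the quadrant boundaries are exact.

```python
class QuadTree:
    """Point quad-tree over the half-open square [x0, x0+size) x [y0, y0+size)."""

    def __init__(self, x0, y0, size, capacity=2, max_depth=16, depth=0):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity, self.max_depth, self.depth = capacity, max_depth, depth
        self.points = []      # points stored here while this node is a leaf
        self.children = None  # None for a leaf, else four sub-trees

    def insert(self, x, y):
        """Insert a point; return False if it lies outside this square."""
        inside = (self.x0 <= x < self.x0 + self.size
                  and self.y0 <= y < self.y0 + self.size)
        if inside:
            self._insert(x, y)
        return inside

    def _insert(self, x, y):
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append((x, y))
                return
            self._split()
        self._child_for(x, y)._insert(x, y)

    def _child_for(self, x, y):
        half = self.size / 2
        col = 1 if x >= self.x0 + half else 0
        row = 1 if y >= self.y0 + half else 0
        return self.children[2 * row + col]

    def _split(self):
        """Turn this leaf into an internal node with four quadrant children."""
        half = self.size / 2
        self.children = [
            QuadTree(self.x0 + col * half, self.y0 + row * half, half,
                     self.capacity, self.max_depth, self.depth + 1)
            for row in (0, 1) for col in (0, 1)
        ]
        old, self.points = self.points, []
        for px, py in old:
            self._child_for(px, py)._insert(px, py)

    def query(self, qx0, qy0, qx1, qy1):
        """Return stored points with qx0 <= x <= qx1 and qy0 <= y <= qy1."""
        if (qx1 < self.x0 or qx0 >= self.x0 + self.size
                or qy1 < self.y0 or qy0 >= self.y0 + self.size):
            return []  # the query rectangle does not overlap this quadrant
        if self.children is None:
            return [(x, y) for x, y in self.points
                    if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        found = []
        for child in self.children:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found


if __name__ == "__main__":
    tree = QuadTree(0, 0, 8)
    for p in [(1, 1), (2, 5), (6, 6), (7, 1), (3, 3)]:
        tree.insert(*p)
    print(sorted(tree.query(0, 0, 3, 3)))
```

A range query only descends into quadrants that overlap the query rectangle, which is the pruning behaviour that makes the hierarchy useful.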
![Page 64: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/64.jpg)
Introduction (1)
The key principle behind most indexing mechanisms is to ensure that data objects closer to each other in the data space are also closer to each other on the storage medium.
Total order in 1D space: easy
Total order in nD: complicated!
Common solution? Partitioning the space hierarchically (e.g., R-trees, KD-trees) in such a way that:
Nearby points fall into the same partition.
Point pairs that are far from each other fall into different partitions.
Alternative?
Mapping the multi-dimensional data to 1D and applying indexing and partitioning on the 1D data.
![Page 65: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/65.jpg)
Introduction (2)
Alternative?
Mapping the multi-dimensional data to 1D and applying indexing and partitioning on the 1D data such that:
Data objects closer to each other in the original space are also closer to each other in the 1D space.
Data objects further away from each other in the original space are also further away from each other in the 1D space.
How? Fractal-based space-filling curves. In particular, the Peano-Hilbert curve and the Z-order curve have been shown to be very effective in helping cluster nearby objects in the space.
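As a sketch of how a Z-order mapping can work, the code below interleaves the bits of integer cell coordinates. The function names and the choice to put the x bit in the more significant position of each bit pair are assumptions; other conventions are equally valid.

```python
def z_encode(x: int, y: int, bits: int) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)  # x bit goes to the odd position
        z |= ((y >> i) & 1) << (2 * i)      # y bit goes to the even position
    return z


def z_decode(z: int, bits: int) -> tuple:
    """Inverse of z_encode: recover (x, y) from a Z-order value."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i + 1)) & 1) << i
        y |= ((z >> (2 * i)) & 1) << i
    return x, y


if __name__ == "__main__":
    bits = 2  # a 4 x 4 grid
    for y in range(1 << bits):
        print(" ".join(format(z_encode(x, y, bits), "04b") for x in range(1 << bits)))
```

On the 4 x 4 grid, the four cells of each 2 x 2 quadrant receive four consecutive Z-values that share a common prefix, so nearby cells tend to be close in 1D. The locality is approximate: neighbouring cells on opposite sides of a quadrant boundary, such as (1, 1) and (2, 1), map to 3 and 9.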
![Page 66: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/66.jpg)
Contribution
Shows that bitmap-based indexing is also an effective solution for managing spatial data sets.
Proposes compressed spatial hierarchical bitmap (cSHB) indexes to support spatial range queries.
Converts the given 2D space into a 1D space using Z-order traversal.
For spatial query processing:
Develops a cost model.
Chooses the best nodes for query processing according to the cost model.
![Page 67: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/67.jpg)
Contribution (2)
Contains the following 1D ranges (000010, 000011, 001000, 001001, 001010, 001011)
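The figure for this slide is not available here, so the following is only a guess at how the six values listed above could arise. It assumes an 8 x 8 grid (3 bits per axis), a query rectangle covering cells x in [1, 3] and y in [0, 1], and a bit interleaving with x in the more significant position of each pair. These assumptions were chosen because they reproduce the listed values; the paper's figure may use a different orientation. All function names are hypothetical.

```python
def z_encode(x: int, y: int, bits: int) -> int:
    """Interleave bits of x and y (x in the more significant bit of each pair)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def z_values_in_rect(x_lo, x_hi, y_lo, y_hi, bits):
    """Sorted Z-order values of all grid cells inside an inclusive rectangle."""
    return sorted(
        z_encode(x, y, bits)
        for x in range(x_lo, x_hi + 1)
        for y in range(y_lo, y_hi + 1)
    )


def contiguous_runs(values):
    """Collapse a sorted list of ints into inclusive (start, end) runs."""
    runs = []
    for v in values:
        if runs and v == runs[-1][1] + 1:
            runs[-1][1] = v
        else:
            runs.append([v, v])
    return [tuple(r) for r in runs]


cells = z_values_in_rect(1, 3, 0, 1, bits=3)
labels = [format(z, "06b") for z in cells]
runs = contiguous_runs(cells)

if __name__ == "__main__":
    print(labels)
    print(runs)
```

Under these assumptions the 2D rectangle becomes two contiguous 1D intervals, 000010 to 000011 and 001000 to 001011, which is the kind of range a 1D index can then look up.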
![Page 68: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/68.jpg)
Related Work
Multi-Dimensional Space Partitioning
Quad-tree, BD-tree, G-Tree, and KD-tree.
R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree, and others).
Space Filling Curve based Indexing
Peano-Hilbert curve: better mapping but costly.
Z-order curve: efficient (used in this paper).
Bitmap Indexes
![Page 69: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/69.jpg)
Experimental Results
Datasets:
100 million synthetically generated data points ranging from <−180, −90> to <180, 90>.
A clustered data set from Gowalla, which contains the locations of check-ins made by users.
A clustered data set from OpenStreetMap (OSM).
![Page 70: Presented by: Zohreh Fall 2015. Nandish Jayaram University of Texas at Arlington Sidharth Goyal University of Texas at Arlington Chengkai Li](https://reader035.vdocuments.mx/reader035/viewer/2022062804/5697bf8d1a28abf838c8c437/html5/thumbnails/70.jpg)
Experimental Results (2)