large-scale distributed storage system for business provenance - cloud 2011

Building a Smarter Planet

Large-Scale Distributed Storage System for Business Provenance

CLOUD 2011

Szabolcs Rozsnyai, Aleksander Slominski, Yurdaer Doganata


Agenda

• Introduction and Motivation
• Provenance in the Cloud
  – System Overview
  – Data Model
  – Data Integration
  – Indexing
  – Querying


Introduction and Motivation 1/2

• What is Business Provenance (BP)?
  – Monitors enterprise systems to track a consistent and accurate history of processes, including the involved artifacts, events and relationships
  – BP enables a comprehensive and complete insight into the processes
  – BP helps to discover the functional, organizational, data and resource aspects of a business

• Challenges
  – Volume and complexity make tracking and processing a difficult and resource-intensive task
  – As data grows at a very high rate, tracking arbitrary artifacts for provenance purposes within large organizations is very costly
  – Storing, organizing, retrieving and analyzing the artifacts requires allocating a large amount of computing resources


Introduction and Motivation 2/2

• Challenges cont.
  – With current technologies (RDBMS), trade-offs need to be made between the amount of captured data and the granularity levels
  – Aggregation is a solution, but it lacks the data that would enable drill-down actions in the context of root-cause analysis (e.g. missing low-level events)
  – Selecting artifact types and a granularity level already implies that certain knowledge is available for the analytics phase
    • Might be good enough to satisfy legal requirements or certain compliance applications
  – In general, leaving out data reduces the opportunities for better insight and the possibility of gaining new knowledge about the process


What is the problem with traditional technologies?

Data Warehousing

– Data warehouses lack process knowledge and thus hinder operational insights.

Complex Event Processing (post-event analysis)

– Events can be branched off and stored from continuous streams

– The relationships (i.e. correlations) are preserved

– The information allows insights about the business processes to be derived

Based on RDBMS, for which storing and querying large (web-scale) amounts of data is costly

Scaling an RDBMS (storage and performance) requires high investments due to specialized hardware and license costs.

The potential benefits do not necessarily justify the investments.


Why cloud-based storage?

• Provides the illusion of infinite computing and data storage resources

• Organizations can increase resources with increasing demands

• Eliminates large up-front commitments (low and incremental costs)

• Can satisfy short-term requirements
  – Cover peak loads
  – Hot deployment of resources

• Simpler and faster maintenance

Characteristics | Benefits
Scalability | Supports huge datasets and high request rates based on a large number of commodity servers
Elasticity | Resource capacities can be modified on demand and in a timely manner by adding, changing or removing instances
Availability | Provides a high level of availability, seamless failover and recovery handling across heterogeneous commodity hardware landscapes

Cloud-based storage systems sacrifice the complex query capabilities and sophisticated transaction models found in traditional systems


Provenance System Overview


HBase Data Model

Characteristics

• Tables don’t have a defined schema (i.e. each row of a table can have different attributes)
• Columns are grouped by column families
• Rows are sorted by row key, and each stored value carries a timestamp
• Everything except the table name is stored as byte[]
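To make the data model more concrete, here is a minimal in-memory sketch in Java. It is not the HBase client API: the class `SketchTable` and its methods are hypothetical names introduced for illustration, row keys are kept as strings for readability, and timestamps/versions are left out. It only shows that rows are kept in key order, that columns are addressed as family plus qualifier, and that two rows need not share the same columns.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for one schema-less, HBase-style table.
// It only illustrates the data model; it is not the HBase client API.
class SketchTable {
    // row key -> ("family:qualifier" -> value bytes); TreeMap keeps row keys sorted.
    private final TreeMap<String, TreeMap<String, byte[]>> rows = new TreeMap<>();

    // Store one cell; rows may carry completely different columns.
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .put(family + ":" + qualifier, value.getBytes(StandardCharsets.UTF_8));
    }

    // Read one cell back as a string, or null if the row or column is absent.
    String get(String rowKey, String family, String qualifier) {
        Map<String, byte[]> row = rows.get(rowKey);
        if (row == null) return null;
        byte[] raw = row.get(family + ":" + qualifier);
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    // Row keys in sorted order, as a scan would return them.
    Iterable<String> rowKeys() {
        return rows.keySet();
    }
}
```

Because no schema is declared, adding a new attribute is just another `put` with a new qualifier.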


Data Integration

The schema-less structure makes it easy to “dump” everything into the data storage, following a LET (Load, Extract, Transform) paradigm in contrast to classical ETL approaches

Get all data independent of its source and type
• You might never know what data you want to analyze at a later point in time
• There is no need to compromise here, as the storage is relatively cheap
• The space is available
• The performance is preserved through horizontal scaling of the data


Data Indexing

• Create an inverted index for the extracted property
  – IndexTableName: Attributename
  – Key: Value + KeyOfRow (John$$b2f59d10-903d-…)
  – Value: dummy (not used)

• The reference to KeyOfRow from the indexed table is encoded into the key of the index table so that range scans can be performed. Otherwise the columns would grow extremely large

[Figure: processing pipeline with the stages Common Alias, Store RAW data, Inverted Index, Composite Index]

Example inverted index:

Key | Value
Szabolcs_Ref1 |
Szabolcs_Ref2 |
Alek_Ref3 |
… | …
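The inverted index described on this slide can be sketched as follows. This is an illustration under assumptions, not the presented implementation: a `TreeMap` stands in for a key-value table with sorted keys, the class and method names are hypothetical, and `$$` is used as the separator because it appears in the example key. The sketch also assumes that attribute values never contain the separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: one inverted-index table for one attribute.
// A TreeMap stands in for a key-value table that stores keys in sorted order.
class InvertedIndexSketch {
    static final String SEP = "$$";          // separator seen in the example key
    static final byte[] DUMMY = new byte[0]; // the value column is not used

    private final TreeMap<String, byte[]> index = new TreeMap<>();

    // One write per indexed attribute value: key = value + SEP + rowKey.
    void add(String attributeValue, String rowKey) {
        index.put(attributeValue + SEP + rowKey, DUMMY);
    }

    // Prefix range scan: walk the sorted keys starting at "value$$" and
    // stop at the first key that no longer carries that prefix.
    List<String> lookup(String attributeValue) {
        String prefix = attributeValue + SEP;
        List<String> rowKeys = new ArrayList<>();
        for (String key : index.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) break;
            rowKeys.add(key.substring(prefix.length()));
        }
        return rowKeys;
    }
}
```

Putting the row reference into the key keeps every index entry small: many rows with the same attribute value become many short keys rather than one ever-growing column.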


Composite Indexing

• Composite indexes optimize for fast querying
• Example:
  – Search for firstname and lastname
  – Composite Index
    • Tablename: AttributenameA + AttributenameB
    • Key: firstname + lastname + Ref1
    • Value: dummy

[Figure: processing pipeline with the stages Common Alias, Store RAW data, Inverted Index, Composite Index]

Example composite index:

Key | Value
Szabolcs$Rozsnyai_Ref1 |
Szabolcs$Rozsnyai_Ref2 |
Alek$Slominski_Ref3 |
… | …
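A composite index over two attributes can be sketched in the same spirit. The key layout `firstname$lastname_rowKey` is an assumption read off the example rows on this slide, the class and method names are hypothetical, and a `TreeMap` again stands in for a sorted key-value table. The sketch assumes the separator characters do not occur in the attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of a composite index over two attributes.
class CompositeIndexSketch {
    private final TreeMap<String, byte[]> index = new TreeMap<>();

    // Assumed key layout: firstname$lastname_rowKey; the value is a dummy.
    void add(String firstName, String lastName, String rowKey) {
        index.put(firstName + "$" + lastName + "_" + rowKey, new byte[0]);
    }

    // An AND over both attributes becomes a single prefix scan.
    List<String> lookup(String firstName, String lastName) {
        String prefix = firstName + "$" + lastName + "_";
        List<String> rowKeys = new ArrayList<>();
        for (String key : index.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) break;
            rowKeys.add(key.substring(prefix.length()));
        }
        return rowKeys;
    }
}
```

The trade-off is an extra write per record for each composite index in exchange for answering the combined query with one scan.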


Querying

Bad News

• Simple key lookups to retrieve values are easy to realize, but there is no declarative query language or any means to express more sophisticated constructs such as joins

• No optimizations on declarative queries

• Queries often require set operations (such as intersections)

• There is no out-of-the-box facility or algorithm that deals with, for instance, efficient memory usage

• SQL query algorithms need to be “re-implemented”

Good News

• Applications (such as Provenance) have a well-defined set of (parameterized) queries

• Most key-value store implementations store keys in sorted order and support range scans on keys with paging (prefix, start key, page size, …)

• Otherwise we would need to break the list into pieces (e.g. a file-inode-like structure)
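The range scan with paging mentioned above can be illustrated with a small Java sketch. It assumes a `TreeMap` as a stand-in for a store with sorted keys; `PagedScanSketch` and `scanPage` are hypothetical names, and the convention of passing the last key of the previous page as `startKey` is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of a paged prefix scan over sorted keys.
class PagedScanSketch {
    // Return up to pageSize keys that start with prefix, beginning strictly
    // after startKey (pass null to fetch the first page).
    static List<String> scanPage(TreeMap<String, byte[]> table, String prefix,
                                 String startKey, int pageSize) {
        List<String> page = new ArrayList<>();
        Iterable<String> keys = (startKey == null)
                ? table.tailMap(prefix, true).keySet()
                : table.tailMap(startKey, false).keySet();
        for (String key : keys) {
            // Stop when the prefix no longer matches or the page is full.
            if (!key.startsWith(prefix) || page.size() == pageSize) break;
            page.add(key);
        }
        return page;
    }
}
```

A caller would keep requesting pages, each time passing the last key it received, until an empty page comes back. Only one page of keys is held in memory at a time.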


Querying

• Simple Queries:
  – Search By Attribute
  – Boolean Search
• Filter Query
• Traversing graphs of Relationships


Querying

Search By Attribute

• Returns all rows where the specified attribute matches the given search value

List<PStoreRecord> recordRetList = recordDAO.searchByAttribute("person:firstname", "Szabolcs");

Boolean Search

• Allows search value lookups to be combined

• If a composite index has been defined for the three attributes in the example, the implementation has to perform only one lookup in the index

// create searchTerms
HashMap<String, String> searchTerms = new HashMap<String, String>();

searchTerms.put("person:firstName", "Michael1");
searchTerms.put("person:lastName", "Smith1");
searchTerms.put("person:userId", "msmith1");

List<PStoreRecord> resultRecordList = recordDAO.searchBooleanOperator(searchTerms, HBasePStoreRecordDAO.AND_OPERATOR);
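When no composite index covers the search terms, a boolean search has to combine the row references returned by the individual attribute lookups. The following sketch shows one plausible way to do that with plain set operations. `BooleanSearchSketch` is a hypothetical helper, not part of the `recordDAO` API shown above, and the `or` method is an assumption, since the slides only show the AND operator.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper that combines per-attribute index hits (row keys).
class BooleanSearchSketch {
    // AND: keep only row keys present in every per-attribute result.
    static Set<String> and(List<Set<String>> perAttributeHits) {
        if (perAttributeHits.isEmpty()) return new LinkedHashSet<>();
        Set<String> result = new LinkedHashSet<>(perAttributeHits.get(0));
        for (Set<String> hits : perAttributeHits.subList(1, perAttributeHits.size())) {
            result.retainAll(hits);
        }
        return result;
    }

    // OR (assumed operator): union of all per-attribute results.
    static Set<String> or(List<Set<String>> perAttributeHits) {
        Set<String> result = new LinkedHashSet<>();
        for (Set<String> hits : perAttributeHits) {
            result.addAll(hits);
        }
        return result;
    }
}
```

Only the row keys that survive the set operation need to be read from the data table afterwards.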


Filter Operator

• Performs joins over several “relations”; can be used to represent (correlation) rules from the Provenance
• Example:

WHERE OrderReceived.userId = "srozsnyai213"
  AND OrderReceived.orderId = ShipmentCreated.orderId
  AND ShipmentCreated.shipmentId = TransportStarted.shipmentId
  AND TransportStarted.TransportId = TransportEnded.TransportId
  AND OrderReceived.Type = "OrderReceived"
  AND ShipmentCreated.Type = "ShipmentCreated"
  AND TransportStarted.Type = "TransportStarted"
  AND TransportEnded.Type = "TransportEnded"
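One way to picture how such a filter could be evaluated without a join engine is to chain lookups: start with the events matching the constant condition, then follow each equality condition to the next event type. The sketch below is an assumption-laden illustration. `FilterJoinSketch` and its methods are hypothetical, events are modeled as plain string maps, and the linear scan in `search` merely stands in for an index lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: evaluating a chain of equality joins by repeated lookups.
class FilterJoinSketch {
    // Convenience builder: event("Type", "OrderReceived", "orderId", "o1").
    static Map<String, String> event(String... kv) {
        Map<String, String> e = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) e.put(kv[i], kv[i + 1]);
        return e;
    }

    // Stand-in for Search By Attribute: events of a type whose attribute equals value.
    static List<Map<String, String>> search(List<Map<String, String>> store,
                                            String type, String attr, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> e : store) {
            if (type.equals(e.get("Type")) && value.equals(e.get(attr))) hits.add(e);
        }
        return hits;
    }

    // One join step: for every event on the left, look up events of rightType
    // that share the same value of joinAttr.
    static List<Map<String, String>> join(List<Map<String, String>> store,
                                          List<Map<String, String>> left,
                                          String joinAttr, String rightType) {
        List<Map<String, String>> right = new ArrayList<>();
        for (Map<String, String> l : left) {
            String v = l.get(joinAttr);
            if (v != null) right.addAll(search(store, rightType, joinAttr, v));
        }
        return right;
    }
}
```

For the example filter, the chain would be `search` on `OrderReceived.userId`, followed by `join` on `orderId`, then `shipmentId`, then `TransportId`.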


Some Evaluation

Operation | No of Operations | Type of Operation
Inserting Record | 1 per record | Write
Inverted Indexing | 1 per attribute per record | Write
Composite Indexing | 1 per attribute per record | Write
Search By Attribute | 1 per search | Scan with prefix filter
 | 1 per reference retrieved from index | Read
Boolean Search w. Composite Index | 1 per search | Scan with prefix filter
 | 1 per reference retrieved from index | Read
Boolean Search w.o. Composite Index | 1 per sub-expression connected with a boolean operator in a search | Scan with prefix filter
 | 1 per reference retrieved from index for one expression | Read
Filter Query | For a sub-expression with a join, Boolean Search is executed, and for the rest a Search By Attribute |

Process simulator for an export compliance regulation use case

A wide range of heterogeneous systems (Order Management, Document Management, E-Mail, Export Violation Detection Services, …) as well as workflow-supported, human-driven interactions (Process Management System). All of these systems generate a wide range of events at different granularity levels, which allows us to create a comprehensive graph of relationships.


Future and Ongoing Work

• (Distributed) Business Process Analytics
  – Correlation Discovery
  – Process Mining
  – Predictive Analytics
  – Improve query expressiveness (Hive, Pig, …)
