Building a Smarter Planet
Large-Scale Distributed Storage System for Business Provenance
CLOUD 2011
Szabolcs Rozsnyai, Aleksander Slominski, Yurdaer Doganata
Agenda
• Introduction and Motivation
• Provenance in the Cloud
  – System Overview
  – Data Model
  – Data Integration
  – Indexing
  – Querying
Introduction and Motivation 1/2
• What is Business Provenance (BP)?
  – Monitors enterprise systems to track a consistent and accurate history of processes, including the involved artifacts, events and relationships
  – BP enables comprehensive and complete insight into the processes
  – BP helps to discover the functional, organizational, data and resource aspects of a business
• Challenges
  – The volume and complexity of the data make tracking and processing a difficult and resource-intensive task
  – As data grows at a very high rate, tracking arbitrary artifacts for provenance purposes within large organizations is very costly
  – Storing, organizing, retrieving and analyzing the artifacts requires allocating a large amount of computing resources
Introduction and Motivation 2/2
• Challenges cont.
  – With current technologies (RDBMS), trade-offs need to be made between the amount of captured data and the granularity levels
  – Aggregation is one solution, but it discards the data that would enable drill-down actions in the context of root-cause analysis (e.g. missing low-level events)
  – Choosing the artifact types and the granularity level in advance already implies that certain knowledge is available for the analytics phase
    • Might be good enough to satisfy legal requirements or certain compliance applications
  – In general, leaving out data reduces the opportunities for better insight and the possibility of gaining new knowledge about the process
What is the problem with traditional technologies?
Data Warehousing
– Lacks process knowledge and thus hinders operational insight
Complex Event Processing (post-event analysis)
– Events can be branched off from continuous streams and stored
– The relationships (i.e. correlations) are preserved
– The information makes it possible to derive insights about the business processes
Based on RDBMS, for which storing and querying large (web-scale) amounts of data is costly
Scaling an RDBMS (storage and performance) requires high investments due to specialized hardware and license costs.
The potential benefits do not necessarily justify the investments.
Why cloud-based storage?
• Provides the illusion of infinite computing and data storage resources
• Organizations can increase resources with increasing demands
• Eliminates large up-front commitments (low and incremental costs)
• Can satisfy short-term requirements
  • Cover peak loads
  • Hot deployment of resources
• Simpler and faster maintenance
Characteristics and Benefits
• Scalability: Supports huge datasets and high request rates based on a large number of commodity servers
• Elasticity: Resource capacities can be modified on demand and in a timely manner by adding, changing or removing instances
• Availability: Provides a high level of availability, seamless failover and recovery handling across heterogeneous commodity hardware landscapes
Cloud-based storage systems sacrifice the complex query capabilities and sophisticated transaction models found in traditional systems
Provenance System Overview
HBase Data Model
Characteristics
• Tables don’t have a defined schema (i.e. each row of a table can have different attributes)
• Columns are grouped by column families
• Rows are stored sorted by row key, and each cell value carries a timestamp
• Everything except the table name is stored as byte[]
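The characteristics above can be pictured as nested sorted maps. The following is a minimal sketch in plain Java, not the HBase client API: the class `SketchTable` and its methods `put`, `getLatest` and `columnCount` are hypothetical names chosen for illustration, and storing the column as a "family:qualifier" string is a simplifying assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of an HBase-style table (illustration only).
final class SketchTable {
    // Row keys are byte[] and kept in lexicographic (unsigned) byte order.
    private static final Comparator<byte[]> BYTES = Arrays::compareUnsigned;

    // row key -> "family:qualifier" -> timestamp -> value (everything stored as byte[])
    private final NavigableMap<byte[], NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>(BYTES);

    // No schema: any row may receive any column at any time.
    void put(String rowKey, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(bytes(rowKey), k -> new TreeMap<>())
            .computeIfAbsent(family + ":" + qualifier, k -> new TreeMap<>())
            .put(timestamp, bytes(value));
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String getLatest(String rowKey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<Long, byte[]>> row = rows.get(bytes(rowKey));
        if (row == null) return null;
        NavigableMap<Long, byte[]> versions = row.get(family + ":" + qualifier);
        if (versions == null) return null;
        return new String(versions.lastEntry().getValue(), StandardCharsets.UTF_8);
    }

    // Number of distinct columns stored for one row (0 if the row is unknown).
    int columnCount(String rowKey) {
        NavigableMap<String, NavigableMap<Long, byte[]>> row = rows.get(bytes(rowKey));
        return row == null ? 0 : row.size();
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Two rows of the same table can hold entirely different columns, and writing a cell twice with different timestamps keeps both versions, with `getLatest` returning the newer one.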
Data Integration
The schema-less structure makes it easy to “dump” everything into the data store following an LET (Load, Extract, Transform) paradigm, in contrast to classical ETL approaches
Get all data independent of its source and type
• You may not know in advance which data you will want to analyze at a later point in time
• There is no need to make a compromise here as the storage is relatively cheap
  • The space is available
  • The performance is preserved through horizontal scaling of the data
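The load-first idea can be sketched as follows: the raw payload is stored untouched, and attributes are extracted only when someone asks for them. This is an illustration under stated assumptions, not the system's implementation: the class `LetSketch`, its methods, and the `name=value;name=value` payload format are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of Load-Extract-Transform: store first, interpret later.
final class LetSketch {
    // Raw store: row key -> payload bytes, with no schema imposed at load time.
    private final Map<String, byte[]> rawStore = new LinkedHashMap<>();

    // Load: keep the payload exactly as received and return its generated row key.
    String load(String payload) {
        String rowKey = UUID.randomUUID().toString();
        rawStore.put(rowKey, payload.getBytes(StandardCharsets.UTF_8));
        return rowKey;
    }

    // Extract and transform: parse "name=value;name=value" on demand.
    // The payload format is an assumption made for this example.
    Map<String, String> extract(String rowKey) {
        Map<String, String> attributes = new LinkedHashMap<>();
        byte[] raw = rawStore.get(rowKey);
        if (raw == null) return attributes;
        for (String pair : new String(raw, StandardCharsets.UTF_8).split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attributes.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return attributes;
    }
}
```

Because nothing is discarded at load time, an attribute that nobody planned to analyze can still be extracted later from the stored payload.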
Data Indexing
• Create an inverted index for the extracted property
  • IndexTableName: Attributename
  • Key: Value + KeyOfRow (John$$b2f59d10-903d-…)
  • Value: dummy (not used)
• The reference to KeyOfRow from the indexed table is encoded into the key of the index table in order to be able to perform range scans. Otherwise the columns would grow extremely large
[Figure: pipeline stages Common Alias, Store RAW data, Inverted Index, Composite Index]
Key             Value
Szabolcs_Ref1
Szabolcs_Ref2
Alek_Ref3
…               …
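A minimal sketch of this key layout, assuming a sorted in-memory set stands in for each index table: the indexed value and the row reference are concatenated into the key, so a lookup becomes a range scan over one key prefix. The class `InvertedIndexSketch` and its method names are hypothetical; the `$$` separator follows the key example on the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical inverted index: one sorted "table" of keys per attribute name.
final class InvertedIndexSketch {
    static final String SEP = "$$"; // separator between value and row key

    // attribute name -> sorted index keys (the value column is a dummy, so it is omitted)
    private final Map<String, NavigableSet<String>> indexTables = new HashMap<>();

    // One index write per attribute per record.
    void index(String attribute, String value, String rowKey) {
        indexTables.computeIfAbsent(attribute, a -> new TreeSet<>()).add(value + SEP + rowKey);
    }

    // Range scan over all keys that start with "value$$"; returns the row keys.
    List<String> lookup(String attribute, String value) {
        List<String> rowKeys = new ArrayList<>();
        NavigableSet<String> table = indexTables.get(attribute);
        if (table == null) return rowKeys;
        String prefix = value + SEP;
        for (String key : table.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) break; // left the prefix range, stop scanning
            rowKeys.add(key.substring(prefix.length()));
        }
        return rowKeys;
    }
}
```

Since every match is its own small key rather than one more column in a single wide row, the index grows in rows and no row becomes unboundedly large. The sketch does not handle values that themselves contain the separator.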
Composite Indexing
• Composite indexes allow optimizing for fast querying
• Example:
  – Search for firstname and lastname
  – Composite Index
    • Tablename: AttributenameA + AttributenameB
    • Key: firstname + lastname + Ref1
    • Value: dummy
Key                       Value
Szabolcs$Rozsnyai_Ref1
Szabolcs$Rozsnyai_Ref2
Alek$Slominski_Ref3
…                         …
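The composite key layout can be sketched in the same way. The example below assumes the separators visible in the table above (`$` between the two values, `_` before the reference); the class `CompositeIndexSketch` and its methods are hypothetical. It also illustrates a general property of concatenated keys: a prefix scan on the leading attribute alone still works, while a search on the trailing attribute alone does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical composite index over (firstname, lastname).
final class CompositeIndexSketch {
    // Sorted keys of the form firstname$lastname_ref; the value column is a dummy.
    private final NavigableSet<String> keys = new TreeSet<>();

    void index(String firstname, String lastname, String ref) {
        keys.add(firstname + "$" + lastname + "_" + ref);
    }

    // Both attributes known: a single prefix scan answers the conjunction.
    List<String> byFullName(String firstname, String lastname) {
        return scan(firstname + "$" + lastname + "_");
    }

    // Only the leading attribute known: a shorter prefix still selects a contiguous range.
    List<String> byFirstname(String firstname) {
        return scan(firstname + "$");
    }

    // Returns all index keys that start with the given prefix, in sorted order.
    private List<String> scan(String prefix) {
        List<String> matches = new ArrayList<>();
        for (String key : keys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) break;
            matches.add(key);
        }
        return matches;
    }
}
```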
Querying
Bad News
• Simple key lookups to retrieve values are easy to realize, but there is no declarative query language or any means to express more sophisticated constructs such as joins
• No optimizations on declarative queries
• Queries often require set operations (such as intersections)
• There is no out-of-the-box facility or algorithm that deals with, for instance, efficient memory usage
• SQL query algorithms need to be “re-implemented”
Good News
• Applications (such as Provenance) have a well-defined set of (parameterized) queries
• Most key-store implementations store keys in sorted order and support range scans on keys with paging (prefix, startkey, page size, …)
• Otherwise we would need to break the list into pieces (e.g. a file inode-like structure)
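A range scan with paging over sorted keys might look like the sketch below, with a `NavigableSet` standing in for the key store. The class `PagedScanSketch`, the method `scanPage` and its parameter names are hypothetical; the convention that `startKey` is the last key of the previous page is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

// Hypothetical paged prefix scan over a sorted key set.
final class PagedScanSketch {
    // Returns up to pageSize keys that start with prefix.
    // startKey == null means "first page"; otherwise the scan resumes
    // strictly after startKey (the last key of the previous page).
    static List<String> scanPage(NavigableSet<String> sortedKeys, String prefix,
                                 String startKey, int pageSize) {
        boolean resume = startKey != null && startKey.compareTo(prefix) >= 0;
        NavigableSet<String> tail = resume
                ? sortedKeys.tailSet(startKey, false)
                : sortedKeys.tailSet(prefix, true);
        List<String> page = new ArrayList<>();
        for (String key : tail) {
            // Stop when the page is full or the scan has left the prefix range.
            if (page.size() >= pageSize || !key.startsWith(prefix)) break;
            page.add(key);
        }
        return page;
    }
}
```

A caller would request the first page with `startKey` set to `null` and pass the last key of each returned page into the next call until an empty page comes back, so only one page of references is held in memory at a time.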
Querying
• Simple Queries:
  • Search By Attribute
  • Boolean Search
• Filter Query
• Traversing graphs of Relationships
Querying
Search By Attribute
• Returns all rows where the specified attribute matches the given search value
List<PStoreRecord> recordRetList = recordDAO.searchByAttribute("person:firstname", "Szabolcs");
Boolean Search
• Allows search value lookups to be combined
• If a composite index has been defined for the three attributes in the example, the implementation has to perform only one lookup in the index
// create searchTerms
HashMap<String, String> searchTerms = new HashMap<String, String>();
searchTerms.put("person:firstName", "Michael1");
searchTerms.put("person:lastName", "Smith1");
searchTerms.put("person:userId", "msmith1");
// combine all search terms with AND
List<PStoreRecord> resultRecordList = recordDAO.searchBooleanOperator(searchTerms, HBasePStoreRecordDAO.AND_OPERATOR);
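When no composite index exists, an AND search plausibly reduces to one index scan per search term followed by an intersection of the returned row references. The sketch below shows only that intersection step and is not the `HBasePStoreRecordDAO` implementation; the class `BooleanSearchSketch` and the method `and` are hypothetical, and starting from the smallest reference list is a design choice of this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical AND combination of per-term index results.
final class BooleanSearchSketch {
    // Each inner list holds the row references returned by one index scan.
    static Set<String> and(List<List<String>> refsPerTerm) {
        if (refsPerTerm.isEmpty()) return new LinkedHashSet<>();
        // Start from the smallest list so the working set never grows beyond it.
        List<List<String>> ordered = new ArrayList<>(refsPerTerm);
        ordered.sort((a, b) -> Integer.compare(a.size(), b.size()));
        Set<String> result = new LinkedHashSet<>(ordered.get(0));
        for (List<String> refs : ordered.subList(1, ordered.size())) {
            result.retainAll(new HashSet<>(refs)); // set intersection
            if (result.isEmpty()) break;           // nothing can match any more
        }
        return result;
    }
}
```

Only the references that survive the intersection need to be read from the record table, which keeps the number of reads tied to the final result size rather than to the size of each individual index scan.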
Filter Operator
• Performs joins over several “relations”; can be used to represent (correlation) rules from the Provenance
• Example:
WHERE OrderReceived.userId = "srozsnyai213"
  AND OrderReceived.orderId = ShipmentCreated.orderId
  AND ShipmentCreated.shipmentId = TransportStarted.shipmentId
  AND TransportStarted.TransportId = TransportEnded.TransportId
  AND OrderReceived.Type = "OrderReceived"
  AND ShipmentCreated.Type = "ShipmentCreated"
  AND TransportStarted.Type = "TransportStarted"
  AND TransportEnded.Type = "TransportEnded"
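One way to read the example is as a chain of equi-joins: select the starting events, then repeatedly follow a shared attribute to events of the next type. The sketch below evaluates such a chain over an in-memory event list. It is an illustration only: the class `FilterJoinSketch`, the methods `evaluate` and `select`, and the representation of events as string maps are hypothetical, and `select` merely stands in for what would be an index-backed search in the actual store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical evaluation of a chained filter query over events.
final class FilterJoinSketch {
    // steps[i] = {eventType, joinAttribute}: the next event must have the given type
    // and the same joinAttribute value as the last event of the partial chain.
    static List<List<Map<String, String>>> evaluate(
            List<Map<String, String>> events,
            String startType, String startAttribute, String startValue,
            String[][] steps) {
        List<List<Map<String, String>>> chains = new ArrayList<>();
        for (Map<String, String> e : select(events, startType, startAttribute, startValue)) {
            List<Map<String, String>> chain = new ArrayList<>();
            chain.add(e);
            chains.add(chain);
        }
        for (String[] step : steps) {
            String type = step[0];
            String attribute = step[1];
            List<List<Map<String, String>>> extended = new ArrayList<>();
            for (List<Map<String, String>> partial : chains) {
                String value = partial.get(partial.size() - 1).get(attribute);
                if (value == null) continue; // join attribute missing, chain ends here
                for (Map<String, String> next : select(events, type, attribute, value)) {
                    List<Map<String, String>> longer = new ArrayList<>(partial);
                    longer.add(next);
                    extended.add(longer);
                }
            }
            chains = extended;
        }
        return chains;
    }

    // Stand-in for a search "Type = type AND attribute = value".
    static List<Map<String, String>> select(List<Map<String, String>> events,
                                            String type, String attribute, String value) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> e : events) {
            if (type.equals(e.get("Type")) && value.equals(e.get(attribute))) out.add(e);
        }
        return out;
    }
}
```

For the query above, the start would be `OrderReceived` with `userId`, and the steps would be `{"ShipmentCreated", "orderId"}`, `{"TransportStarted", "shipmentId"}` and `{"TransportEnded", "TransportId"}`; each returned chain is one complete order-to-delivery path.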
Some Evaluation
Operation | No of Operations | Type of Operation
Inserting Record | 1 per record | Write
Inverted Indexing | 1 per attribute per record | Write
Composite Indexing | 1 per attribute per record | Write
Search By Attribute | 1 per search | Scan with prefix filter
 | 1 per reference retrieved from index | Read
Boolean Search w. Composite Index | 1 per search | Scan with prefix filter
 | 1 per reference retrieved from index | Read
Boolean Search w.o. Composite Index | 1 per sub-expression connected with a boolean operator in a search | Scan with prefix filter
 | 1 per reference retrieved from index for one expression | Read
Filter Query | For a sub-expression with a join, a Boolean Search is executed, and for the rest a Search By Attribute |
Process simulator for an export compliance regulations use case
The use case involves a wide range of heterogeneous systems (Order Management, Document Management, E-Mail, Export Violation Detection Services, …) as well as workflow-supported, human-driven interactions (Process Management System). All of those systems generate a wide range of events at different granularity levels, which allows us to create a comprehensive graph of relationships.
Future and Ongoing Work
• (Distributed) Business Process Analytics
  – Correlation Discovery
  – Process Mining
  – Predictive Analytics
  – Improve query expressiveness (Hive, Pig, …)