
© Copyright 2016 Esgyn Corporation. The information contained herein is subject to change without notice. January 2016 Page 1 of 14

EsgynDB ENTERPRISE CLASS OPERATIONAL SQL-ON-HADOOP

EsgynDB is built on Apache Trafodion (incubating), an open source initiative to deliver an enterprise-class SQL-on-Hadoop DBMS engine that specifically targets transaction-protected operational workloads. Trafodion combines Apache HBase with transactional SQL technologies that leverage more than 20 years of development investment in database technology and solutions.

Introducing EsgynDB

EsgynDB is based on Trafodion, an open source initiative to develop an enterprise-class SQL-on-Hadoop DBMS engine that specifically targets big data transactional and operational workloads. Transactional SQL encompasses OLTP (online transaction processing) workloads that support traditional enterprise-level transactional applications (ERP, CRM, etc.) and enterprise business processes: applications essential for the day-to-day operation of the company. Additionally, transactions have evolved to include social and mobile data interactions and observations, using a mixture of structured and semi-structured data.

EsgynDB overview

In this paper, database capabilities are attributed to Trafodion, with EsgynDB-specific capabilities called out explicitly. Here is an overview of those capabilities:

• Delivers a comprehensive, full-functioned SQL DBMS that allows companies to reuse and leverage existing SQL skills to improve developer productivity.

• Extends Apache HBase by adding support for ACID (atomic, consistent, isolated, and durable) transaction protection that guarantees data consistency across multiple rows, tables, and SQL statements.

• Supports full active-active distributed transactions across data centers to scale read/write workloads, with zero lost transactions for disaster recovery. (EsgynDB only)

• Includes many optimizations for low-latency read and write transactions in support of the high concurrency and fast response time requirements of transactional SQL workloads.

• Includes sophisticated parallel database engine capabilities to support complex reporting queries at high concurrency and throughput.

• Enables hosted applications to seamlessly access and join structured data in Trafodion tables and semi-structured data in native HBase and Hive tables, without expensive replication or data movement overhead. All references to Hive tables in this white paper are to tables with textfile and sequence file formats registered in the Hive catalog; other Hive-supported file formats, such as ORC files, are not currently supported.

• Provides interoperability with new or existing applications and 3rd party tools via support for standard ODBC and JDBC access.

• Fits seamlessly within the existing IT infrastructure, with no vendor lock-in, remaining neutral to the underlying Linux and Hadoop distributions.


Targeted Hadoop workload profile

Hadoop workloads can be broadly categorized into four types, as shown in Figure 1: Batch, Non-Interactive, Interactive, and Operational. These categories vary greatly in their response time expectations, concurrency, and the amount of data typically processed. The leftmost three categories are where the marketplace (vendors and customers) has predominantly focused its attention. For the most part these categories represent efforts centered on “analytics” and business intelligence processing on “big data” problems. These workloads are well positioned to leverage Hadoop's strengths and capabilities.

In contrast, the rightmost workload, “Operational,” is an emerging Hadoop market category. Traditionally these workloads have been the domain of relational databases, but there is growing interest and pressure to embrace them in Hadoop, due to Hadoop's perceived benefits of significantly reduced costs, reduced vendor lock-in, and the ability to seamlessly scale to larger workloads and data volumes. This is exactly the workload that Trafodion targets. Let's look next at the characteristics and requirements of this workload to better understand how Trafodion addresses them.

Figure 1. Hadoop Workload Profiles

Transactional SQL application characteristics and challenges

Transaction-protected operational workloads are typically deemed mission critical because they help companies make money, touch their customers or prospects, or run and operate their business. They typically have very stringent requirements for response times (sub-second), transactional data integrity, number of users, concurrency, availability, and data volumes. With the advent of the growing Internet of Things, the number and types of devices have driven tremendous transaction and data growth, along with changes in the type of data that must be captured and used as part of these transactions. Next-generation operational applications often require multi-structured data types: operational data is evolving rapidly to include a variety of data formats and types, for example structured transactional data combined with text messages, review comments, visual images, etc.

Combined, these requirements can expose Hadoop limitations in transaction support, zero lost transactions even in the face of disasters, bulletproof data integrity, sub-second response times, operational query optimization, and managing workloads comprised of a complex mix of concurrently executing transactions with varying priorities. EsgynDB addresses each of these limitations, and as a result provides a differentiated DBMS capable of hosting these applications and their data.

EsgynDB innovations built upon Hadoop software stack

EsgynDB builds on Trafodion, which in turn is designed to build upon and leverage Apache Hadoop and HBase core modules. Operational applications using Trafodion transparently gain Hadoop’s advantages of affordable performance, scalability, elasticity, availability, etc. Figure 2 depicts a subset of the Hadoop software stack. Items in orange are specifically leveraged by Trafodion, namely HBase, HDFS, and Zookeeper. To this stack, Trafodion adds (items in green) ODBC/JDBC drivers, the Trafodion database software, and a new distributed transaction management (DTM) subsystem for transaction protection across multiple HBase regions.

Trafodion interfaces to Hadoop services using standard APIs, making it Hadoop distribution neutral and eliminating vendor lock-in by offering customers a choice of distributions.


Trafodion is initially targeted to deliver innovation on top of Hadoop in these key areas:

• A full-featured ANSI SQL implementation whose database services are accessible via a standard ODBC/JDBC connection

• A SQL relational schema abstraction which makes Trafodion look and feel like any other relational database

• Distributed ACID transaction protection

• Full cross data-center active-active distributed transaction support, provided by EsgynDB, to scale reads and writes, support local access and comply with safe harbor rules, with zero lost transactions in a disaster

• Performant response times for transactions comprised of both reads and writes

• Parallel optimizations for both transactional and operational reporting workloads

Figure 2. Trafodion EsgynDB and Hadoop Ecosystem

Leveraging HBase for performance, scalability, and availability

As stated previously, Trafodion is able to leverage all of the features and thereby all the advantages attributed to HBase including parallel performance, virtually unlimited scalability, elasticity, and availability protection.

These features are key to supporting operational workloads in production. For example:

• Fine-grained load balancing, scalability, and parallel performance are provided via standard HBase services such as autosharding of Trafodion table data across multiple regions and region servers.

• Data availability and recovery in the event a server or disk fails or is decommissioned is provided by standard Hadoop and HBase services such as replication and snapshots.

• EsgynDB extends these capabilities provided by Trafodion to support full ACID distributed reads and writes across data centers. This provides the ability to scale reads and writes across clusters, distribute data based on locality of access and safe harbor rules, and guarantee zero lost transactions in case of a disaster.

Trafodion can transparently leverage Hadoop distribution-specific features and capabilities (e.g. Cloudera, Hortonworks), since it accesses these distribution services via native HBase APIs. It integrates with HBase filters and coprocessors for very high performance database updates and query access, as well as for its distributed transaction support. As a result, powerful features such as compression, versioning, or cell-level security can be leveraged for Trafodion tables.

EsgynDB innovation – value add improvements over vanilla HBase

Although Trafodion stores its database objects in HBase/HDFS storage structures, it differs from and brings value-add over vanilla HBase in a multitude of ways as described below:

• It provides a relational schema abstraction on top of HBase, which allows customers to leverage known and well tested relational design methodologies and SQL programming skills.


• From a physical layout perspective, Trafodion uses standard HBase storage mechanisms (column family store using key-value pairs) to store and access objects. It leverages the HBase multiple column family support to separate out more frequently updated or accessed columns from less frequent ones, or to separate out infrequently accessed large columns into their own column family. Trafodion incorporates a column name encoding mechanism to save space on disk and to reduce messaging overhead for the purposes of improving SQL performance.

• Trafodion defined columns are assigned specific data types that are enforced by Trafodion when inserting or updating its data contents, unlike vanilla HBase which treats stored data as an uninterpreted array of bytes. This not only greatly improves data quality/integrity, it also eliminates the need to develop application logic to parse and interpret the data contents.

• Trafodion also provides an optional aligned format, where all column values for a logical relational tuple are stored in a single HBase column value, thereby mapping a logical row to a single HBase row. For low-update, query-heavy workloads that access tables with a large number of columns, this can provide a substantial performance boost.

• Trafodion extends ACID protection to application-defined transactions that can span multiple SQL statements, tables, and rows. Vanilla HBase provides ACID transaction protection only at the row level. This greatly improves database integrity by protecting against partially completed transactions, ensuring that either the whole transaction is materialized in the database or none of it is.

• EsgynDB extends these Trafodion capabilities further by allowing a user to specify that a table be replicated across specified data centers. Distributed transaction management across data centers ensures that data is written out synchronously, in parallel, to the other data centers as part of the transaction, guaranteeing zero loss of transactionally committed data in case of a disaster, with minimal performance overhead. In contrast, HBase asynchronously replicates the write-ahead log of row-level transactionally committed data, which does not guarantee zero lost transactions.

• Trafodion’s API is ANSI SQL, a familiar and well-known programming interface that allows companies to leverage existing SQL knowledge, skills, and tools. In contrast, HBase’s native API is very low level and not a commonly used programming interface.

• Trafodion supports the common relational practice of allowing the primary key to be a composite key comprised of multiple columns, unlike HBase’s key structure which is comprised of a single uninterpreted array of bytes.

• Trafodion supports the creation of secondary indexes that can be used to speed transaction performance when accessing row data by a column value that is not the row key. Index to base table consistency is guaranteed via ACID transactional support. Vanilla HBase offers no such capability.
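The all-or-nothing guarantee described above can be sketched with a toy funds transfer. This sketch uses Python's stdlib sqlite3 purely as a stand-in transactional store; Trafodion provides the same semantics across HBase rows and regions via its distributed transaction manager, and the table and amounts here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    """Debit src and credit dst as one transaction; roll back on any failure."""
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        if fail_midway:
            # simulate a failure between the two statements
            raise RuntimeError("server failure mid-transaction")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))
        conn.commit()
    except Exception:
        conn.rollback()  # neither update is materialized

transfer(conn, 1, 2, 30, fail_midway=True)   # rolled back: balances unchanged
transfer(conn, 1, 2, 30)                     # both updates commit together
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
# balances == {1: 70, 2: 80}
```

Without multi-statement transaction protection, the simulated failure would leave the debit applied but the credit lost, which is exactly the partial-transaction state Trafodion prevents.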

Salting of row keys

One known problem area for HBase is transactional workloads where data is inserted into a table in row key order. When this happens, all of the I/O is concentrated in a single HBase region, which creates a server and disk hotspot and a performance bottleneck. To alleviate this problem, Trafodion provides an innovative feature called “salting” the row key.

To enable this feature, the DBA specifies the number of partitions (i.e., regions) the table is to be split over when creating the table, e.g. “SALT USING 4 PARTITIONS”. Trafodion creates the table pre-split with one region per salt value. An internal hash value column, “_SALT_”, is added as a prefix to the row key. Salting is handled automatically by Trafodion and is transparent to application-written SQL statements. As data is inserted into the table, Trafodion computes the salt value and directs the insert to the appropriate region. Likewise, Trafodion calculates the salt value when data is retrieved from the table and automatically generates predicates where feasible. MDAM technology (described in more detail in the section entitled “Optimizations for transactional SQL workloads”) makes this process especially efficient. This is a very lightweight operation with little overhead or impact on direct key access operations.

The benefits of salting are more even data distribution across regions and improved performance via hotspot elimination.

EsgynDB extends this Trafodion capability by enabling multiple salt values to map to a single region. This facilitates splitting the data in HBase regions as the cluster grows: the data is spread across more nodes, at salt value boundaries, to rebalance I/O across the cluster. Enough values can be mapped to a region to accommodate multiple future expansions of the cluster.
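The mechanics can be sketched roughly as follows. The hash function, salt count, and salt-to-region mapping below are illustrative assumptions, not Trafodion's or EsgynDB's internal implementation:

```python
import hashlib

NUM_SALTS, NUM_REGIONS = 16, 4   # illustrative values

def salt_value(row_key: str) -> int:
    """Hash the row key into a small salt bucket (stand-in hash)."""
    digest = hashlib.md5(row_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SALTS

def salted_key(row_key: str) -> tuple:
    # the computed "_SALT_" value is prefixed to the application row key,
    # so sequentially inserted keys spread across regions instead of
    # concentrating in one
    return (salt_value(row_key), row_key)

def region_for(salt: int) -> int:
    # EsgynDB variant: several salt values map to one region, so a region
    # can later be split at a salt-value boundary as the cluster grows
    return salt * NUM_REGIONS // NUM_SALTS

# sequential order numbers land in many salt buckets, not one region:
buckets = {salt_value(f"order-{i:08d}") for i in range(1000)}
```

With 16 salt values over 4 regions, each region can be split at salt boundaries twice more before the salt values are exhausted.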


Another capability added by EsgynDB is a SPLIT BY option that can be used instead of SALT: explicit values can be specified for each region for the columns used to split the data across HBase regions. This provides the equivalent of range partitioning of the table across partition boundaries, whereas SALT supports more of a hash partitioning paradigm.
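The contrast with SALT can be sketched with a minimal range-lookup, assuming hypothetical split boundaries on a numeric column:

```python
import bisect

# Hypothetical SPLIT BY boundaries on a customer_id column; each region
# holds one contiguous key range (range partitioning), in contrast to
# SALT's hash-style spreading of keys
split_points = [1000, 2000, 3000]          # defines 4 regions

def region_for_key(customer_id: int) -> int:
    """Return the index of the region whose range contains customer_id."""
    return bisect.bisect_right(split_points, customer_id)

low = region_for_key(999)     # below the first boundary
mid = region_for_key(1000)    # first boundary starts region 1
high = region_for_key(5000)   # last region is open-ended
```

Because adjacent keys stay in the same region, range partitioning preserves key-order scans, at the cost of reintroducing the insert hotspot that SALT avoids.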

In summary, Trafodion, and EsgynDB in extending its capabilities, incorporate a number of enhancements over vanilla HBase to improve transaction performance, data integrity, and DBA/developer productivity, while reducing application complexity through the use of standard, well-known relational practices and APIs.

EsgynDB feature overview

Let’s now look at a high level overview of the Trafodion features. A more detailed drill down of each of these features is provided in the sections below. Trafodion includes:

• An enterprise-class SQL DBMS that provides all of the features you would expect from one of the merchant relational database products on the market. The difference is that Trafodion leverages Hadoop services (HBase/HDFS) for elastic scale, lower total cost of ownership, integration with semi-structured and unstructured data in HDFS, and reduced latency and duplication of data between proprietary operational deployments and the Hadoop workloads that need that data.

• Full-functioned ANSI SQL language support including data definition, data manipulation, transaction control, and database utilities. This includes features such as time based clustering of data, Referential Integrity, Stored Procedures, and User Defined Functions, amongst many others.

• Linux and Windows ODBC/JDBC drivers.

• Integration with Hibernate to provide ORM application development support for users that want to leverage an object model for application development. ORM models entities based on real business concepts rather than based on the database structure, thereby making application development easier.

• Distributed transaction management protection, including cross data center active-active support as provided by EsgynDB.

• Many SQL optimizations designed to improve operational and reporting workload performance.

All while retaining and extending expected Hadoop benefits! Now let’s dive into more details on these features.

Full-functioned ANSI SQL language support

Unlike most (if not all) NoSQL and other SQL-on-Hadoop products, Trafodion provides comprehensive ANSI SQL language support, including full-functioned data definition (DDL), data manipulation (DML), transaction control (TCL), and database utility support.

• Trafodion provides support for creating and managing traditional relational database objects including tables, views, secondary indexes, check constraints, unique constraints, and Referential Integrity constraints. This includes self-referential foreign keys, allowing enforcement of referential integrity for hierarchical relationships, such as ensuring that the manager assigned to an employee exists as an employee as well. Vanilla HBase is schemaless and requires applications to enforce relational interdependence.

• Columns (table attributes) are assigned Trafodion enforced data types including numeric, character, varchar, date, time, interval, etc. Internationalization (I18N) support is provided via Unicode encoding including UTF-8, UCS2, and ISO 8859-1 for both user data as well as the database metadata. Comparisons and data manipulation between differing data encodings is transparently handled via implicit casting and translation.

• Trafodion provides comprehensive and standard SQL data manipulation support including SELECT, INSERT, UPDATE, DELETE, and UPSERT/MERGE syntax with language options including join variants, unions, where predicates, aggregations (group by and having), sort ordering, sampling, correlated and nested sub-queries, cursors, and many SQL functions.

• Utilities are provided for updating the table statistics the optimizer uses to cost plan alternatives (i.e., selectivity/cardinality estimates), for displaying the chosen SQL execution plan, for plan shaping, and a command-line utility for interfacing with the database engine.

• Explicit control statements are provided to allow applications to define transaction boundaries and to abort transactions when warranted.


• ANSI’s GRANT/REVOKE semantics are used to define user and role privileges down to the column level, in terms of managing and accessing the database objects.
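The self-referential foreign key mentioned above (a manager must exist as an employee) can be illustrated with a small stand-in. This sketch uses Python's sqlite3 rather than Trafodion, which enforces the constraint natively; the table and names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # sqlite requires this opt-in
conn.execute("""
    CREATE TABLE employee (
        emp_id     INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        manager_id INTEGER REFERENCES employee(emp_id)
    )""")
conn.execute("INSERT INTO employee VALUES (1, 'Ada', NULL)")   # top of hierarchy
conn.execute("INSERT INTO employee VALUES (2, 'Grace', 1)")    # reports to Ada

try:
    # manager 99 does not exist as an employee, so the insert is rejected
    conn.execute("INSERT INTO employee VALUES (3, 'Alan', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

In a schemaless store like vanilla HBase, this check would have to be coded into every application that writes employee rows.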

Divisioning

Trafodion also supports a concept called “divisioning”. This enables creating a column based on an expression, such as the month extracted from a date or timestamp column, whereby data related to a specific time period can be clustered together within a region or salt partition. This column follows the salt key column. Divisioning facilitates clustering data by a time period commonly accessed by queries, providing fast access times, and it also eases deletion of data that can be aged out. MDAM access (discussed below) ensures good access to key columns following the salt and divisioning columns.
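A divisioned key's composition can be sketched as follows; the hash function and the month-granularity division column are illustrative assumptions:

```python
from datetime import datetime

NUM_SALT_PARTITIONS = 4   # illustrative

def clustering_key(order_ts: datetime, order_id: str) -> tuple:
    """Sketch of a divisioned key: (salt, division column, user key).
    The division column, a month computed from the timestamp, follows the
    salt value, so rows for one month sort adjacently within a salt
    partition and a whole month can be aged out as one key range."""
    salt = sum(order_id.encode()) % NUM_SALT_PARTITIONS   # stand-in hash
    month = order_ts.strftime("%Y-%m")                    # computed division column
    return (salt, month, order_id)

k = clustering_key(datetime(2016, 1, 15), "order-42")
# k[1] == "2016-01": January 2016 rows with the same salt cluster together
```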

Stored Procedures

Trafodion supports Stored Procedures in Java (SPJs), so users can write operational or business procedures in Java code to be executed on the server side when invoked by client applications. When a procedure executes multiple SQL statements, each of which would otherwise exchange data between client and server, it is more efficient to use a stored procedure to push that processing to the server side. Oracle PL/SQL and ANSI SQL stored procedures can be converted to SPJs.

User Defined Functions

Trafodion supports User Defined Functions (UDFs) as many other full-function DBMSs do. Besides supporting UDFs in C++ and Java, Trafodion supports Table Mapping UDFs, or TMUDFs, which allow users to code MapReduce-style algorithms. ANSI has a proposal to formalize a standard around a similar concept, calling it Polymorphic Table Functions (PTFs). These are very powerful UDFs: they can take in tables or streams of data, and can return tuples instead of just the scalar values that regular UDFs do. They can also be programmed to return a dynamic set of columns, determined by the UDF at initial invocation. The resulting rows from such a UDF can be treated like a table or sub-query within a query.

These functions can enable integration with other solutions in the Hadoop ecosystem, such as streaming data from Kafka, accessing in-memory data in Spark RDDs, doing text searches via SOLR, connecting to any JDBC data source, or even creating, deleting, searching, and reading MongoDB documents.

Loading tools

Trafodion provides high-speed bulk loading of data from HDFS text files directly into Trafodion tables, leveraging the HBase bulk loading capability. This utility works with indexed tables and provides error handling. There is also a fast parallel load tool, called ODB, which loads data directly from other RDBMSs, in parallel, into Trafodion. Most ETL tools that use ODBC or JDBC to connect to databases, such as Pentaho, can be used with Trafodion.

Manageability

EsgynDB provides a database manager. Here are some of the capabilities built into the EsgynDB Manager:

Workload visibility and control

• View active queries and query statistics to understand which queries are impacting workload performance

• Cancel a query if it looks like it may be impacting other workloads running on the system

• Historical query statistics and query plans, for forensic analysis, to understand performance problems in order to take corrective action; or for capacity planning for future increase in hardware resources to improve SLAs or accommodate growth

• Query Workbench to execute ad hoc queries and to analyze query plans to optimize them for future execution

System Monitoring and Health Checks

• Dashboard to display status of core subsystems, such as configured processes that are running or not running, or nodes that are up or down, to take immediate action to address failures

• Transaction counts (aborts / commits / begins) as a time series, to understand transaction performance and growth over time, or to address increases in transaction aborts

• View all available connection servers and connected sessions to help monitor which applications and users are currently connected, to assess the health of the system and its availability to users and applications

• Canary query response time as a time series: canary queries are run every 5 minutes to measure scan/write times and assess system performance. If the response times for these queries vary beyond a threshold from their normal values, that can indicate a degradation of system performance, and the user can be alerted to take immediate corrective action.


• Key system metrics such as IO waits, HBase Region Server memory usage, disk space usage, garbage collection time, graphed as time series, to analyze system performance as the system is running; or from a historical perspective, to detect issues and take action; or to understand growth for capacity planning, especially as new applications are deployed.

• Email generation or HTTP alerts when metrics exceed thresholds, so that variations from the norm that need attention can be immediately addressed by the appropriate people

• View events/logs for all EsgynDB components for forensic analysis, to understand problem areas, or even to monitor security violations
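The canary threshold check described above can be sketched as follows; the 1.5x deviation factor and the sample latencies are illustrative assumptions, not EsgynDB's actual thresholds:

```python
def canary_alert(latest_ms: float, baseline_ms: float, factor: float = 1.5) -> bool:
    """Flag a canary query whose latest response time deviates from its
    baseline by more than a threshold factor."""
    return latest_ms > baseline_ms * factor

samples = [12.0, 14.0, 30.0]   # periodic canary measurements, in ms
alerts = [t for t in samples if canary_alert(t, baseline_ms=10.0)]
# only the 30 ms sample exceeds 1.5x the 10 ms baseline
```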

Other capabilities

• Database authentication for all SQL management functions to ensure that the right users have secure access to the right management functions

• Persistence / High Availability of the management infrastructure, so that when the system is under stress or is impacted by failure, the management infrastructure is still available to analyze and address the problems

• REST server to automate or expose any needed action, providing flexibility for users to integrate the manageability environment into their own management deployments or to automate management tasks

EsgynDB software architecture overview

The Trafodion software architecture consists of 3 distinct layers: the client layer; the SQL database services layer; and the storage engine layer (see Figure 3).

Figure 3. Trafodion's 3-layer software architecture

The first layer is the Client Services layer, where the operational application resides. The operational application can be either customer-written or enabled via a 3rd-party ISV tool/application. Access to the Trafodion database services layer is via a standard ODBC/JDBC interface using a Trafodion-supplied Windows or Linux client driver. Both type 2 and type 4 JDBC drivers are supported; the choice depends on the application's requirements for response times, number of connections, security, and other factors. EsgynDB extends this support with an ADO.NET driver.

The second layer is the SQL layer, which consists of all of the Trafodion database services. This layer encapsulates all of the services required for managing Trafodion database objects and efficiently executing submitted SQL requests. Services include connection management, SQL statement compilation and optimized execution plan creation, SQL execution (both parallel and non-parallel) against Trafodion database objects, transaction management, and workload management. Trafodion provides transparent parallel SQL execution as warranted, thereby eliminating the need for complex MapReduce programming.


The third layer is the Storage Engine layer, which consists of standard Hadoop services leveraged by Trafodion, including HBase, HDFS, and Zookeeper. Trafodion database objects are stored in native Hadoop (HBase/HDFS) storage structures. Trafodion transparently maps SQL requests into native HBase calls on behalf of the operational application, and provides a relational schema abstraction on top of HBase. In this way, traditional relational database objects (tables, views, secondary indexes) are supported using familiar DDL/DML semantics, including object naming, column definitions, data type support, etc.

Integrating with native Hive and HBase data stores

One of the more powerful capabilities of Trafodion is its extensibility to also support and access structured and semi-structured data stored in native Hive or HBase tables (non-Trafodion tables) using their native storage engines and data formats. The benefits that can be realized include:

• Ability to run queries against native HBase or Hive tables without needing to copy them into a Trafodion table

• Ability to update Trafodion tables and native HBase tables, with full ACID transactional support for both kinds of tables in the same transaction, regardless of the number of rows updated or statements executed within that transaction – it’s all or nothing

• Optimized access to HBase and Hive tables

• Ability to join data across disparate data sources (e.g. Trafodion, Hive, HBase)

• Ability to leverage HBase’s inherent schema flexibility capabilities – adding or dropping a column simply updates the Trafodion metadata, without having to reload the underlying data

Process overview and SQL execution flow

The Trafodion SQL Layer is comprised of a number of services or processes used for the purposes of handling connection requests and SQL execution.

• The process flow begins with the operational application or 3rd party client tool. The Windows or Linux client accesses the Trafodion DBMS via supplied ODBC/JDBC drivers.

• When the client requests a connection, Trafodion’s database connection services (DCS) layer processes the request and assigns the connection to a Trafodion Master SQL process. Trafodion uses Zookeeper to coordinate and manage the distribution of connection services across the cluster for load-balancing purposes, as well as to ensure that a client can immediately reconnect in the event the assigned Master process should fail.

• The Master process coordinates the execution of SQL statements passed from the client application.

• The Master calls upon the Compiler and Optimizer process (CMP) to parse, compile, and generate the optimized execution plan for the SQL statements.

• If the optimized plan calls for parallel execution, the Master divides the work among Executor Server Processes (ESP) to perform the work in parallel on behalf of the Master process. The results are passed back to the Master for consolidation. For complex queries (e.g. large n-way joins or aggregations), multiple layers of ESPs may be requested. If a non-parallel plan is generated, then the Master calls upon HBase services directly for optimal performance.

• For distributed transaction protection, the Trafodion DTM service is called upon to ensure the ACID protection of transactions across the Hadoop cluster. DTM executes these transactions, potentially distributed across region servers, via deep integration into HBase using coprocessors to manage the transaction context, detect transactional conflicts, write transactional data, and perform transactional recovery.

• Finally, vanilla HBase and HDFS services are called upon by either the Master or ESP processes using standard, native APIs to complete the I/O requests, i.e., retrieving and maintaining the database objects. Where appropriate, Trafodion pushes SQL execution down into the HBase layer using filters or coprocessors.


Optimizer technology

Optimizer technology represents one of Trafodion’s greatest sources of differentiation versus alternative SQL-on-Hadoop projects or products. There are two primary areas to call out: the first is the extensible nature of the optimizer to adapt to change and add improvements; and the second is the sophistication and maturity level of the optimizer to choose the best optimized plan for execution.

Extensible optimizer technology

Trafodion’s optimizer is based on the Cascades optimization framework. Cascades is recognized as one of the most advanced and extensible optimizer frameworks available. The Cascades framework is a hybrid optimization engine, in that it combines logical and physical operator transformation rules with costing models to generate the Trafodion Optimizer.

New rules or new costing models can be easily added or changed to generate an improved optimizer. In this way, the optimizer can quickly evolve, and new operators can be rapidly added or changed to improve SQL plan generation.

Optimized execution plans based on statistics

The second area of differentiation is the sophistication and maturity level of Trafodion’s optimizer technology. First let’s explain the role of the various elements of the optimizer:

SQL Normalizer – the parsed SQL statement is passed to the normalizer, which performs unconditional transformations, including subquery transformations, of the SQL into a canonical form that can be optimized internally.

SQL Analyzer – analyzes alternative join connectivity patterns, table access paths and methods, matching partition information, etc., to be used by the optimizer’s rules. The results are passed to the plan generator for consideration in costing various plan alternatives.

Table Statistics – captured equal-height histogram statistics identify data distributions for column data and correlations between columns. Sampling is used for large tables to reduce the overhead of generating the statistics.

Cardinality Estimator – cardinalities, data skew, and histograms are computed for intermediate results throughout the operator tree.

Cost Estimator – estimates node, I/O, and message cost for each operator while accounting for data skew at the operator level.

Plan Generator – using the cost estimates, the optimizer considers alternative plans and chooses the plan with the lowest cost. Where feasible, the optimizer selects plans that incorporate SQL pushdown, sort elimination, and in-memory operation vs. overflow to disk. It also determines the optimal degree of parallelism, including non-parallel plans.

In summary, the optimizer is designed to choose the execution plan that minimizes the system resource used and delivers the best response time. It provides optimizations for both operational transactions and reporting workloads.
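The equal-height histograms the optimizer relies on can be illustrated with a minimal sketch (this is illustrative Python, not Trafodion code; the bucket-building approach is a simplified assumption). Each bucket covers roughly the same number of rows, so a heavily repeated value shows up as one or more buckets with a zero-width value range, which is exactly why this histogram shape is good at revealing skew:

```python
# Illustrative sketch: an equal-height histogram, where every bucket holds
# roughly the same number of rows. Skewed values are revealed as buckets
# whose low and high boundary collapse to a single value.

def equal_height_histogram(values, num_buckets):
    """Return a list of (low, high, count) buckets of ~equal row counts."""
    ordered = sorted(values)
    bucket_size = max(1, len(ordered) // num_buckets)
    buckets = []
    for start in range(0, len(ordered), bucket_size):
        chunk = ordered[start:start + bucket_size]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

# A column where the value 7 dominates (heavy skew).
column = [7] * 80 + list(range(20))
hist = equal_height_histogram(column, 5)

# Zero-width buckets (low == high) are the skew signal.
skewed = [b for b in hist if b[0] == b[1]]
```

An equal-width histogram over the same column would hide the skew inside one wide bucket; the equal-height form surfaces it directly, which is what the cardinality and cost estimators need.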

Optimizer Differentiating Capabilities

Large Scope Rules

Some other databases also use the Cascades framework. However, Cascades can result in long compile times, since the search space for an optimal plan can be fairly large for a complex query. In such cases optimizers use rules or heuristics to reduce that search space. Trafodion has an added capability to do this, called Large Scope Rules, which detect query patterns and use them to reduce the search space dramatically. For example, if a number of small dimension tables are being joined to a fact table, most optimizers will favor a full fact table scan with subsequent hash joins to those dimension tables. Trafodion, on the other hand, will consider a cross-product join amongst the results of the dimension tables after applying predicates, and then consider a nested join to the fact table. This can quickly result in a plan with efficient access to the fact table.


Trafodion also combines these techniques with a branch-and-bound strategy: it tries to obtain a good-enough plan at the outset in the fashion described above, and then quickly abandons the search once additional plans stop improving on it, the point of diminishing returns.

Skew Buster

A huge issue in parallel query execution, especially as the number of nodes scales up, is dealing with skew in the data: not at the storage level, where it can be managed by good partitioning design, but at the query operation level, such as when a join or aggregation is being performed. In this case a query running on 100 nodes could end up with most of the data being processed by a single node (or a handful of nodes) due to skew. The query then takes a long time to execute, and the skewed utilization of cluster resources impacts all other queries running concurrently. To address this, Trafodion detects skew at any level in the execution tree by calculating cardinalities at all levels of the tree, using equal-height histograms that are particularly good at revealing skew. It then applies various strategies, repartitioning or broadcasting the inner and outer children and handling skewed values differently from the rest of the values, to eliminate skew, so that queries execute in minimal time, use minimal system resources, and do not impact other concurrent workloads.
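The core idea of handling skewed values differently can be sketched in a few lines (a conceptual simplification, not the Skew Buster implementation; the round-robin policy for skewed keys is an assumed stand-in for the real repartition/broadcast strategies):

```python
# Conceptual sketch: once the histogram flags a join key as heavily skewed,
# rows with that key are spread round-robin across all workers instead of
# being hashed, which would pile them all onto a single worker.

def assign_worker(key, row_index, skewed_keys, num_workers):
    if key in skewed_keys:
        # Skewed key: spread its rows across every worker.
        return row_index % num_workers
    # Normal key: plain hash partitioning.
    return hash(key) % num_workers

# 90 rows share one "hot" key; 10 rows have distinct keys.
rows = [("hot", i) for i in range(90)] + [("k%d" % i, i) for i in range(10)]
assignments = {}
for idx, (key, _) in enumerate(rows):
    w = assign_worker(key, idx, {"hot"}, 4)
    assignments[w] = assignments.get(w, 0) + 1

# With skew handling, no single worker receives all 90 "hot" rows.
assert max(assignments.values()) < 90
```

With plain hash partitioning, all 90 hot rows would land on one worker and serialize the join; spreading them keeps every worker roughly equally loaded.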

Adaptive Segmentation

Trafodion can execute any operation of a query using a different degree of parallelism. It uses the histogram statistics to estimate the cardinality of the rows it will need to process at each operational step (join, aggregation, etc.), and then determines the degree of parallelism for the entire query based on those cardinalities. For example, on a 100-node cluster it may find that a specific query needs to run across only 5 of the nodes for the estimated number of rows being processed. This not only uses the appropriate level of system resources, such as memory, compute, and messaging, but also allows many more queries to execute concurrently than would otherwise be possible. If there is skew in a query, the lower degree of parallelism addresses that as well. And if a node fails, not all queries are impacted, in contrast to a database where every node executes every query. The illustration above shows a 128-node cluster with queries running at 32, 64, and 128 degrees of parallelism. Trafodion tries to limit the segments it uses for adaptive segmentation so that it is easier to load-balance queries across those segments, steering work to the segments that are less utilized. This capability is invaluable as the cluster is expanded with newer nodes that have more compute, memory, and I/O bandwidth, enabling them to run more queries than the older segments of the cluster.
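The sizing decision can be sketched as a simple function (an illustration only; the rows-per-executor threshold is an invented value, not a Trafodion parameter):

```python
# Hedged sketch: derive a degree of parallelism from the estimated row
# count, capped by the cluster size, in the spirit of adaptive segmentation.

ROWS_PER_ESP = 1_000_000  # assumed target workload per executor process

def degree_of_parallelism(estimated_rows, cluster_nodes):
    # Ceiling division: enough executors to cover the estimated rows.
    needed = max(1, (estimated_rows + ROWS_PER_ESP - 1) // ROWS_PER_ESP)
    return min(needed, cluster_nodes)

print(degree_of_parallelism(5_000_000, 100))   # -> 5 nodes suffice
print(degree_of_parallelism(500, 100))         # -> 1, a serial plan
print(degree_of_parallelism(10**9, 100))       # -> 100, the whole cluster
```

A small query thus consumes only a sliver of the cluster, which is what lets many queries run concurrently and contains the blast radius of a node failure.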

Data flow SQL executor technology with optimized Degree of Parallelism

Trafodion’s SQL executor uses a dataflow and scheduler-driven task model to execute the optimized query plan. Each operator of the plan is an independent task and data flows between operators through in-memory queues (up and down) or by interprocess communication. Queues between tasks allow operators to exchange multiple requests or result rows at a time. A scheduler coordinates the execution of tasks and runs whenever it has data in one of its input queues. Trafodion’s executor model is starkly different from alternative SQL-on-Hadoop implementations that store intermediate results on disk. In most cases, the Trafodion executor is able to process queries with data flowing entirely through memory, providing superior performance and reduced dependency on disk space and I/O bandwidth. Only for a large hash join or sort, where Trafodion detects memory pressure, does it gracefully overflow to disk. The executor incorporates several types of parallelism, such as:


• Partitioned parallelism, which is the ability to work on multiple data partitions in parallel. In a partitioned parallel plan, multiple operators all work on the same plan. Results are merged by using multiple queues, or pipelines, enabling the preservation of the sort order of the input partitions. Partitioning is also called “data parallelism” because the data is the unit that gets partitioned into independently executable fractions. Each ESP can access a single region or multiple regions. Multiple ESPs can even access parts of the same region or salt partition. This depends on the optimal access decided by the optimizer based on cardinalities, the number of nodes in the cluster, and the number of cores per node.

• Pipelined parallelism is an inherent feature of the executor resulting from its dataflow architecture. This architecture interconnects all operators by queues, with the output of one operator piped as input to the next, and so on. The result is that each operator works independently of any other operator, producing its output as soon as its input is available. Pipelining occurs naturally and is engaged in almost all query plans. The only blocking operation in this data flow is a sort, since it must complete before downstream operators can process data.

• Operator parallelism is also an inherent feature of the executor architecture. In operator parallelism, two or more operators can execute simultaneously, that is, in parallel. Except for certain synchronization conditions, the operators execute independently.

Trafodion naturally provides parallelism without special processing such as Hadoop map-reduce programming or coding on the part of the application client. An individual query plan produced by the optimizer can contain any combination of partitioned, pipelined, or operator parallelism. The degree of parallelism at any plan stage may vary depending on the cardinality of rows being processed at that stage and the optimizer’s heuristics.
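The dataflow and pipelining behavior described above can be modeled in miniature with Python generators (a conceptual analogy only, not how the executor is implemented): each operator is an independent stage that produces output as soon as input is available, and rows flow entirely through memory without materializing intermediates.

```python
# Conceptual sketch of a dataflow query pipeline: scan -> filter -> project.
# Each generator is an "operator" that pulls rows from its input as they
# become available, so no intermediate result is ever materialized on disk.

def scan(table):
    for row in table:            # produce rows one at a time
        yield row

def filter_op(rows, predicate):
    for row in rows:             # starts as soon as scan yields a row
        if predicate(row):
            yield row

def project(rows, column):
    for row in rows:             # likewise consumes its input lazily
        yield row[column]

table = [{"id": i, "qty": i * 3} for i in range(10)]
plan = project(filter_op(scan(table), lambda r: r["qty"] > 12), "id")
result = list(plan)              # rows flowed through the whole pipeline
```

A sort inserted into this chain would be the one blocking stage: it must consume all of its input before emitting anything, matching the blocking-operator caveat above.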

Optimizations for transactional SQL workloads

Trafodion provides many compile and run-time optimizations for varying operational workloads ranging from singleton row accesses for OLTP like transactions to highly complex SQL statements used for reporting purposes. Figure 4 depicts a number of these optimization features:

• A Type 2 JDBC driver may be used which provides the client direct JNI access to HBase services to minimize service times

• For many OLTP like transactions, the Master can issue “directed” key access requests to HBase without needing intermediate ESP processes.

• For transactions including highly complex SQL statements (e.g. n-way joins or aggregations requiring rebroadcasting or redistribution of data), a parallel plan involving ESPs or multiple layers of ESPs can be used to significantly reduce the service time.

Additional optimizations include:

• Masters and ESPs are retained after a connection is dropped and can be reused, thereby eliminating the startup and shutdown overhead.

• Multiple ESPs on a node are combined into a single multi-segment process to reduce the number of processes.

• Compiled SQL plans are cached thereby eliminating unnecessary recompilation overhead. This caching is done at various stages of the compilation process and is not simply a text matching capability. Different plans are cached for similar queries if cardinalities of the values in their predicates differ substantially resulting in different execution plans.
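The cardinality-aware plan caching described in the last bullet can be sketched as follows (the bucket names and thresholds are invented for illustration; the real cache operates on compiler internals, not a Python dict):

```python
# Hedged sketch: plan caching that is more than text matching. The same
# SQL text maps to different cached plans when the predicate cardinality
# estimate falls into a different bucket, since the optimizer would then
# choose a different execution plan.

plan_cache = {}
compiles = []   # record of full compilations, to show cache hits

def cardinality_bucket(estimated_rows):
    # Coarse buckets so similar cardinalities share one cached plan.
    if estimated_rows < 1_000:
        return "small"
    if estimated_rows < 1_000_000:
        return "medium"
    return "large"

def get_plan(sql_text, estimated_rows):
    key = (sql_text, cardinality_bucket(estimated_rows))
    if key not in plan_cache:
        compiles.append(key)              # full compile only on a miss
        plan_cache[key] = ("plan-for", key)
    return plan_cache[key]

get_plan("SELECT * FROM t WHERE c = ?", 10)         # compiles a plan
get_plan("SELECT * FROM t WHERE c = ?", 500)        # cache hit, no compile
get_plan("SELECT * FROM t WHERE c = ?", 5_000_000)  # different plan compiled
```

Three executions of the same statement trigger only two compilations: the second call reuses the cached "small" plan, while the third, with a vastly different cardinality, justifies its own plan.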

Figure 4. Optimized parallel execution


• SQL pushdown using standard HBase services such as filters (e.g. start-stop key predicates and non-key predicates) and coprocessors (e.g. count aggregates).

• Secondary index support.

• Multidimensional Access Method (MDAM) to accelerate row retrieval performance using “dimensional” predicates. For example, assume you have a table where the row-key is Week, Item, and Store, but the application supplies only Item and Store predicates. Without MDAM, the DBMS would have to perform a full table scan, or a secondary index on Item and Store would have to be created. In contrast, MDAM utilizes the inherent HBase clustering of row-keys to issue a series of probes and range jumps through the table, reading only the minimal set of rows required to process the SQL statement. MDAM usage extends to a broad range of data retrieval requests (e.g. IN lists on multiple key index columns, NOT equal (<>) predicates, multivalued predicates, etc.), thus improving response times and reducing the need for additional secondary indexes. It is also used to access tables with a “salted” row key efficiently, as well as for “divisioning”.

• Rowsets support, which is the ability to batch multiple SQL statements in a single request, thus reducing the number of message exchanges between the client and the database engine.

• Availability enhancements including service persistence (via Zookeeper) and automatic query resubmission.
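The MDAM probe-and-jump pattern from the Week/Item/Store example above can be simulated against a sorted key space (a conceptual sketch only; the real implementation probes HBase regions, not a Python list):

```python
# Illustrative MDAM-style skip scan: the row-key is (week, item, store),
# but the predicate only constrains item and store. Rather than scanning
# all rows, probe each distinct leading-key value (week) and jump directly
# to the one qualifying row inside it.
import bisect

# Rows sorted by composite key, as HBase keeps them.
rows = sorted((week, item, store)
              for week in range(1, 53)
              for item in ("A", "B", "C")
              for store in (1, 2, 3))

def mdam_scan(rows, item, store):
    """For each distinct week, seek directly to (week, item, store)."""
    results, probes, i = [], 0, 0
    while i < len(rows):
        week = rows[i][0]
        # Range-jump to the exact key within this week's cluster.
        j = bisect.bisect_left(rows, (week, item, store))
        probes += 1
        if j < len(rows) and rows[j] == (week, item, store):
            results.append(rows[j])
        # Skip past the remainder of this week.
        i = bisect.bisect_left(rows, (week + 1,))
    return results, probes

matches, probes = mdam_scan(rows, "B", 2)
# 52 probes (one per week) instead of touching all 468 rows.
```

The scan issues one probe per distinct Week value, so the cost grows with the cardinality of the leading key column, not the table size, which is why MDAM can substitute for a secondary index on Item and Store.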

Figure 5 below summarizes many of the Trafodion optimizations discussed to this point. This demonstrates that Trafodion provides optimizations for both operational transaction workloads that typically have very stringent response time requirements (e.g. sub-second in nature) as well as operational query and reporting workloads that typically have more relaxed response time requirements (e.g. minutes to hours) and may include SQL statements that require highly complex SQL operations that are best run in a parallel manner.

Figure 5. Trafodion workload optimizations

Operational transactions

• Key-based access optimizations

• Pushdown technology, e.g. filters, coprocessors

• Optimizations using database statistics that identify skewed data distributions

• Efficient access via non-parallel and parallel plans

• Compiler and execution speedup using native SQL expression optimizations

• Query plan caching to eliminate unnecessary recompilations

• Secondary index support with parallel access and maintenance

• Multiple ODBC/JDBC drivers in support of varying configuration and performance requirements

• Transparent leveraging of HBase API optimizations

Operational queries and reporting

• Massive parallelism invoked automatically for large complex queries

• Parallel query execution using ESP and multi-level ESP parallelism

• In-memory vs. big memory operation overflow optimizations

• Parallel n-way join and aggregation algorithms including hybrid hash, nested, merge, etc.

• Table structure optimizations including salted keys and compressed column name encoding

• Rowset support minimizes the impact to the network and database when retrieving or inserting a large batch of rows

EsgynDB innovation - Distributed Transaction Management

Vanilla HBase provides only single-table, row-level ACID protection. Trafodion’s distributed transaction management (DTM) extends transaction protection to transactions spanning multiple SQL statements, multiple tables, or multiple rows of a single table. Additionally, the Trafodion DTM provides protection in a distributed cluster configuration across multiple HBase regions using an inherent two-phase commit protocol. Transaction protection is automatically propagated across Trafodion components and processes.
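The two-phase commit protocol the DTM relies on can be sketched in simplified form (conceptual only; the actual DTM runs inside HBase coprocessors, and this toy model omits logging, timeouts, and recovery):

```python
# Simplified two-phase commit: the coordinating transaction manager asks
# every participating region to prepare (phase 1, voting); only if all
# vote yes does it instruct them to commit (phase 2).

class Region:
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.state = name, healthy, "active"

    def prepare(self):
        # Phase 1: the region durably stages its writes and votes.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(regions):
    if all(r.prepare() for r in regions):   # phase 1: collect votes
        for r in regions:                   # phase 2: commit everywhere
            r.commit()
        return "committed"
    for r in regions:                       # any "no" vote aborts all
        r.rollback()
    return "aborted"

ok = two_phase_commit([Region("r1"), Region("r2")])
bad = two_phase_commit([Region("r1"), Region("r2", healthy=False)])
```

Either every region commits or none does, which is the all-or-nothing guarantee the surrounding text describes for multi-row, multi-table transactions.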


The DTM provides support for implicit (auto-commit) and explicit (BEGIN, COMMIT, ROLLBACK WORK) transaction control. Using HBase’s Multi-Version Concurrency Control (MVCC) algorithm, Trafodion allows multiple transactions to access the same rows concurrently. However, in the case of an update, the first transaction to complete wins, and the other transactions are notified at commit time that their transaction failed due to an update conflict.
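The first-committer-wins rule can be demonstrated with a minimal conflict-detection sketch (an illustration of the principle, not the HBase/Trafodion internals; the version-counter scheme here is an assumed simplification of MVCC):

```python
# Conceptual first-committer-wins under MVCC: each transaction captures a
# snapshot of row versions at begin time; at commit, a write is rejected
# if any written row was committed by someone else after the snapshot.

class Store:
    def __init__(self):
        self.versions = {}               # row -> committed version number

    def begin(self):
        return dict(self.versions)       # snapshot of versions at start

    def commit(self, snapshot, writes):
        for row in writes:
            # Conflict: the row changed underneath this transaction.
            if self.versions.get(row, 0) != snapshot.get(row, 0):
                return False
        for row in writes:
            self.versions[row] = self.versions.get(row, 0) + 1
        return True

store = Store()
t1 = store.begin()
t2 = store.begin()                       # both see the same snapshot
first = store.commit(t1, ["row42"])      # first writer commits
second = store.commit(t2, ["row42"])     # second is rejected: conflict
```

Both transactions read concurrently without blocking each other; only at commit does the later writer learn its update conflicts, matching the behavior described above.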

Figure 6: Distributed Transaction Management Architecture

Trafodion has a distributed transaction management architecture that is completely scalable. There is a transaction manager on each node, thus distributing the transaction coordination to the node initiating the transaction. Transaction workloads are balanced across nodes by the Database Connectivity Services. The context for the transaction is managed in the HBase coprocessor at the region level. Each region manages transactional updates and conflict resolution of data at its local level. Recovery is done in parallel as well and is very efficient. Details of this architecture are covered in a separate white paper on DTM.

High availability and data integrity features

Trafodion leverages the inherent availability and data integrity features of HBase and HDFS.

Hadoop, HDFS, and HBase HA features and their benefits:

• Name Node Redundancy – protects against a name node failure

• HBase Replication (asynchronous) – copies data between HBase deployments, providing a disaster recovery solution

• HDFS Replication (data block copies) – provides data protection against node and disk failures or data corruption

• HBase Snapshot – takes a snapshot of a table version at a particular point in time, enabling recovery back to that point in time for that table

• Zookeeper – enables highly reliable distributed coordination of Hadoop-hosted services

Additionally, Trafodion can leverage any enterprise-class availability extensions offered by the Hadoop distribution. On top of the HBase and HDFS features, Trafodion provides a number of high availability features, including:

• Persistent connectivity services to ensure that a client is able to reestablish a connection in the event its DCS service fails

• Automatic query resubmission (AQR) which resubmits a failed SQL statement under certain conditions

• Transactional support across HBase Region Server splits and rebalancing, so that your application is always online and uninterrupted, as the data in HBase is split and rebalanced across regions for better performance and scale

• Dramatically reduced backup window to minutes as opposed to hours, moving towards full online backup


Active-Active zero lost transactions across data centers

EsgynDB extends the Trafodion high availability story by providing synchronous replication of full SQL ACID transactional updates across multiple clusters, potentially across multiple data centers, for selected tables. This ensures that no transactional updates are lost in case of a disaster. It supports multiple active-active masters, meaning that both reads and writes can be performed on these multiple clusters across data centers. This facilitates scaling workloads across clusters and data centers, provides local access to data, and supports safe-harbor capabilities. A blog post covers this topic in more detail.

Summary of EsgynDB benefits

Trafodion delivers on the promise of a full featured and optimized transactional SQL-on-Hadoop DBMS solution with full transactional data protection. This combination of HBase and an enterprise-class transactional SQL engine addresses Hadoop’s weaknesses in terms of supporting operational workloads.

Customers gain the following recognized benefits:

• Ability to leverage their in-house SQL knowledge and expertise versus having to learn complex map/reduce programming.

• Seamless support for existing and new customer written or ISV operational applications drives investment protection and improved development productivity.

• Workload optimizations provide the foundation for the delivery of next generation real-time transaction processing applications.

• Guaranteed immediate transactional consistency across multiple SQL statements, tables, and rows.

• Full disaster recovery protection with zero lost transactions and ability to scale reads and writes across clusters spread across data centers.

• Complements existing Hadoop investments and benefits – reduced cost, scalability, and elasticity.

• All with the backing of open source project sponsors.

About Esgyn and EsgynDB

Esgyn Corporation’s mission is to empower organizations to deploy new kinds of Big Data solutions. Esgyn is the leading contributor to the Apache Trafodion project, with an engineering team that has over 450 years of experience in developing massively parallel database technology. Esgyn’s premier offering is EsgynDB Enterprise, a hardened, secure, enterprise-class SQL-on-Hadoop solution built on Apache Trafodion. With offices in Silicon Valley and Shanghai, Esgyn offers support, services and training for EsgynDB that enterprises expect for their production environments.

For more information visit www.esgyn.com or email [email protected].

© 2015, 2016 Esgyn Corporation. Published January 2015