how auto microcubes work with indexing & caching to deliver a consistently fast business...

About Jethro


• What Does Jethro Do?
  – BI on Big Data acceleration
  – Reporting, dashboards, discovery, ad-hoc

• How It Works?
  – Indexing and caching server
  – Combines columnar SQL DB design with search-indexing technology

• Partnerships
  – BI: Tableau, Qlik
  – Hadoop: Cloudera, MapR, Hortonworks

SQL on Hadoop – Complementary Approaches

SQL-on-Hadoop Solutions
Full-Scan: Read all rows
• Hive / Tez
• Impala
• Presto
• SparkSQL
• Drill
• HAWQ
• IBM/Big SQL
• Actian
• Tajo
• …

JethroData
Index-Access: Read ONLY needed rows
• JethroData

Comparison:
• Full-Scan: Optimal for predictive & reporting
• Index-Access: Optimal for interactive BI

What Is Jethro for BI Tools?
An indexing & caching server

• BI tool uses live DB access
  – Sends SQL queries via ODBC / JDBC

• Jethro key performance features
  1. Full indexing – every column is indexed
  2. Result cache – every query is cached
  3. Auto Cubes – every repeatable pattern

• Everything stored in Hadoop
  – Cache, aggregations, index & column files, …

• Incrementally updated
  – Every day / hour / min

[Diagram labels: BI Tool, Live Access, HDFS]

Jethro: Enabling Unlimited Interactive BI for Big Data

[Diagram labels: Unlimited, Interactive, Big Data]

• Low concurrency, interactive
• Slow
• MPP speed = more resources
• Jethro: high speed, low resources, high concurrency

Interactive BI requires both speed and concurrency

Faster

• Indexes
• Hi Performance Execution
• Results cache
• Auto Cubes

Indexes – boosting filtered queries

• Indexes everything – every column, every value
• Filtering (WHERE clause expressions) is done against the indexes
• The more you filter, the faster you get the results – execution time depends on the size of the scanned data set
• Resources required per query are orders of magnitude lower, which enables high concurrency
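As a rough illustration of why heavier filtering means less work, here is a minimal sketch of index-based filtering. It assumes a simple value-to-row-id map per column; the names `build_indexes` and `filter_rows` are hypothetical, and this is not Jethro's actual index format.

```python
from collections import defaultdict

def build_indexes(rows):
    """Build one value -> set(row ids) index per column (hypothetical layout)."""
    indexes = defaultdict(lambda: defaultdict(set))
    for row_id, row in enumerate(rows):
        for column, value in row.items():
            indexes[column][value].add(row_id)
    return indexes

def filter_rows(indexes, predicates):
    """Return ids of rows matching every `column == value` predicate.

    Only index entries are read; the table itself is never scanned.
    """
    if not predicates:
        raise ValueError("this sketch only covers filtered queries")
    matching = None
    for column, value in predicates.items():
        ids = indexes[column].get(value, set())
        # Each extra predicate can only shrink the candidate set.
        matching = set(ids) if matching is None else matching & ids
    return matching

sales = [
    {"city": "NY", "item": "iPhone7", "year": 2016},
    {"city": "NY", "item": "Samsung7", "year": 2016},
    {"city": "NY", "item": "iPhone6", "year": 2015},
    {"city": "LA", "item": "iPhone7", "year": 2016},
    {"city": "LA", "item": "Samsung6", "year": 2015},
]
indexes = build_indexes(sales)
one_filter = filter_rows(indexes, {"year": 2016})
two_filters = filter_rows(indexes, {"year": 2016, "city": "NY"})
```

Adding the second predicate narrows the candidate rows from three to two, so any later aggregation touches fewer rows.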

What to do when indexes are not enough?

• The Challenge: How to provide interactive response time (seconds) for use cases that include wide queries with little or no filtering

• Our Approach: Add CUBES technology, which is complementary to INDEXES

• Jethro rule: Make this absolutely seamless to the user

Traditional OLAP Cubes – The Short Story

Cube:
  Select City, Item, Year, sum(sold_price) from Sales group by City, Item, Year

Queries that can use the cube:
  Select City, Item, sum(sold_price) from Sales where Year=2016 group by City, Item
  Select Item, sum(sold_price) from Sales group by Item

City  Item      Year  sum(sold_price)
NY    iPhone7   2016  $50,000
NY    Samsung7  2016  $40,000
NY    iPhone6   2015  $42,500
LA    iPhone7   2016  $70,000
LA    Samsung6  2015  $35,000
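The two example queries can be answered from the cube rows alone, without going back to the Sales table. The following is a minimal sketch of that re-aggregation using the five rows shown on the slide; the helper `query_cube` is a hypothetical name used only for illustration.

```python
from collections import defaultdict

# Cube rows from the slide: (City, Item, Year) -> sum(sold_price)
cube = {
    ("NY", "iPhone7", 2016): 50000,
    ("NY", "Samsung7", 2016): 40000,
    ("NY", "iPhone6", 2015): 42500,
    ("LA", "iPhone7", 2016): 70000,
    ("LA", "Samsung6", 2015): 35000,
}
DIMENSIONS = ("City", "Item", "Year")

def query_cube(cube, group_by, where=None):
    """Answer a sum() query by filtering and re-aggregating cube rows."""
    where = where or {}
    result = defaultdict(int)
    for key, total in cube.items():
        row = dict(zip(DIMENSIONS, key))
        if all(row[dim] == value for dim, value in where.items()):
            result[tuple(row[dim] for dim in group_by)] += total
    return dict(result)

# Select City, Item, sum(sold_price) ... where Year=2016 group by City, Item
by_city_item_2016 = query_cube(cube, ["City", "Item"], {"Year": 2016})

# Select Item, sum(sold_price) ... group by Item
by_item = query_cube(cube, ["Item"])
```

This works because sum() is additive: any query whose grouping and filter columns are a subset of the cube's dimensions can be computed from the cube.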

Traditional OLAP Cubes

• Performance: Fast response time for queries that hit the cube

• Concurrency: Low resource footprint per query, enabling high concurrency

• Use Case: Works great for static query environments

Not suitable for dynamic environments that support self-service and complex dashboards

Traditional OLAP Cubes: Challenges
• Hard to implement: Manually pre-defined, requires specialized tools and expertise
• Resource consuming: Heavy processing on cube creation that can affect global system performance
• Operational overhead: Keeping cubes up to date with source data is time and resource consuming
• Use case limitations: Size and operational limitations make cubes practically impossible to use for many use cases, such as:
  – Large number of dimensions
  – High-cardinality dimensions
  – Count distinct aggregators
  – Complex expressions
  – Many different queries

How to have your cake and eat it too
• Auto generated cubes
  – Cubes are automatically generated in the background based on actual user interaction
  – No expertise, no specialized tools, no pre-design
  – Unlimited access to the data

• Micro Cubes
  – Many micro cubes instead of a few gigantic cubes
  – Easily support many different queries

• Incremental
  – Auto cubes are incremental and automatically updated
  – Zero operational overhead
  – Stable performance unaffected by ongoing new data streaming
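To make the incremental idea concrete, here is a minimal sketch, assuming a micro cube is just a small map from a dimension key to a running sum. The functions `build_micro_cube` and `apply_increment` are hypothetical names, and the data is made up for illustration; Jethro's real cube storage is not described in the slides.

```python
from collections import Counter

def build_micro_cube(rows, dimensions, measure):
    """Aggregate sum(measure) per combination of the given dimensions."""
    cube = Counter()
    for row in rows:
        cube[tuple(row[d] for d in dimensions)] += row[measure]
    return cube

def apply_increment(cube, new_rows, dimensions, measure):
    """Fold newly loaded rows into an existing cube without rebuilding it."""
    # Counter.update adds the new partial sums to the existing ones.
    cube.update(build_micro_cube(new_rows, dimensions, measure))
    return cube

DIMS, MEASURE = ("city",), "sold_price"
loaded = [
    {"city": "NY", "item": "iPhone7", "sold_price": 500},
    {"city": "LA", "item": "iPhone7", "sold_price": 700},
]
new_batch = [
    {"city": "NY", "item": "iPhone7", "sold_price": 250},
    {"city": "LA", "item": "Samsung7", "sold_price": 400},
]

cube = build_micro_cube(loaded, DIMS, MEASURE)
apply_increment(cube, new_batch, DIMS, MEASURE)

# Rebuilding from scratch gives the same cube, at the cost of rereading all rows.
rebuilt = build_micro_cube(loaded + new_batch, DIMS, MEASURE)
```

Only the new batch is read during the update, which is one plausible reason performance can stay stable while data keeps arriving.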

How to have your cake and eat it too, cont.
• Complex queries normalization
  – Rewrite complex queries to reuse simplified common query blocks
  – Increase cube reusability

• Optimized for count distinct
  – Handle count distinct using value bitmaps
  – Handle count distinct without hitting cube size limitations

• Complementary to indexes
  – Use indexes for a large number of filters or high-cardinality dimensions
  – Maintain stable interactive performance by combining complementary indexes and cubes
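Count distinct is awkward for cubes because distinct counts cannot be added up across cells. One way value bitmaps can address this is sketched below, using Python integers as bitmaps. The function names and the sample data are assumptions for illustration, not Jethro's implementation.

```python
from collections import defaultdict

def build_distinct_cube(rows, dimensions, distinct_column):
    """Store, per cube cell, a bitmap of the distinct values seen there."""
    value_ids = {}              # value -> bit position
    bitmaps = defaultdict(int)  # cell key -> bitmap held in a Python int
    for row in rows:
        bit = value_ids.setdefault(row[distinct_column], len(value_ids))
        bitmaps[tuple(row[d] for d in dimensions)] |= 1 << bit
    return dict(bitmaps)

def count_distinct(bitmaps, keep):
    """Roll up to the dimension positions in `keep` by OR-ing bitmaps."""
    rolled = defaultdict(int)
    for key, bitmap in bitmaps.items():
        rolled[tuple(key[i] for i in keep)] |= bitmap
    # The number of set bits is the number of distinct values.
    return {key: bin(bitmap).count("1") for key, bitmap in rolled.items()}

rows = [
    {"city": "NY", "year": 2015, "customer": "ann"},
    {"city": "NY", "year": 2016, "customer": "ann"},
    {"city": "NY", "year": 2016, "customer": "bob"},
    {"city": "LA", "year": 2016, "customer": "bob"},
    {"city": "LA", "year": 2016, "customer": "cat"},
]
cube = build_distinct_cube(rows, ("city", "year"), "customer")
per_city = count_distinct(cube, keep=[0])  # group by city
overall = count_distinct(cube, keep=[])    # no group by
```

Summing the per-cell counts for NY would give 3 (1 in 2015 plus 2 in 2016), while OR-ing the bitmaps gives the correct 2, because "ann" appears in both years.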

Jethro Query Processing Flow

[Flowchart]
• Query arrives
• Query match? Yes: response from results cache, reply
• No: Cube match? Yes: response from cubes, reply
• No: Process query (indexes, columns, MT execution), reply
• Cache results
• Optimal for cube? Generate cube
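The flow above can be sketched as a small router that tries the results cache, then the cubes, then full execution. This is a toy model under stated assumptions: the class `QueryRouter` and its callback parameters are hypothetical, queries are matched by normalized text only, and cube generation runs inline here rather than in the background.

```python
class QueryRouter:
    """Toy router: results cache first, then cubes, then full execution."""

    def __init__(self, execute, cube_lookup, cube_worthy, generate_cube):
        self.execute = execute              # stand-in for index-based execution
        self.cube_lookup = cube_lookup      # returns a result or None
        self.cube_worthy = cube_worthy      # "optimal for cube?" decision
        self.generate_cube = generate_cube  # builds a cube for the query
        self.results_cache = {}

    def run(self, sql):
        key = " ".join(sql.lower().split())  # crude text normalization
        if key in self.results_cache:        # query match?
            return self.results_cache[key], "results cache"
        cube_answer = self.cube_lookup(key)  # cube match?
        if cube_answer is not None:
            return cube_answer, "cube"
        result = self.execute(key)           # process query
        self.results_cache[key] = result     # cache results
        if self.cube_worthy(key):            # optimal for cube?
            self.generate_cube(key)
        return result, "executed"

cubes = {}
router = QueryRouter(
    execute=lambda q: f"rows for: {q}",
    cube_lookup=cubes.get,
    cube_worthy=lambda q: "group by" in q,
    generate_cube=lambda q: cubes.setdefault(q, f"cube rows for: {q}"),
)

first = router.run("SELECT item, sum(sold_price) FROM sales GROUP BY item")
second = router.run("select item, sum(sold_price) from sales group by item")

# Simulate a cache eviction so the same query falls through to the cube.
router.results_cache.clear()
third = router.run("select item, sum(sold_price) from sales group by item")
```

The first run is executed in full, the second is served from the results cache, and the third is served from the cube generated after the first run.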

Jethro: Consistently Fast Queries

[Chart: query repeatability (Low, Mid, Hi) versus filtering (Hi-filter, Mid-filter, No-filter); regions: Results Cache, Index-Based Query Execution, AutoCube (Jethro 2.0)]

DEMO

• TPC-DS data set
• Single table: 1.2 billion rows
• Multi tables: 1.6 billion rows fact
• 2 Jethro nodes (AWS r3.4xl) over EFS
• BI: Tableau
