how auto microcubes work with indexing & caching to deliver a consistently fast business...
About Jethro
• What Does Jethro Do?
  – BI on Big Data acceleration
  – Reporting, dashboards, discovery, ad-hoc
• How It Works?
  – Indexing and caching server
  – Combines columnar SQL DB design with search-indexing technology
• Partnerships
  – BI: Tableau, Qlik
  – Hadoop: Cloudera, MapR, Hortonworks
SQL on Hadoop – Complementary Approaches
SQL-on-Hadoop Solutions
Full-Scan: Read all rows
• Hive / Tez
• Impala
• Presto
• SparkSQL
• Drill
• HAWQ
• IBM/Big SQL
• Actian
• Tajo
• …
Index-Access: Read ONLY needed rows
• JethroData
Comparison:
• Full-Scan: Optimal for predictive & reporting
• Index-Access: Optimal for interactive BI
What Is Jethro for BI Tools?
An indexing & caching server
• BI tool uses live DB access
  – Sends SQL queries via ODBC / JDBC
• Jethro key performance features
  1. Full indexing – every column is indexed
  2. Result cache – every query is cached
  3. Auto Cubes – every repeatable pattern
• Everything stored in Hadoop
  – Cache, aggregations, index & column files, …
• Incrementally updated
  – Every day / hour / min
[Diagram: BI Tool, Live Access, HDFS]
Jethro: Enabling Unlimited Interactive BI for Big Data
[Diagram: Unlimited, Interactive, Big Data; Low Concurrency; Interactive; Slow; MPP speed = more resources; Jethro: Hi Speed, Low resources, Hi Concurrency]
Interactive BI requires both speed and concurrency
Faster
• Indexes
• Hi Performance Execution
• Results cache
• Auto Cubes
Indexes – boosting filtered queries
• Indexes everything – every column, every value
• Filtering (where clause expression) is done against the indexes
• The more you filter, the faster you get the results – execution time depends on the size of the scanned data set
• Resources required per query are orders of magnitude lower, which enables high concurrency
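The idea of resolving a where clause against indexes can be sketched as follows. This is a minimal illustration in Python and not Jethro's implementation: the `build_index` and `filter_rows` helpers and the sample rows are hypothetical, and plain Python sets stand in for whatever index structures the product actually uses.

```python
from collections import defaultdict

# Hypothetical sample rows, for illustration only.
ROWS = [
    {"City": "NY", "Item": "iPhone7", "Year": 2016},
    {"City": "NY", "Item": "Samsung7", "Year": 2016},
    {"City": "NY", "Item": "iPhone6", "Year": 2015},
    {"City": "LA", "Item": "iPhone7", "Year": 2016},
    {"City": "LA", "Item": "Samsung6", "Year": 2015},
]

def build_index(rows):
    """Index every column and every value: (column, value) -> set of row ids."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for column, value in row.items():
            index[(column, value)].add(row_id)
    return index

def filter_rows(index, total_rows, **predicates):
    """Resolve an AND of equality predicates using only the index."""
    matching = set(range(total_rows))
    for column, value in predicates.items():
        matching &= index.get((column, value), set())
    return sorted(matching)

INDEX = build_index(ROWS)
ny_rows = filter_rows(INDEX, len(ROWS), City="NY")
ny_2016_rows = filter_rows(INDEX, len(ROWS), City="NY", Year=2016)
print(ny_rows, ny_2016_rows)
```

Each extra predicate intersects away more row ids, so the set of rows that must actually be read shrinks as the filter gets more selective.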
What to do when indexes are not enough?
• The Challenge: How to provide interactive response time (seconds) for use cases that include wide queries with little or no filtering
• Our Approach: Add CUBES technology, which is complementary to INDEXES
• Jethro rule: Make this absolutely seamless to the user
Traditional OLAP Cubes – The Short Story
Cube:
  Select City, Item, Year, sum(sold_price) from Sales group by City, Item, Year
Queries that can use the cube:
  Select City, Item, sum(sold_price) from Sales where Year=2016 group by City, Item
  Select Item, sum(sold_price) from Sales group by Item

City  Item      Year  sum(sold_price)
NY    iPhone7   2016  $50,000
NY    Samsung7  2016  $40,000
NY    iPhone6   2015  $42,500
LA    iPhone7   2016  $70,000
LA    Samsung6  2015  $35,000
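To see why both follow-up queries can be answered from the cube alone, here is a small Python sketch that rolls up the five cube rows from the slide. The `query_cube` helper is hypothetical and only illustrates the roll-up idea; it is not how any particular OLAP engine is implemented.

```python
from collections import defaultdict

# Cube rows from the slide: (City, Item, Year) -> sum(sold_price)
DIMENSIONS = ("City", "Item", "Year")
CUBE = {
    ("NY", "iPhone7", 2016): 50_000,
    ("NY", "Samsung7", 2016): 40_000,
    ("NY", "iPhone6", 2015): 42_500,
    ("LA", "iPhone7", 2016): 70_000,
    ("LA", "Samsung6", 2015): 35_000,
}

def query_cube(cube, group_by, where=None):
    """Answer a sum() query by filtering and re-aggregating cube cells."""
    where = where or {}
    result = defaultdict(int)
    for key, total in cube.items():
        cell = dict(zip(DIMENSIONS, key))
        if all(cell[dim] == value for dim, value in where.items()):
            result[tuple(cell[dim] for dim in group_by)] += total
    return dict(result)

# Select City, Item, sum(sold_price) ... where Year=2016 group by City, Item
by_city_item_2016 = query_cube(CUBE, ["City", "Item"], where={"Year": 2016})

# Select Item, sum(sold_price) ... group by Item
by_item = query_cube(CUBE, ["Item"])
print(by_city_item_2016)
print(by_item)
```

Both queries touch only the five pre-aggregated cells, never the underlying Sales rows. This works because every column they filter or group on is a cube dimension and sum() can be re-aggregated.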
Traditional OLAP Cubes
• Performance: Fast response time for queries that hit the cube
• Concurrency: Low resource footprint per query, enabling high concurrency
• Use Case: Works great for a static query environment
Not suitable for dynamic environments that support self service and complex dashboards
Traditional OLAP Cubes: Challenges
• Hard to implement: Manually pre-defined, requires specialized tools and expertise
• Resource consuming: Heavy processing on cube creation that can affect global system performance
• Operational overhead: Keeping cubes up to date with source data is time and resource consuming
• Use case limitations: Size and operational limitations make cubes practically impossible to use for many use cases, such as:
  – Large number of dimensions
  – High cardinality dimensions
  – Count distinct aggregators
  – Complex expressions
  – Many different queries
How to have your cake and eat it too
• Auto generated cubes
  – Cubes are automatically generated in the background based on actual user interaction
  – No expertise, no specialized tools, no pre-design
  – Unlimited access to the data
• Micro Cubes
  – Many micro cubes instead of a few gigantic cubes
  – Easily support many different queries
• Incremental
  – Auto cubes are incremental and automatically updated
  – Zero operational overhead
  – Stable performance unaffected by ongoing new data streaming
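One way to picture an incremental cube is that additive aggregates such as sum() can be updated by folding in only the newly loaded rows. The Python sketch below assumes exactly that. The `build_micro_cube` and `apply_increment` helpers and the sample rows are hypothetical, and the slides do not describe how Jethro maintains its cubes internally.

```python
from collections import Counter

def build_micro_cube(rows):
    """Aggregate (city, item, year, sold_price) rows into sum(sold_price) per cell."""
    cube = Counter()
    for city, item, year, sold_price in rows:
        cube[(city, item, year)] += sold_price
    return cube

def apply_increment(cube, new_rows):
    """Fold newly loaded rows into an existing cube without rescanning old rows."""
    cube.update(build_micro_cube(new_rows))  # Counter.update adds values per key
    return cube

# Hypothetical rows, for illustration only.
initial_rows = [("NY", "iPhone7", 2016, 700), ("LA", "iPhone7", 2016, 650)]
new_rows = [("NY", "iPhone7", 2016, 720), ("LA", "Samsung7", 2016, 600)]

cube = apply_increment(build_micro_cube(initial_rows), new_rows)
rebuilt = build_micro_cube(initial_rows + new_rows)
print(cube == rebuilt)
```

The incrementally updated cube matches a full rebuild, while the update step only reads the new rows.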
How to have your cake and eat it too, cont.
• Complex queries normalization
  – Rewrite complex queries to reuse simplified common query blocks
  – Increase cube reusability
• Optimized for count distinct
  – Handle count distinct using value bitmaps
  – Handle count distinct without hitting cube size limitations
• Complementary to indexes
  – Use indexes for large numbers of filters or high cardinality dimensions
  – Maintain stable interactive performance by utilizing complementary indexes and cubes
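Count distinct is awkward for cubes because per-cell distinct counts cannot simply be added up when rolling up. The Python sketch below shows the general value-bitmap idea: each cell keeps a bitmap of the values it has seen, and a roll-up ORs the bitmaps before counting. Python integers stand in for bitmaps, and the helper names and sample rows are hypothetical; the slides do not specify Jethro's actual bitmap format.

```python
# Dictionary-encode each distinct value to a bit position (hypothetical encoding).
VALUE_IDS = {}

def bit_for(value):
    position = VALUE_IDS.setdefault(value, len(VALUE_IDS))
    return 1 << position

def build_cube(rows):
    """Cube cell (city, year) -> bitmap of customers seen in that cell."""
    cube = {}
    for city, year, customer in rows:
        cube[(city, year)] = cube.get((city, year), 0) | bit_for(customer)
    return cube

def count_distinct(cube, city=None):
    """Roll up over Year by OR-ing bitmaps, then count the set bits."""
    merged = 0
    for (cell_city, _year), bitmap in cube.items():
        if city is None or cell_city == city:
            merged |= bitmap
    return bin(merged).count("1")

# Hypothetical rows, for illustration only.
ROWS = [
    ("NY", 2015, "alice"),
    ("NY", 2016, "alice"),
    ("NY", 2016, "bob"),
    ("LA", 2016, "bob"),
    ("LA", 2016, "carol"),
]
cube = build_cube(ROWS)
ny_customers = count_distinct(cube, city="NY")
all_customers = count_distinct(cube)
print(ny_customers, all_customers)
```

Adding the per-cell counts for NY would give 3 (1 for 2015 plus 2 for 2016), which double counts "alice". OR-ing the bitmaps first gives the correct 2.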
Jethro Query Processing Flow
[Flow diagram]
1. Query arrives
2. Query Match? Yes: response from Results Cache, reply
3. No: Cube Match? Yes: response from Cubes, reply
4. No: Process Query (Indexes, Columns, MT execution), reply
5. Cache Results
6. Optimal for Cube? Generate Cube
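The processing flow can be sketched as a small dispatcher. This is a toy model in Python under stated assumptions, not Jethro's engine: the `QueryRouter` class, the `pattern` argument, and the repeat-count threshold used for the "Optimal for Cube?" decision are all hypothetical, and query execution and cube building are injected as stand-in functions.

```python
class QueryRouter:
    """Toy dispatcher: results cache first, then cubes, then full execution."""

    def __init__(self, execute, build_cube, cube_threshold=2):
        self.execute = execute            # stand-in for index/column execution
        self.build_cube = build_cube      # returns a callable that answers a pattern
        self.cube_threshold = cube_threshold  # assumed "optimal for cube" rule
        self.results_cache = {}
        self.cubes = {}
        self.pattern_counts = {}

    def run(self, sql, pattern):
        # Query Match? -> response from results cache
        if sql in self.results_cache:
            return self.results_cache[sql], "results cache"
        # Cube Match? -> response from cubes
        if pattern in self.cubes:
            return self.cubes[pattern](sql), "cube"
        # Otherwise process the query, then cache the result
        result = self.execute(sql)
        self.results_cache[sql] = result
        # Optimal for Cube? Here: the pattern has repeated often enough
        seen = self.pattern_counts.get(pattern, 0) + 1
        self.pattern_counts[pattern] = seen
        if seen >= self.cube_threshold:
            self.cubes[pattern] = self.build_cube(pattern)
        return result, "full execution"

def fake_execute(sql):
    return f"rows for: {sql}"

def fake_build_cube(pattern):
    return lambda sql: f"cube rows for: {sql}"

router = QueryRouter(fake_execute, fake_build_cube)
template = "select Item, sum(sold_price) from Sales where Year={} group by Item"
q2015, q2016, q2014 = (template.format(y) for y in (2015, 2016, 2014))

sources = [
    router.run(q2015, "item_by_year")[1],
    router.run(q2015, "item_by_year")[1],
    router.run(q2016, "item_by_year")[1],
    router.run(q2014, "item_by_year")[1],
]
print(sources)
```

In this sketch the cube is built synchronously for simplicity, whereas the slides describe cubes being generated in the background.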
Jethro: Consistently Fast Queries
[Chart: repeatability (Low / Mid / Hi) and filtering (Hi-filter / Mid-filter / No-filter); labels: Results Cache, Indexed Based Query Execution, AutoCube, Jethro 2.0]
DEMO
• TPC-DS data set
• Single table: 1.2 billion rows
• Multi tables: 1.6 billion rows fact
• 2 Jethro nodes (AWS r3.4xl) over EFS
• BI: Tableau