presto
Content
• Background
• Architecture
• Key points for low query latency
• What we do
• Reference
Thursday, 17 April, 14
Background
• 300+ PB of data stored in Hadoop/HDFS-based clusters
• Running more queries and getting results faster improves the productivity of analysts, data scientists, and engineers
• MapReduce and Hive are designed for large-scale, reliable computation rather than low query latency
• External projects were either too nascent or did not meet our requirements for flexibility and scale
Architecture
Key points for low latency
• In memory parallel computing
• Pipeline
• Data local computation
• Data cache
• Dynamic compilation of part of the plan to bytecode
• Careful use of memory and data structures
• BlinkDB-like approximate queries
• Traditional SQL optimization
• GC control
Compile flow
In memory parallel computing
select c1.rank, count(*) from dim.city c1 join dim.city c2 on c1.id = c2.id
where c1.id > 10 group by c1.rank limit 10;
In memory parallel computing
• PlanDistribution=Source
– InputSplit[] splits = inputFormat.getSplits(jobConf, 0);
• PlanDistribution=Hash
– Hash shuffle
– Fixed workers
– query.initial-hash-partitions
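As a rough illustration of hash distribution, the sketch below assigns rows to a fixed number of partitions by hashing the key, so that equal keys always reach the same worker. The class and method names are hypothetical and are not Presto's own; the partition count stands in for the value configured by query.initial-hash-partitions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Presto's implementation.
final class HashPartitionSketch {
    // Maps a key to a partition index in [0, partitionCount).
    static int partitionFor(Object key, int partitionCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Splits keys into partitionCount buckets; equal keys always share a bucket.
    static List<List<Long>> partition(List<Long> keys, int partitionCount) {
        List<List<Long>> buckets = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Long key : keys) {
            buckets.get(partitionFor(key, partitionCount)).add(key);
        }
        return buckets;
    }
}
```

Because the number of partitions is fixed up front, a join or group-by stage can rely on every row for a given key arriving at one worker.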
Pipeline - TaskExecutor
• SplitRunner thread count: task.shard.max-threads = availableProcessors() * 4
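The default thread count above can be pictured as a fixed pool sized from the processor count. This is a minimal sketch under that assumption; the class name is hypothetical and the real TaskExecutor does more than a plain thread pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration only; not Presto's TaskExecutor.
final class SplitRunnerPoolSketch {
    // Mirrors the slide's default: four runner threads per available processor.
    static int defaultMaxThreads() {
        return Runtime.getRuntime().availableProcessors() * 4;
    }

    // Creates a fixed-size pool with that many threads.
    static ExecutorService newPool() {
        return Executors.newFixedThreadPool(defaultMaxThreads());
    }
}
```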
Pipeline - Operator process flow
• Page (max page size: 1 MB, max rows: 16 * 1024)
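To make the row limit concrete, the sketch below cuts a column of values into pages of at most 16 * 1024 rows. It is a simplified, hypothetical model: it ignores the 1 MB size limit and represents a page as a plain int array rather than Presto's Page type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; models the row limit, not the byte limit.
final class PageSketch {
    static final int MAX_ROWS_PER_PAGE = 16 * 1024;

    // Splits rows into consecutive pages of at most MAX_ROWS_PER_PAGE rows each.
    static List<int[]> toPages(int[] rows) {
        List<int[]> pages = new ArrayList<>();
        for (int start = 0; start < rows.length; start += MAX_ROWS_PER_PAGE) {
            int end = Math.min(start + MAX_ROWS_PER_PAGE, rows.length);
            pages.add(Arrays.copyOfRange(rows, start, end));
        }
        return pages;
    }
}
```

Operators then pass these bounded batches along the pipeline instead of handling one row at a time.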
Pipeline - ExchangeOperator
Data local computation
• Select acceptable nodes (at least 10 nodes by default)
– Nodes that have the same address as the split's data
– If not enough, add nodes in the same rack
– If not enough, randomly select nodes in other racks
• Select the node with the smallest number of assignments (pending tasks)
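The two steps above can be sketched as follows. All names here (Node, candidates, pick) are hypothetical and chosen for illustration; the sketch assumes each node reports a host and a rack and that pending assignments are tracked in a map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration only; not Presto's node selector.
final class NodeSelectorSketch {
    static final class Node {
        final String host;
        final String rack;
        Node(String host, String rack) { this.host = host; this.rack = rack; }
    }

    // Step 1: same host first, then same rack, then random nodes from other racks.
    static List<Node> candidates(List<Node> all, String splitHost, String splitRack,
                                 int minCandidates, Random random) {
        Set<Node> chosen = new LinkedHashSet<>();
        for (Node n : all) {
            if (n.host.equals(splitHost)) chosen.add(n);
        }
        for (Node n : all) {
            if (chosen.size() >= minCandidates) break;
            if (n.rack.equals(splitRack)) chosen.add(n);
        }
        if (chosen.size() < minCandidates) {
            List<Node> rest = new ArrayList<>(all);
            rest.removeAll(chosen);            // only other-rack nodes remain here
            Collections.shuffle(rest, random);
            for (Node n : rest) {
                if (chosen.size() >= minCandidates) break;
                chosen.add(n);
            }
        }
        return new ArrayList<>(chosen);
    }

    // Step 2: pick the candidate with the fewest pending assignments.
    static Node pick(List<Node> candidates, Map<Node, Integer> pending) {
        Node best = null;
        int bestCount = Integer.MAX_VALUE;
        for (Node n : candidates) {
            int count = pending.getOrDefault(n, 0);
            if (count < bestCount) {
                best = n;
                bestCount = count;
            }
        }
        return best;
    }
}
```

Widening the candidate set beyond the data-local hosts trades some locality for better load balance when the local nodes are already busy.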
Data cache
• Google Guava LoadingCache
• Cached objects
– Hive metadata: database, table, partition
– Bytecode classes: FilterAndProjectOperatorFactoryFactory, ScanFilterAndProjectOperatorFactoryFactory
– Functions
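The load-on-miss behaviour of such a cache can be sketched with the JDK alone. This is a stand-in, not Guava's LoadingCache: it uses ConcurrentHashMap.computeIfAbsent and has no expiry or size limit, and the class name and load counter are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical JDK-only stand-in for a loading cache; no eviction or expiry.
final class LoadingCacheSketch<K, V> {
    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    LoadingCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns the cached value, invoking the loader only on the first request for a key.
    V get(K key) {
        return entries.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    // Number of times the loader actually ran.
    int loadCount() {
        return loads.get();
    }
}
```

The same pattern applies whether the expensive load is a metadata lookup or the generation of a compiled class: repeated requests for one key pay the cost once.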
Dynamic compilation of the plan to bytecode
• Presto dynamically compiles FilterAndProjectOperator and ScanFilterAndProjectOperator to bytecode, which lets the JIT optimize it and generate native machine code
• How much does it speed things up?
• ScanFilterAndProjectOperator
Careful use of memory and data structures
• Slice
– Unsafe#copyMemory
– 20% to 30% speedup in ORCFile write performance
• ThreadLocalRandom
– Thread-local seed instead of an AtomicLong
– 100% speedup
• ListenableFuture
– Async callback
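The ThreadLocalRandom point can be illustrated briefly. java.util.Random keeps its seed in a single AtomicLong that all callers update, while ThreadLocalRandom gives each thread its own generator, so concurrent callers do not contend. The class and method below are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of contention-free random numbers.
final class RandomSketch {
    // Picks a bucket in [0, bucketCount) using the calling thread's own generator.
    static int nextBucket(int bucketCount) {
        return ThreadLocalRandom.current().nextInt(bucketCount);
    }
}
```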
Approximate queries
• approx_avg, approx_distinct, approx_percentile
• +50% speedup
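To show how an approximate aggregate can trade a little accuracy for bounded memory, here is a k-minimum-values estimator for distinct counts. It is a deliberately simple substitute chosen for illustration and is not claimed to be the algorithm behind approx_distinct; the class name and hash mixer are assumptions.

```java
import java.util.TreeSet;

// Hypothetical k-minimum-values sketch; not Presto's approx_distinct.
final class ApproxDistinctSketch {
    private final int k;
    private final TreeSet<Long> smallest = new TreeSet<>();

    ApproxDistinctSketch(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        this.k = k;
    }

    // 64-bit mixer (SplitMix64 finalizer), shifted to a non-negative long.
    private static long hash(long x) {
        long z = x + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return (z ^ (z >>> 31)) >>> 1;
    }

    void add(long value) {
        long h = hash(value);
        if (smallest.size() < k || h < smallest.last()) {
            smallest.add(h);              // a repeated value hashes identically and is ignored
            if (smallest.size() > k) {
                smallest.pollLast();      // keep only the k smallest hashes
            }
        }
    }

    double estimate() {
        if (smallest.size() < k) {
            return smallest.size();       // fewer than k hashes seen: exact, barring collisions
        }
        // The k-th smallest hash, as a fraction of the hash range, estimates k / distinctCount.
        double kthFraction = smallest.last() / (double) Long.MAX_VALUE;
        return (k - 1) / kthFraction;
    }
}
```

Memory stays at k hashes regardless of input size, and the relative error shrinks roughly as 1/sqrt(k).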
Traditional SQL optimization
• ImplementSampleAsFilter
• LimitPushDown
• MaterializeSamplePullUp
• MergeProjections
• PredicatePushDown
• PruneRedundantProjections
• PruneUnreferencedOutputs
• SetFlatteningOptimizer
• SimplifyExpressions
• UnaliasSymbolReferences
GC control
• A JDK 1.7 bug
• When the code cache fills up, there is a chance that the JIT stops compiling bytecode to native code.
• By forcing classes to unload from the perm gen, we let the code cache evictor make room before the cache fills up.
• System.gc()
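A minimal sketch of this workaround, assuming the code cache is visible through the standard memory pool MXBeans: measure how full it is and call System.gc() once usage passes a threshold. The class name, the pool-name matching, and the threshold are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical illustration only; threshold and pool matching are assumptions.
final class CodeCacheGcSketch {
    // Returns the fraction of the JIT code cache in use, or -1 if it cannot be determined.
    static double codeCacheUsage() {
        long used = 0;
        long max = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // "Code Cache" on older JVMs; segmented "CodeHeap ..." pools on newer ones.
            if (name.contains("Code Cache") || name.contains("CodeHeap")) {
                MemoryUsage usage = pool.getUsage();
                if (usage == null || usage.getMax() <= 0) {
                    continue;             // pool invalid or maximum undefined
                }
                used += usage.getUsed();
                max += usage.getMax();
            }
        }
        return max > 0 ? (double) used / max : -1;
    }

    // Requests a GC when usage reaches the threshold; returns whether it did so.
    static boolean collectIfAbove(double threshold) {
        if (codeCacheUsage() >= threshold) {
            System.gc();
            return true;
        }
        return false;
    }
}
```

A background thread could call collectIfAbove periodically so that the collection happens before the cache is completely full.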
What we do
• Support Kerberos authentication
• Implicit type coercion
• Support reading LZO-compressed tables
• Implement useful functions
• Fix a planning issue when using DISTINCT aggregations in a HAVING clause
• https://github.com/MTDATA/presto/commits/mt-0.60
Reference
• http://prestodb.io/
• https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
• http://www.slideshare.net/zhusx/presto-overview?from_search=1
• http://www.slideshare.net/frsyuki/hadoop-source-code-reading-15-in-japan-presto
Thanks