Presto

[email protected]

Thursday, 17 April, 14


Content

• Background

• Architecture

• Key points for low query latency

• What we do

• Reference


Background

• 300+ PB of data stored in Hadoop/HDFS-based clusters

• Running more queries and getting results faster improves the productivity of analysts, data scientists, and engineers

• MapReduce and Hive are designed for large-scale, reliable computation

• External projects were either too nascent or did not meet our requirements for flexibility and scale


Architecture


Key points for low latency

• In-memory parallel computing

• Pipeline

• Data-local computation

• Data cache

• Dynamic compilation of part of the plan to bytecode

• Careful use of memory and data structures

• BlinkDB-like approximate queries

• Traditional SQL optimization

• GC control

Compile flow


In-memory parallel computing

select c1.rank, count(*)
from dim.city c1 join dim.city c2 on c1.id = c2.id
where c1.id > 10
group by c1.rank
limit 10;




In-memory parallel computing

• PlanDistribution=Source
  – InputSplit[] splits = inputFormat.getSplits(jobConf, 0);

• PlanDistribution=Hash
  – Hash Shuffle
  – Fixed Workers
  – query.initial-hash-partitions
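To make the Hash distribution above concrete, here is a minimal sketch of routing rows to a fixed number of partitions by hashing the grouping key. The class and method names are hypothetical and this is not Presto's shuffle code; the partition count stands in for whatever query.initial-hash-partitions is set to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: route keys to a fixed number of partitions by hash.
final class HashPartitionSketch {
    // Returns the partition index in [0, partitionCount) for a grouping key.
    static int partitionFor(Object key, int partitionCount) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Distributes keys into partitionCount buckets; equal keys always share a bucket,
    // which is what lets each worker aggregate its groups independently.
    static List<List<Object>> shuffle(List<?> keys, int partitionCount) {
        List<List<Object>> buckets = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Object key : keys) {
            buckets.get(partitionFor(key, partitionCount)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Six values of a grouping column spread over 4 assumed partitions.
        System.out.println(shuffle(Arrays.asList(1, 2, 3, 1, 2, 9), 4));
    }
}
```

Because the partition depends only on the key, all rows of one group land on the same worker regardless of which source task produced them.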


SplitRunner thread count: task.shard.max-threads = availableProcessors() * 4

Pipeline - TaskExecutor


Pipeline - Operator process flow

Page (max page size: 1 MB, max rows: 16 * 1024)


Pipeline - ExchangeOperator


Data-local computation

• Select acceptable nodes (at least 10 nodes by default)
  – Nodes that have the same address as the split
  – If not enough, add nodes in the same rack
  – If not enough, randomly select nodes in other racks

• Select the node with the smallest number of assignments (pending tasks)
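The two steps above can be sketched roughly as follows. This is a simplified illustration, not Presto's scheduler: the Node class, the method names, and the choice to stop adding rack-local nodes once the minimum is reached are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of locality-aware node selection for one split.
final class NodeSelectionSketch {
    static final class Node {
        final String address;
        final String rack;

        Node(String address, String rack) {
            this.address = address;
            this.rack = rack;
        }
    }

    // Step 1: same address first, then same rack, then random nodes from other racks.
    static List<Node> selectCandidates(List<Node> all, String splitAddress, String splitRack,
                                       int minCandidates, Random random) {
        List<Node> chosen = new ArrayList<>();
        for (Node node : all) {
            if (node.address.equals(splitAddress)) {
                chosen.add(node);
            }
        }
        for (Node node : all) {
            if (chosen.size() >= minCandidates) {
                break;
            }
            if (node.rack.equals(splitRack) && !chosen.contains(node)) {
                chosen.add(node);
            }
        }
        if (chosen.size() < minCandidates) {
            List<Node> rest = new ArrayList<>(all);
            rest.removeAll(chosen);
            Collections.shuffle(rest, random);
            for (Node node : rest) {
                if (chosen.size() >= minCandidates) {
                    break;
                }
                chosen.add(node);
            }
        }
        return chosen;
    }

    // Step 2: among the candidates, pick the node with the fewest pending tasks.
    static Node pickLeastLoaded(List<Node> candidates, Map<Node, Integer> pendingTasks) {
        Node best = null;
        int bestCount = Integer.MAX_VALUE;
        for (Node node : candidates) {
            int count = pendingTasks.getOrDefault(node, 0);
            if (count < bestCount) {
                best = node;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Node a = new Node("host1", "rack1");
        Node b = new Node("host2", "rack1");
        Node c = new Node("host3", "rack2");
        List<Node> candidates = selectCandidates(Arrays.asList(a, b, c), "host1", "rack1", 2, new Random());
        Map<Node, Integer> pending = new HashMap<>();
        pending.put(a, 4);
        pending.put(b, 1);
        System.out.println(pickLeastLoaded(candidates, pending).address);
    }
}
```

Widening the candidate set before picking the least-loaded node trades some locality for balance: a busy data-local node does not become a bottleneck.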


Data cache

• Google Guava LoadingCache

• Cached objects
  – Hive metadata: database, table, partition
  – Bytecode classes: FilterAndProjectOperatorFactoryFactory, ScanFilterAndProjectOperatorFactoryFactory
  – Functions
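The load-on-first-access behaviour that a loading cache provides can be sketched with the JDK alone. The class below is a stand-in written for illustration, not Guava's LoadingCache: it has no eviction, expiry, or refresh, and all names in it are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical JDK-only stand-in for a loading cache: values are computed once per key.
final class LoadingCacheSketch<K, V> {
    private final ConcurrentMap<K, V> values = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    LoadingCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns the cached value, invoking the loader only if the key is absent.
    V get(K key) {
        return values.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    // Number of times the loader actually ran.
    int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) {
        // The loader here fakes an expensive metadata lookup for a table name.
        LoadingCacheSketch<String, String> cache =
                new LoadingCacheSketch<>(table -> "metadata for " + table);
        cache.get("dim.city");
        cache.get("dim.city");
        System.out.println(cache.loadCount()); // prints 1
    }
}
```

The same pattern applies to compiled classes: keying the cache on the expression being compiled means a repeated query shape skips code generation entirely.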


Dynamic compilation of the plan to bytecode

• Presto dynamically compiles FilterAndProjectOperator and ScanFilterAndProjectOperator to bytecode, which lets the JIT optimize it and generate native machine code

• How much does it speed up?

• ScanFilterAndProjectOperator


Careful use of memory and data structures

• Slice
  – Unsafe#copyMemory
  – 20% to 30% speed-up in ORCFile write performance

• ThreadLocalRandom
  – ThreadLocal seed instead of AtomicLong
  – 100% speed-up

• ListenableFuture
  – Async callback
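The ThreadLocalRandom point can be illustrated with a short sketch. java.util.Random keeps its seed in a single AtomicLong that every calling thread updates, while ThreadLocalRandom.current() gives each thread its own generator. The class and method names below are hypothetical, and the sketch shows only the usage pattern, not a benchmark.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: several threads draw random numbers without sharing a seed.
final class ThreadLocalRandomSketch {
    // Sums `draws` random ints in [0, bound) using the calling thread's own generator.
    static long sumOfDraws(int draws, int bound) {
        long sum = 0;
        for (int i = 0; i < draws; i++) {
            sum += ThreadLocalRandom.current().nextInt(bound);
        }
        return sum;
    }

    // Runs sumOfDraws on several threads and adds up the partial sums.
    static long parallelSum(int threads, int draws, int bound) {
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int index = t;
            workers[t] = new Thread(() -> partial[index] = sumOfDraws(draws, bound));
            workers[t].start();
        }
        long total = 0;
        for (int t = 0; t < threads; t++) {
            try {
                // join() makes the worker's write to partial[t] visible here.
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(4, 100_000, 10));
    }
}
```

With a shared java.util.Random in place of ThreadLocalRandom.current(), the four workers would contend on one seed for every draw.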


Approximate queries

• approx_avg, approx_distinct, approx_percentile

• +50% speed-up


Traditional SQL optimization

• ImplementSampleAsFilter

• LimitPushDown

• MaterializeSamplePullUp

• MergeProjections

• PredicatePushDown

• PruneRedundantProjections

• PruneUnreferencedOutputs

• SetFlatteningOptimizer

• SimplifyExpressions

• UnaliasSymbolReferences
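As an illustration of what a rule in this list does, the sketch below shows one rewrite that a limit push-down can perform: moving a Limit below a Project. It uses a toy plan model with hypothetical names and is not Presto's optimizer code.

```java
// Hypothetical toy plan model: a chain of Scan, Project, and Limit nodes.
final class LimitPushDownSketch {
    static final class PlanNode {
        final String kind;      // "Scan", "Project", or "Limit"
        final long count;       // row limit; only meaningful for "Limit"
        final PlanNode source;  // null for a Scan

        PlanNode(String kind, long count, PlanNode source) {
            this.kind = kind;
            this.count = count;
            this.source = source;
        }

        @Override
        public String toString() {
            String self = kind.equals("Limit") ? "Limit(" + count + ")" : kind;
            return source == null ? self : self + " -> " + source;
        }
    }

    // Rewrites Limit -> Project -> X into Project -> Limit -> X.
    // A projection emits exactly one row per input row, so limiting first
    // returns the same rows while projecting fewer of them.
    static PlanNode pushLimitThroughProject(PlanNode node) {
        if (node == null) {
            return null;
        }
        if (node.kind.equals("Limit") && node.source != null && node.source.kind.equals("Project")) {
            PlanNode project = node.source;
            PlanNode pushed = pushLimitThroughProject(new PlanNode("Limit", node.count, project.source));
            return new PlanNode("Project", 0, pushed);
        }
        return new PlanNode(node.kind, node.count, pushLimitThroughProject(node.source));
    }

    public static void main(String[] args) {
        PlanNode plan = new PlanNode("Limit", 10,
                new PlanNode("Project", 0, new PlanNode("Scan", 0, null)));
        System.out.println(pushLimitThroughProject(plan)); // prints Project -> Limit(10) -> Scan
    }
}
```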


GC control

• A JDK 1.7 bug

• When the code cache fills up, there is a chance that the JIT stops compiling bytecode to native code.

• By forcing classes to unload from the perm gen, we let the code cache evictor make room before the cache fills up.

• System.gc()
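One way to apply the workaround above is to watch code cache usage and call System.gc() once it passes a threshold. The sketch below is a rough illustration under assumptions: the class name, method names, and threshold are hypothetical, and the pool names checked ("Code Cache", "CodeHeap") depend on the JVM version.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical sketch: trigger a GC when the JIT code cache is getting full.
final class CodeCacheGcSketch {
    // Returns code cache usage as a percentage of its maximum size,
    // or -1 if no code cache pool with a known maximum is found.
    static double codeCacheUsagePercent() {
        long used = 0;
        long max = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // A single "Code Cache" pool on older JVMs; segmented "CodeHeap ..." pools on newer ones.
            if (name.contains("Code Cache") || name.contains("CodeHeap")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null && usage.getMax() > 0) {
                    used += usage.getUsed();
                    max += usage.getMax();
                }
            }
        }
        return max == 0 ? -1 : 100.0 * used / max;
    }

    // Decision rule, kept separate so it can be checked without touching the JVM.
    static boolean shouldCollect(double usagePercent, double thresholdPercent) {
        return usagePercent >= thresholdPercent;
    }

    // Requests a full GC if usage is known and at or above the threshold.
    static boolean collectIfNeeded(double thresholdPercent) {
        double usage = codeCacheUsagePercent();
        if (usage >= 0 && shouldCollect(usage, thresholdPercent)) {
            System.gc();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 40 is an arbitrary threshold chosen for the example.
        System.out.println("usage %: " + codeCacheUsagePercent());
        System.out.println("collected: " + collectIfNeeded(40.0));
    }
}
```

In a server, a check like collectIfNeeded would run periodically on a background thread rather than once at startup.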


What we do

• Support Kerberos authentication

• Implicit type coercion

• Support reading LZO-compressed tables

• Implement useful functions

• Fix a planning issue when using DISTINCT aggregations in the HAVING clause

• https://github.com/MTDATA/presto/commits/mt-0.60






Reference

• http://prestodb.io/

• https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920

• http://www.slideshare.net/zhusx/presto-overview?from_search=1

• http://www.slideshare.net/frsyuki/hadoop-source-code-reading-15-in-japan-presto






Thanks

