From Java Stream to Java DataFrame

2016.08 From Java Stream to Java DataFrame Popcorny (陸陸陸)

Post on 09-Jan-2017


2016.08

From Java Stream to Java DataFrame

Popcorny

Outline
- Java Stream
- DataFrame
- Poppy
- Demo

TenMax
- event, raw log
- raw log, Aggregated Data
- (dimensions), (metrics)
- OLAP

[Diagram: Raw Log; Aggregated Data (Cube); Batch aggregate; Ingest; Interactive Query]

[Diagram: RDBMS; RDBMS (Raw Log); RDBMS (Cube); Batch aggregate; Ingest; Interactive Query]

RDBMS
- RDBMS
- Log Ingestion
- DFS: Append-Only
- Cassandra/HBase: Insert, Row-based
- update, delete, partition scan

[Diagram: DFS or Cassandra; RDBMS; Batch aggregate; Ingest; Interactive Query]

Aggregation

Aggregation: Solution
- Computation engine: Hadoop MapReduce, Hive, Spark SQL, Impala
- Cluster
- Heavy weight
- Dependency (driver)
- HDFS-compatible data source
- Job
- UDF / UDAF ..

Solution?

- 1G ~ 1T uncompressed data
- record = 1K, 1T = 10
- CPU 1 core
- 134.56
- Scale up
- 16
- I/O
- Network throughput
- solution, overhead
- solution, debug

Aggregation

Java 8
- Lambda
- Stream
- Optional
- CompletableFuture

Java Stream
- Functional Reactive Programming (FRP)
- Pipeline style: Input, transformation, Output
- Streaming
- Memory footprint

forEach(), map(), filter(), flatMap(), peek()
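The operations named on this slide can be chained into one pipeline. The following is a minimal sketch using only the standard `java.util.stream` API; the class name, method name, and sample strings are hypothetical and are not taken from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StreamPipelineSketch {

    // Splits each line into words, records every word it sees, keeps the
    // words longer than three characters, and upper-cases them.
    // All names here are hypothetical illustrations.
    static List<String> longWordsUpperCased(List<String> lines, List<String> seen) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" "))) // one line -> many words
                .peek(seen::add)                                  // observe each word, leave it unchanged
                .filter(word -> word.length() > 3)                // keep only the longer words
                .map(String::toUpperCase)                         // transform each remaining word
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        // forEach() is the terminal step that consumes the result.
        longWordsUpperCased(Arrays.asList("raw log event", "click impression"), seen)
                .forEach(System.out::println);
    }
}
```

Elements flow through the chain one at a time, which is why such a pipeline can keep its memory footprint small when the source is large.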

Aggregation?

SQL

From RawLog

Where (DayRange)

GroupBy

[Diagram, repeated once per group: key hour=?, dim1=?, dim2=?; values val1, val2, val3; aggregates sum(), sum(), sum()]

Java Stream Aggregation

From

Where

GroupBy

Aggregation: count(), sum()

Mapper

Reducer
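The From / Where / GroupBy / Aggregation steps on this slide map onto a plain stream pipeline roughly as sketched below. The `RawLog` class, its fields, and the sample data are assumptions made for illustration; the talk's real record layout is not shown in the transcript. Only standard `java.util.stream.Collectors` calls are used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.stream.Collectors;

class StreamAggregationSketch {

    // Hypothetical raw-log record: a day, two grouping keys, and one metric.
    static final class RawLog {
        final int day;
        final int hour;
        final String dim1;
        final long val1;

        RawLog(int day, int hour, String dim1, long val1) {
            this.day = day;
            this.hour = hour;
            this.dim1 = dim1;
            this.val1 = val1;
        }
    }

    // Hypothetical sample data used by main() below.
    static List<RawLog> sampleLogs() {
        return Arrays.asList(
                new RawLog(1, 9, "a", 10),
                new RawLog(1, 9, "a", 5),
                new RawLog(1, 10, "b", 7),
                new RawLog(5, 9, "a", 100));
    }

    static Map<List<Object>, LongSummaryStatistics> aggregate(
            List<RawLog> logs, int fromDay, int toDay) {
        return logs.stream()                                                 // FROM raw log
                .filter(r -> r.day >= fromDay && r.day <= toDay)             // WHERE day range
                .collect(Collectors.groupingBy(
                        (RawLog r) -> Arrays.<Object>asList(r.hour, r.dim1), // GROUP BY hour, dim1
                        Collectors.summarizingLong((RawLog r) -> r.val1)));  // count() and sum(val1)
    }

    public static void main(String[] args) {
        aggregate(sampleLogs(), 1, 2).forEach((key, stats) ->
                System.out.println(key + " count=" + stats.getCount() + " sum=" + stats.getSum()));
    }
}
```

The grouping key is an untyped `List<Object>` and each additional metric needs its own collector, which gives a feel for why a column-based, typed abstraction on top of the stream is attractive.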

Java Stream
- java.util.stream.Collector
- metrics aggregation
- Column Based
- Type

Poppy: http://tenmax.github.io/poppy/

Introduction to Poppy
- Poppy: Java DataFrame library
- Data Frame?
- Column based (Schema)
- RDBMS: select, from, where, group by, aggregation, order by
- Poppy
- Stream based
- partition
- User Defined Function, User Defined Aggregation Function
- Lightweight
- Schema
- Java Stream


Poppy

from

where

group by

aggregation

That's all!!

Poppy Pipeline
- Input
- Operations
- Output

[Diagram: Input; Operation; Operation; Operation; Operation; Output]

Input
- By Iterable: DataFrame.from(Class clazz, java.util.Iterable... iterables)
- By DataSource: DataFrame.from(io.tenmax.DataSource dataSource)
- DataSource

Output
- iterator(), forEach()
- toList(), toMap(), print()
- DataFrame.to(DataSink dataSink)
- DataSink

Operations
- project()
- filter()
- Aggregation()
- groupby()
- Sort()
- distinct()
- peek()
- cache()

Projection (Select)

Filter (Where, Having)

Aggregation (Count, Sum, Avg, ...)

Sort (Order by)

Distinct

Demo

User-Defined Function: java.util.function.Function
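Since a user-defined function is an ordinary `java.util.function.Function`, it can be written as a lambda and composed with other functions. The sketch below shows only the standard-library side; how Poppy accepts such a function is not shown in the transcript, and the function names here are hypothetical.

```java
import java.util.function.Function;

class UdfSketch {

    // Hypothetical UDF: derives the hour of day (0-23, UTC) from epoch seconds.
    static final Function<Long, Integer> HOUR_OF_DAY =
            epochSeconds -> (int) Math.floorMod(Math.floorDiv(epochSeconds, 3600L), 24L);

    // Hypothetical UDF: buckets an hour of day into "AM" or "PM".
    static final Function<Integer, String> DAY_PART =
            hour -> hour < 12 ? "AM" : "PM";

    // Functions compose, so one derived column can feed the next.
    static final Function<Long, String> DAY_PART_OF_TIMESTAMP =
            HOUR_OF_DAY.andThen(DAY_PART);

    public static void main(String[] args) {
        System.out.println(DAY_PART_OF_TIMESTAMP.apply(13L * 3600L));
    }
}
```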

User-Defined Aggregation Function: java.util.stream.Collector
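A user-defined aggregation function can be expressed as a `java.util.stream.Collector`. The sketch below builds one with the standard `Collector.of` factory; the `Row` class and the click-through-rate metric are assumptions chosen for illustration and do not come from the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;

class UdafSketch {

    // Hypothetical input row with two metrics.
    static final class Row {
        final long impressions;
        final long clicks;

        Row(long impressions, long clicks) {
            this.impressions = impressions;
            this.clicks = clicks;
        }
    }

    // Hypothetical UDAF: click-through rate = sum(clicks) / sum(impressions).
    // The accumulator is a two-element array: [clicks, impressions].
    static Collector<Row, long[], Double> clickThroughRate() {
        return Collector.of(
                () -> new long[2],                               // supplier: empty accumulator
                (acc, row) -> {                                  // accumulator: fold one row in
                    acc[0] += row.clicks;
                    acc[1] += row.impressions;
                },
                (left, right) -> {                               // combiner: merge partial results
                    left[0] += right[0];
                    left[1] += right[1];
                    return left;
                },
                acc -> acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1]); // finisher: final value
    }

    static double ctrOf(List<Row> rows) {
        return rows.stream().collect(clickThroughRate());
    }

    public static void main(String[] args) {
        System.out.println(ctrOf(Arrays.asList(new Row(100, 5), new Row(300, 15))));
    }
}
```

The combiner is what lets the same aggregation run over several partitions and merge the partial results afterwards.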

Partition
- DataSource
- Partition
- dataFrame.parallel(n)
- thread

Execution Context
- Execution Context: thread pool
- n threads, m partitions
- m >= n
- thread, partition, partition
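The idea of running m partitions on a pool of n threads (with m >= n) can be illustrated with the standard `java.util.concurrent` API. This is a sketch of the general technique only, not Poppy's actual execution context; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PartitionSketch {

    // Sums m partitions on a fixed pool of nThreads threads.
    // When m > nThreads, a thread picks up the next queued partition
    // as soon as it finishes its current one.
    static long parallelSum(List<List<Long>> partitions, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Long> partition : partitions) {
                // One task per partition, producing a partial sum.
                Callable<Long> task =
                        () -> partition.stream().mapToLong(Long::longValue).sum();
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // merge the partial results
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partition task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Long>> partitions = Arrays.asList(
                Arrays.asList(1L, 2L, 3L),
                Arrays.asList(4L, 5L),
                Arrays.asList(6L),
                Arrays.asList(7L, 8L, 9L, 10L));
        System.out.println(parallelSum(partitions, 2)); // 4 partitions on 2 threads
    }
}
```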

Execution Context
- aggregation, sort, distinct
- execution context

Demo

Conclusion
- Java Stream
- Column-based
- DataFrame Library
- Poppy
- Column-based
- lightweight


Reference
- Project Site - http://tenmax.github.io/poppy/
- Poppy User Manual - http://tenmax.github.io/poppy/
- Poppy Javadoc - http://tenmax.github.io/poppy/docs/javadoc/index.html
- Java - https://www.gitbook.com/book/popcornylu/java_multithread/details
- pq - https://github.com/tenmax/pq


Thank you! Questions?