Spark in the BigData dark, by Sergey Levandovskiy

Upload: lohikaodessatechtalks

Post on 12-Aug-2015


Spark in the BigData dark

About me

• 6+ years in HighLoad and BigData
• 3+ years as Team / Tech Lead
• Java, Scala, JavaScript, PHP

Hadoop components

Nano History

Apache Hadoop

Pros:
• Batch operations
• Scalability
• User-defined methods

Cons:
• The problem must be solved within the context of a single job
• Filesystem-based

Tez, Pig, Hive, etc

Pros:
• Batch operations
• Run on top of Hadoop
• Faster than MapReduce
• DAG (directed acyclic graph) execution

Cons:
• Filesystem-based

What is in-memory?

• In-memory compute grid

• In-memory data grid

In-memory compute grid

In-memory data grid

HDD vs MEMORY?

• Memory speed is in nanoseconds
• 10GbE network speed is in microseconds (~50)
• Flash speed is in microseconds (between 20 and 500+)
• Disk speed is in milliseconds (between 4 and 7)
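To make these orders of magnitude easier to compare, here is a small sketch that converts each figure to nanoseconds and expresses it as a multiple of memory latency. The slide only says memory is "in nanoseconds", so the 100 ns memory value is an assumption picked for illustration. The flash and disk values use the lower bounds of the ranges quoted above.

```python
# Rough latency figures, all converted to nanoseconds.
NS_PER_US = 1_000
NS_PER_MS = 1_000_000

latency_ns = {
    "memory": 100,                      # ASSUMPTION: the slide gives no exact number
    "10GbE network": 50 * NS_PER_US,    # ~50 microseconds
    "flash": 20 * NS_PER_US,            # lower bound of 20-500+ microseconds
    "disk": 4 * NS_PER_MS,              # lower bound of 4-7 milliseconds
}


def slowdown_vs_memory(latencies):
    """Return how many times slower each medium is than memory."""
    base = latencies["memory"]
    return {name: ns / base for name, ns in latencies.items()}


ratios = slowdown_vs_memory(latency_ns)
for name, ratio in ratios.items():
    print(f"{name:>14}: {ratio:>8,.0f}x memory latency")
```

Under these assumed numbers a single disk access costs tens of thousands of memory accesses, which is the gap that in-memory processing tries to avoid.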

Spark in-memory model

Apache Spark

Pros:
• In-memory operations up to 100x faster than Hadoop MapReduce
• On-disk operations up to 10x faster than Hadoop MapReduce
• In-memory
• Batch operations and near real time
• Interactive
• Not bound to Hadoop
• Easy for developers to get started

Really fast?

Is Spark popular?

Hazelcast, Apache Spark, Apache Hadoop

Is it popular?

The most active project

Who uses Spark?

Languages

Libraries

RDD

• Resilient == fault-tolerant

• Distributed == compute in parallel

• Dataset == collection

How to create an RDD

• parallelize

• external dataset: filesystem, HDFS, HBase, etc
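As a rough picture of what "parallelize" means conceptually, the sketch below splits a local collection into a fixed number of slices that could then be processed independently. It is plain Python, not the Spark API: `split_into_partitions` is a hypothetical helper written for this illustration, and Spark's own slicing may divide elements differently.

```python
def split_into_partitions(data, num_partitions):
    """Split a local list into num_partitions contiguous slices (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(data)
    # Slice boundaries are spread as evenly as integer division allows.
    bounds = [(i * n) // num_partitions for i in range(num_partitions + 1)]
    return [data[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


partitions = split_into_partitions(list(range(10)), 3)
print(partitions)
```

In Spark itself the corresponding entry points are `SparkContext.parallelize` for a local collection and methods such as `SparkContext.textFile` for an external dataset.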

Lazy RDD

Transformations:
• map
• filter
• flatMap
• mapPartitions
• mapPartitionsWithIndex
• union
• intersection
• distinct
• groupByKey
• reduceByKey
• join

Actions:
• collect
• count
• first
• take(n)
• reduce
• countByKey
• foreach
• takeOrdered
• takeSample
• saveAsTextFile
• saveAsSequenceFile
• saveAsObjectFile
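The split between transformations and actions is what makes an RDD lazy: a transformation only records what should be done, and nothing runs until an action asks for a result. The toy class below sketches that idea in plain Python. It is not the Spark API: `LazyDataset` is a hypothetical stand-in that supports only `map`, `filter`, `collect` and `count`, on a single machine.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # an iterable that can be iterated repeatedly
        self._steps = tuple(steps)   # recorded transformations, not yet executed

    # Transformations: return a new dataset and do no work.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps over the source using the built-in lazy iterators.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


squares = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # still 0: no element has been processed yet
result = squares.collect()         # the action runs the whole pipeline
print(calls_before_action, result)
```

Because each action re-runs the recorded steps from the source, calling `count()` after `collect()` processes every element again in this sketch.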

Example

DataFrame

• Distributed collection of data organized into named columns

• SQL-like syntax

• Catalyst Optimizer

DataFrame vs RDD

RDD

Cluster Overview

Cluster managers

• Standalone

• Apache Mesos

• Hadoop YARN

DEMO

• Standalone

• 1.5G dataset

• 2G RAM executor

DEMO 2

https://goo.gl/xbnANN

Thank You!

Sergey Levandovskiy