Strata NYC 2015 - What's Coming for the Spark Community
Post on 14-Jan-2017
What’s New in the Spark Community
Patrick Wendell | @pwendell
About Me
Co-founder of Databricks
Founding committer of Apache Spark at U.C. Berkeley
Today, manage Spark effort @ Databricks
About Databricks
Team donated Spark to ASF in 2013; primary maintainers of Spark today
Hosted analytics stack based on Apache Spark
Managed clusters, notebooks, collaboration, and third-party apps
Today’s Talk
Quick overview of Apache Spark
Technical roadmap directions
Community and ecosystem trends
What is your familiarity with Spark?
1. Not very familiar with Spark – only very high level.
2. Understand the components/uses well, but I’ve never written code.
3. I’ve written Spark code in a POC or production use case of Spark.
“Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune
…
Apache Spark Engine
Spark Core
Streaming, SQL and DataFrame, MLlib, GraphX
Unified engine across diverse workloads & environments
Scale-out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
This Talk
“What’s new” in Spark? And what’s coming?
Two parts: technical roadmap and community developments
“The future is already here — it's just not very evenly distributed.” - William Gibson
Technical Directions
Spark Technical Directions
Higher-level APIs: make developers more productive
Performance of key execution primitives: shuffle, sorting, hashing, and state management
Pluggability and extensibility: make it easy for other projects to integrate with Spark
Higher-Level APIs
Making Spark accessible to data scientists, engineers, statisticians…
Computing an Average: MapReduce vs Spark
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [x[1], 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Computing an Average with Spark
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [x[1], 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
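The snippet above works because averages themselves cannot be merged pairwise, but (sum, count) pairs can. A minimal pure-Python sketch of the same trick, with no cluster needed; `reduce_by_key` here is a hypothetical stand-in for RDD.reduceByKey:

```python
def reduce_by_key(pairs, fn):
    """Group (key, value) pairs and fold values per key, like RDD.reduceByKey."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return list(acc.items())

# Each record becomes (key, [value, 1]); merging sums both components,
# so a single final division yields the correct average in one pass.
records = [("a", 1), ("a", 3), ("b", 10)]
pairs = [(k, [v, 1]) for k, v in records]
merged = reduce_by_key(pairs, lambda x, y: [x[0] + y[0], x[1] + y[1]])
averages = {k: s / n for k, (s, n) in merged}
# averages == {"a": 2.0, "b": 10.0}
```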
Computing an Average with DataFrames
sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .collect()
Spark DataFrame API
Explicit data model and schema
Selecting columns and filtering
Aggregation (count, sum, average, etc.)
User-defined functions
Joining different data sources
Statistical functions and easy plotting
Python, Scala, Java, and R
sqlCtx.table("people") \
    .groupBy("name") \
    .agg("name", avg("age")) \
    .collect()
Ask more of your framework!

MapReduce            Spark                Spark + DataFrames
Fault tolerance      Fault tolerance      Fault tolerance
Data distribution    Data distribution    Data distribution
                     Set operators        Set operators
                     Operator DAG         Operator DAG
                     Caching              Caching
                                          Schema management
                                          Relational semantics
                                          Logical plan optimization
                                          Storage push-down and opt.
                                          Analytic operations
                                          …
Other High-Level APIs
ML Pipelines
SparkR
[Diagram: ML pipeline – ds0 → tokenizer → ds1 → hashingTF → ds2 → lr → ds3 (lr.model)]
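The pipeline idea above is simply that each stage transforms one dataset into the next (ds0 → ds1 → …). A conceptual pure-Python sketch of that chaining; this is not the actual spark.ml API, and the stage names, while mirroring the slide, are hypothetical:

```python
class Stage:
    """A pipeline stage: transforms one dataset into the next."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def transform(self, ds):
        return self.fn(ds)

class Pipeline:
    """Runs stages in order, feeding each stage's output to the next."""
    def __init__(self, stages):
        self.stages = stages

    def run(self, ds):
        for stage in self.stages:
            ds = stage.transform(ds)
        return ds

# Hypothetical stages mirroring the slide: tokenizer -> hashingTF
tokenizer = Stage("tokenizer", lambda docs: [d.lower().split() for d in docs])
hashing_tf = Stage("hashingTF", lambda rows: [[hash(t) % 16 for t in r] for r in rows])

pipeline = Pipeline([tokenizer, hashing_tf])
features = pipeline.run(["Spark is fast", "Spark is easy"])
```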
> faithful <- read.df("faithful.json", "json")
> head(filter(faithful, faithful$waiting < 50))
##  eruptions waiting
##1     1.750      47
##2     1.750      47
##3     1.867      48
Spark Technical Directions
Higher-level APIs: make developers more productive
Performance of key execution primitives: shuffle, sorting, hashing, and state management
Pluggability and extensibility: make it easy for other projects to integrate with Spark
Performance Initiatives
Project Tungsten – improving runtime efficiency of key internals
Everything else – IO optimizations, dynamic plan re-writing
Project Tungsten: The CPU Squeeze
             2010             2015
Storage      50+ MB/s (HDD)   500+ MB/s (SSD)   10X
Network      1 Gbps           10 Gbps           10X
CPU          ~3 GHz           ~3 GHz            (no change)
Project Tungsten: Code Generation for CPU Efficiency
Code generation on by default, using Janino [SPARK-7956]
Beefed-up built-in UDF library (added ~100 UDFs with code gen)
AddMonths ArrayContains Ascii Base64 Bin BinaryMathExpression CheckOverflow CombineSets Contains CountSet Crc32 DateAdd
DateDiff DateFormatClass DateSub DayOfMonth DayOfYear Decode Encode EndsWith Explode Factorial FindInSet FormatNumber FromUTCTimestamp
FromUnixTime GetArrayItem GetJsonObject GetMapValue Hex InSet InitCap IsNaN IsNotNull IsNull LastDay Length Levenshtein
Like Lower MakeDecimal Md5 Month MonthsBetween NaNvl NextDay Not PromotePrecision Quarter RLike Round
Second Sha1 Sha2 ShiftLeft ShiftRight ShiftRightUnsigned SortArray SoundEx StartsWith StringInstr StringRepeat StringReverse StringSpace
StringSplit StringTrim StringTrimLeft StringTrimRight TimeAdd TimeSub ToDate ToUTCTimestamp TruncDate UnBase64 UnaryMathExpression Unhex UnixTimestamp
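Tungsten's generated code is Java compiled at runtime by Janino. A rough Python analogy of what expression code generation buys: compiling an expression once into a per-row function, instead of re-interpreting an expression tree for every row. The expression shown is hypothetical, chosen only to echo UDFs like Round:

```python
# Conceptual sketch only; Spark generates Java bytecode, not Python.
def codegen(expr):
    """Compile a column-expression string into a per-row function."""
    code = compile(f"lambda row: {expr}", "<generated>", "eval")
    return eval(code, {})

# The expression becomes straight-line code evaluated directly per row.
add_one_then_round = codegen("round(row['x'] + 1, 1)")
result = add_one_then_round({"x": 2.34})
# result == 3.3
```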
Project Tungsten
Binary processing for memory management (all data types):
External sorting with managed memory
External hashing with managed memory
[Diagram: memory pages of packed key/value records, indexed by (hashcode, pointer) entries]
Managed Memory HashMap in Tungsten
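A loose illustration of the idea from the diagram, assuming nothing beyond the slide: records live as packed bytes inside a page, and the map itself holds only (hashcode, offset) entries, avoiding per-record object overhead. A pure-Python sketch, not Spark's implementation:

```python
import struct

# One "memory page" of raw bytes; each record is a fixed-width binary
# (key, value) pair rather than boxed objects.
PAGE = bytearray(4096)
RECORD = struct.Struct("qq")  # 8-byte key, 8-byte value

index = {}   # hashcode -> byte offset into the page (the "hc, ptr" entries)
offset = 0

def put(key, value):
    """Pack the record into the page and index it by hashcode."""
    global offset
    RECORD.pack_into(PAGE, offset, key, value)
    index[hash(key)] = offset
    offset += RECORD.size

def get(key):
    """Follow the (hashcode -> offset) entry and unpack the raw record."""
    off = index[hash(key)]
    _, v = RECORD.unpack_from(PAGE, off)
    return v

put(42, 700)
put(7, 13)
# get(42) == 700
```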
Where are we going?

[Diagram: language frontends (Python, Java/Scala, R, SQL, …) produce a DataFrame logical plan; Tungsten backends could target LLVM, JVM, GPU, or NVRAM]

[Diagram: Tungsten execution underpins the DataFrame API, which in turn serves Python, SQL, R, Streaming, and advanced analytics]
Spark Technical Directions
Higher-level APIs: make developers more productive
Performance of key execution primitives: shuffle, sorting, hashing, and state management
Pluggability and extensibility: make it easy for other projects to integrate with Spark
Pluggability: Rich IO Support
df = sqlContext.read \
    .format("json") \
    .option("samplingRatio", "0.1") \
    .load("/home/michael/data.json")

df.write \
    .format("parquet") \
    .mode("append") \
    .partitionBy("year") \
    .saveAsTable("fasterData")
Unified interface to reading/writing data in a variety of formats
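One way to picture that pluggability: a registry mapping short format names to reader functions, so that read.format(...).option(...).load(...) dispatches to whichever source is registered. This is a hypothetical sketch of the pattern, not Spark's actual Data Source API:

```python
class DataFrameReader:
    """Minimal sketch of a pluggable reader: formats register by short name."""
    _sources = {}

    @classmethod
    def register(cls, name, reader):
        cls._sources[name] = reader

    def __init__(self):
        self._format = None
        self._options = {}

    def format(self, name):
        self._format = name
        return self

    def option(self, key, value):
        self._options[key] = value
        return self

    def load(self, path):
        # Dispatch to whichever reader was registered for this format.
        reader = self._sources[self._format]
        return reader(path, self._options)

# A third-party package only needs to register a reader function:
DataFrameReader.register("csv", lambda path, opts: f"rows from {path}")
rows = DataFrameReader().format("csv").option("header", "true").load("/data/x.csv")
# rows == "rows from /data/x.csv"
```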
Large Number of IO Integrations
Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
[Logos: built-in data sources including { JSON } and JDBC, plus external sources, and more…]
Find more sources at http://spark-packages.org/
Deployment Integrations
Technical Directions
Early on, the focus was: Can Spark be an engine that is faster and easier to use than Hadoop MapReduce?
Today the question is:
Can Spark & its ecosystem make big data as easy as little data?
Community/User Growth
Who is the “Spark Community”?
thousands of users
… hundreds of developers
… dozens of distributors
Getting a better vantage point
Databricks survey - feedback from more than 1,400 users
Community trends: Library & package ecosystem
Strata NY 2014: widespread use of core RDD API
Today: most use built-in and community libraries
51% of users use 3 or more libraries
Spark Packages
Strata NY 2014: didn’t exist
Today: > 100 community packages
> ./bin/spark-shell --packages databricks/spark-avro:0.2
Spark Packages
API extensions: Clojure API, Spark Kernel, Zeppelin Notebook, IndexedRDD
Deployment utilities: Google Compute, Microsoft Azure, Spark Jobserver
Data sources: Redshift, Avro, CSV, Elasticsearch, MongoDB
Increasing storage options
Strata NY 2014: IO primarily through Hadoop InputFormat API
January 2015: Spark adds native storage API
Today: well over 20 natively integrated storage bindings
Cassandra, ElasticSearch, MongoDB, Avro, Parquet, ORC, HBase,
Redshift, SAP, CSV, Cloudant, Oracle, JDBC, SequoiaDB, Couchbase…
Deployment environments
Strata NY 2014: Traction in the Hadoop community
Today: Growth beyond Hadoop… increasingly public cloud
51% of respondents run Spark in public cloud
Wrapping it up
Spark has grown and developed quickly in the last year! Looking forward, expect:
- Engineering effort on higher-level APIs and performance
- A broader surrounding ecosystem
- The unexpected
Where to learn more about Spark?
SparkHub community portal
Spark Summit conference: https://spark-summit.org/
Massive online course (edX)
Databricks Spark training
Books:
Questions?