operational tips for deploying spark

21
Operational Tips for Deploying Spark Miklos Christine Solutions Engineer Databricks

Upload: databricks

Post on 13-Apr-2017

2.302 views

Category:

Engineering


9 download

TRANSCRIPT

Page 1: Operational Tips for Deploying Spark

Operational Tips for Deploying Spark

Miklos Christine Solutions EngineerDatabricks

Page 2: Operational Tips for Deploying Spark

$ whoami

• Previously @ Cloudera

• Deep Knowledge of Big Data Stack

• Apache Spark Expert

• Solutions Engineer @ Databricks!

Page 3: Operational Tips for Deploying Spark

Agenda• Quick Apache Spark Overview

• Configuration Systems

• Pipeline Design Best Practices

• Debugging Techniques

Page 4: Operational Tips for Deploying Spark

Apache Spark

Page 5: Operational Tips for Deploying Spark

Spark Configuration

Page 6: Operational Tips for Deploying Spark

• Command Line: spark-defaults.conf spark-env.sh

• Programmatically: SparkConf()

• Hadoop Configs:core-site.xmlhdfs-site.xml

Spark Core Configuration// Print SparkConfig

sc.getConf.toDebugString

// Print Hadoop Config

val hdConf = sc.hadoopConfiguration.iterator()

while (hdConf.hasNext){

println(hdConf.next().toString())

}

Page 7: Operational Tips for Deploying Spark

• Set SQL Configs Through SQL InterfaceSET key=value;sqlContext.sql(“SET spark.sql.shuffle.partitions=10;”)

• Tools to see current configurations// View SparkSQL Config Propertiesval sqlConf = sqlContext.getAllConfssqlConf.foreach(x => println(x._1 +" : " + x._2))

Spark SQL Configuration

Page 8: Operational Tips for Deploying Spark

• File Formats

• Compression Codecs

• Spark APIs

• Job Profiles

Spark Pipeline Design

Page 9: Operational Tips for Deploying Spark

File Formats• Text File Formats

– CSV– JSON

• Avro Row Format

• Parquet Columnar Format

Page 10: Operational Tips for Deploying Spark

Compression Codecs• Choose and Analyze Compression Codecs

– Snappy, Gzip, LZO

• Configuration Parameters– io.compression.codecs

– spark.sql.parquet.compression.codec

– spark.io.compression.codec

Page 11: Operational Tips for Deploying Spark

Small Files Problem • Small files problem still exists

• Metadata loading

• Use coalesce()

Ref:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

Page 12: Operational Tips for Deploying Spark

• 2 Types of Partitioning– File level and Spark

# Get Number of Spark

df.rdd.getNumPartitions()

40

Partitioningdf.write.\

partitionBy(“colName”).\

saveAsTable(“tableName”)

Page 13: Operational Tips for Deploying Spark

• Leverage Spark UI– SQL – Streaming

Spark Job Profiles

Page 14: Operational Tips for Deploying Spark

Spark Job Profiles

Page 15: Operational Tips for Deploying Spark

Spark Job Profiles

Page 16: Operational Tips for Deploying Spark

• Monitoring & Metrics – Spark – Servers

● Toolset– Ganglia– Graphite

Job Profiles: Monitoring

Ref:http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/

Page 17: Operational Tips for Deploying Spark

● Analyze the Driver’s stacktrace.

● Analyze the executors stacktraces– Find the initial executor’s failure.

● Review metrics– Memory– Disk– Networking

Debugging Spark

Page 18: Operational Tips for Deploying Spark

● OutOfMemoryErrors– Driver– Executors

● Out of Disk Space Issues

● Long GC Pauses

● API Usage

Top Support Issues

Page 19: Operational Tips for Deploying Spark

● Use builtin functions instead of custom UDFs– import pyspark.sql.functions– import org.apache.spark.sql.functions

● Examples:

– to_date()

– get_json_object()

– regexp_extract()

Ref:http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

Top Support Issues

Page 20: Operational Tips for Deploying Spark

● SQL Joins– df_users.join(df_orders).explain()

– set spark.sql.autoBroadcastJoinThreshold

● Exported Parquet from External Systems– spark.sql.parquet.binaryAsString

● Tune number of Shuffle Partitions– spark.sql.shuffle.partitions

Top Support Issues

Page 21: Operational Tips for Deploying Spark

Thank [email protected]://www.linkedin.com/in/mrchristine