
1st Big Data Developers Lisbon Meetup

Luis Gregório, Analytics CTP · Willem Hendrix, Big Data Analyst
Lisbon, 16 June 2016

All About Spark

Agenda

● Welcome

● Why SPARK matters

● Technical introduction to SPARK

● Scalable machine learning with Apache SystemML

Welcome

Big Data Developers Lisbon

● Objective:
  – Discuss Big Data technologies and their best practices

● Sponsored by:
  – IBM
  – … but open to the active involvement of other people and organizations

● Community:
  – 291 registered members
  – We want your input for the next sessions.

[Bar chart: audience self-assessed Spark expertise, scale 0 (none) to 5 (expert)]

Today Presenting...

Luis Gregório, Analytics CTP, IBM Portugal
luis.gregorio@pt.ibm.com
https://pt.linkedin.com/in/lgregorio/pt

Willem Hendrix, Big Data Analyst, IBM Netherlands
Willem@nl.ibm.com
https://nl.linkedin.com/in/willemhendriks/pt

Why Spark is Important

Spark Background

● Started as a research project in 2009, open source in 2010
  – General-purpose cluster computing system
  – Generalizes MapReduce
  – Batch-oriented processing
  – Main concept: Resilient Distributed Datasets (RDDs)

● Apache incubator project in June 2013
  – Apache top-level project Feb 27, 2014

● Current version 1.6.0 (January 4, 2016)
  – Requires Scala 2.10.x, Maven
  – Languages supported: Java, Scala, Python, R (Java 7+, Python 2.6+, R 3.1+)
  – May need additional libraries for Python, e.g. NumPy
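The RDD concept generalizes MapReduce's two-step map/reduce into a richer set of transformations. As a rough, stdlib-only Python sketch (no Spark involved; the sample lines are illustrative), this is the word-count pattern that RDD operations such as flatMap, map, and reduceByKey express:

```python
# Word count in the map/reduce style that RDDs generalize (plain Python, not Spark).
lines = ["spark generalizes mapreduce", "spark keeps data in memory"]

# flatMap-like step: split each line into words
words = [w for line in lines for w in line.split()]

# map-like step: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey-like step: sum the counts per word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(counts["spark"])  # 2
```

In real Spark code the same pipeline is three chained RDD calls, and the runtime distributes each step across the cluster.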

Key reasons for interest in Spark

● Performant
  – In-memory architecture greatly reduces disk I/O
  – Anywhere from 20-100x faster for common tasks

● Productive
  – Concise and expressive syntax, especially compared to prior approaches
  – Single programming model across a range of use cases and steps in the data lifecycle
  – Integrated with common programming languages: Java, Python, Scala
  – New tools continually reduce the skill barrier for access (e.g. SQL for analysts)

● Leverages existing investments
  – Works well within the existing Hadoop ecosystem

● Improves with age
  – Large and growing community of contributors continuously improves the full analytics stack and extends capabilities

Beware of the hype!

Motivation for Apache Spark

● Traditional approach: using MapReduce for complex jobs, interactive queries, and online event-hub processing involves a lot of (slow) disk I/O.

[Diagram: each iteration reads its input from HDFS, processes it in CPU and memory, and writes its result back to HDFS, so every chained job pays the disk round-trip]

● Solution: keep more data in memory with a new distributed execution engine.

[Diagram: Spark chains one job's output into the next job's input while keeping the working set in memory, which is faster than network and disk; the HDFS read/write bottleneck between iterations disappears]
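The difference between the two execution models can be made concrete with a toy Python sketch (plain stdlib, not Spark code; `read_from_hdfs` is a hypothetical stand-in for a slow disk read). It counts how often each strategy touches "disk" across three iterations of the same transformation:

```python
# Count how often the (slow) input is re-read under each strategy.
reads = {"mapreduce": 0, "spark": 0}

def read_from_hdfs(style):
    """Stand-in for a slow HDFS read; tallies one read per call."""
    reads[style] += 1
    return list(range(5))  # stand-in for the input data

ITERATIONS = 3

# MapReduce style: every iteration re-reads its input from disk.
for _ in range(ITERATIONS):
    data = read_from_hdfs("mapreduce")
    data = [x + 1 for x in data]

# Spark style: read once, then keep the working set in memory.
cached = read_from_hdfs("spark")
for _ in range(ITERATIONS):
    cached = [x + 1 for x in cached]

print(reads)  # {'mapreduce': 3, 'spark': 1}
```

In actual Spark the in-memory retention is requested explicitly, e.g. via `rdd.cache()`, and the engine spills to disk only when memory runs out.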

Spark Common Use Cases


IBM is all-in on Spark

● Contribute to the Core
  – Launch the Spark Technology Cluster (STC), 300 engineers
  – Open-source SystemML
  – Partner with Databricks

● Foster Community
  – Educate 1M+ data scientists and engineers via online courses
  – Sponsor AMPLab, creators and evangelists of Spark

● Infuse the Portfolio
  – Integrate Spark throughout the portfolio
  – 3,500 employees working on Spark-related topics
  – Spark however customers want it: standalone, platform, or products

"It's like Spark just got blessed by the enterprise rabbi."
Ben Horowitz, Andreessen Horowitz

Our goal is to be #1 Spark contributor and adopter

Focal point for IBM investment in Spark
● Code contributions to the Apache Spark project
● Build industry solutions using Spark
● Evangelize Spark technology inside and outside IBM

Agile engagement across IBM divisions
● Systems: contribute enhancements to Spark core, and optimized infrastructure (hardware/software) for Spark
● Analytics: IBM Analytics software will exploit Spark processing
● Research: build innovations above (solutions that use Spark), inside (improvements to Spark core), and below (improvements to the systems that execute Spark) the Spark stack

Our Use of Spark at IBM

IBM Cloudant on Bluemix

IBM Dataworks on Bluemix

IBM Commerce Dynamic Pricing

IBM Swift Object Storage

IBM Open Platform on Softlayer

IBM BigInsights on Bluemix

IBM Insights for Twitter Service

IBM Twitter CDE

IBM IoT on Bluemix

Journey Analytics

Mark Down Optimization

Nimbus ETL

Omni Channel Pricing

Apache Spark on Bluemix

IBM SPSS Analytics Server

IBM SPSS Modeler


Want to learn more?

Free online courses at BigDataUniversity.com

Thank you for your attention
