all about spark - meetupfiles.meetup.com/19103884/1_whysparkisimportant.pdf · spark background...

All About Spark
1st Big Data Developers Lisbon Meetup
Luis Gregório, Analytics CTP || Willem Hendrix, Big Data Analyst
Lisbon, 16 June 2016




Agenda

● Welcome

● Why Spark is important

● Technical introduction to Spark

● Scalable machine learning with Apache SystemML


Welcome

Big Data Developers Lisbon


Big Data Developers Lisbon

● Goal:
– Discuss Big Data technologies and best practices

● Sponsored by:
– IBM
– … but open to active participation from other people and organizations

● Community:
– 291 registered members
– We want your input for upcoming sessions.

[Chart: Spark Expertise (0 = None, 5 = Expert); x-axis 0 to 5, y-axis 0 to 50]


Today Presenting...

Luis Gregório
Analytics CTP, IBM Portugal
[email protected]
pt.linkedin.com/in/lgregorio/pt

Willem Hendrix
Big Data Analyst, IBM Netherlands
[email protected]
nl.linkedin.com/in/willemhendriks/pt


Why Spark is Important


Spark Background

● Started as a research project in 2009, open source in 2010
– General purpose cluster computing system
– Generalizes MapReduce
– Batch oriented processing
– Main concept: Resilient Distributed Datasets (RDDs)

● Apache incubator project in June 2013
– Apache top level project Feb 27, 2014

● Current version 1.6.0 (January 4, 2016)
– Requires Scala 2.10.x, Maven
– Languages supported: Java, Scala, Python, R (Java 7+, Python 2.6+, R 3.1+)
– May need additional libraries for Python, e.g. numpy
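The RDD idea named above can be sketched in plain Python. This is a toy stand-in, not Spark's API or implementation: the class `ToyRDD` and its method `compute_partition` are hypothetical names chosen for illustration. It only shows two properties usually associated with RDDs: transformations are recorded lazily as a lineage, and a partition can be rebuilt by replaying that lineage on the source data.

```python
from functools import reduce as fold
from typing import Any, Callable, List


class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus a lineage of lazy steps.

    Hypothetical illustration only; this is not Spark's API or implementation.
    """

    def __init__(self, partitions: List[List[Any]], lineage=()):
        self._partitions = partitions    # source data, split into partitions
        self._lineage = tuple(lineage)   # recorded transformations, not yet run

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # A transformation only records what to do; nothing is computed here.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def compute_partition(self, index: int) -> List[Any]:
        # Replaying the lineage on one source partition rebuilds its contents,
        # so a lost partition can be recomputed instead of restored from a copy.
        data = list(self._partitions[index])
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self) -> List[Any]:
        # An action triggers the actual computation, partition by partition.
        out: List[Any] = []
        for i in range(len(self._partitions)):
            out.extend(self.compute_partition(i))
        return out

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        return fold(fn, self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
total = even_squares.reduce(lambda a, b: a + b)
```

A map step followed by a reduce step is the MapReduce pattern; chaining further steps such as `filter` onto the same object is one way to read "generalizes MapReduce".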


Key reasons for interest in Spark

● Performant
– In-memory architecture greatly reduces disk I/O
– Anywhere from 20-100x faster for common tasks

● Productive
– Concise and expressive syntax, especially compared to prior approaches
– Single programming model across a range of use cases and steps in the data lifecycle
– Integrated with common programming languages: Java, Python, Scala
– New tools continually reduce the skill barrier for access (e.g. SQL for analysts)

● Leverages existing investments
– Works well within the existing Hadoop ecosystem

● Improves with age
– Large and growing community of contributors that continuously improves the full analytics stack and extends capabilities

Beware of the hype!


Motivation for Apache Spark

● Traditional Approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involves lots of (slow) disk I/O

[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → HDFS Write → HDFS Read → Iteration 2 (CPU, Memory) → HDFS Write → Result]


Motivation for Apache Spark

● Traditional Approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involves lots of (slow) disk I/O

[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → HDFS Write → HDFS Read → Iteration 2 (CPU, Memory) → HDFS Write → Result]

● Solution: Keep more data in-memory with a new distributed execution engine

[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → Iteration 2 (CPU, Memory). Labels: "Chain Job Output into New Job Input", "faster than network & disk", "Zero Read/Write Disk Bottleneck"]
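The contrast between the two flows can be sketched in plain Python, with a local JSON file standing in for HDFS. This is only an illustration under stated assumptions: the functions `step`, `run_with_disk_round_trips`, and `run_in_memory` are hypothetical names, and the "job" is a trivial increment. The sketch counts how often each approach touches storage for the same number of iterations.

```python
import json
import os
import tempfile


def step(values):
    """One hypothetical iteration of a job: increment every value."""
    return [v + 1 for v in values]


def run_with_disk_round_trips(path, iterations):
    """Chain iterations through storage: each one reads its input from disk
    and writes its output back. Assumes iterations >= 1."""
    reads = writes = 0
    values = []
    for _ in range(iterations):
        with open(path) as f:
            values = json.load(f)
        reads += 1
        values = step(values)
        with open(path, "w") as f:
            json.dump(values, f)
        writes += 1
    return values, reads, writes


def run_in_memory(path, iterations):
    """Read the input once, then pass each iteration's output directly
    to the next iteration without touching storage."""
    with open(path) as f:
        values = json.load(f)
    reads, writes = 1, 0
    for _ in range(iterations):
        values = step(values)
    return values, reads, writes


with tempfile.TemporaryDirectory() as tmp:
    disk_path = os.path.join(tmp, "disk_input.json")
    mem_path = os.path.join(tmp, "mem_input.json")
    for p in (disk_path, mem_path):
        with open(p, "w") as f:
            json.dump([1, 2, 3], f)
    disk_result = run_with_disk_round_trips(disk_path, iterations=2)
    memory_result = run_in_memory(mem_path, iterations=2)
```

Both runs produce the same values; the difference is in the read and write counts, which grow with the number of iterations only in the disk-chained version.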


Spark Common Use Cases


IBM is all-in on Spark

Contribute to the Core
● Launch Spark Technology Center (STC), 300 engineers
● Open source SystemML
● Partner with Databricks

Foster Community
● Educate 1M+ data scientists and engineers via online courses
● Sponsor AMPLab, creators and evangelists of Spark

Infuse the Portfolio
● Integrate Spark throughout portfolio
● 3,500 employees working on Spark-related topics
● Spark however customers want it: standalone, platform or products

"It's like Spark just got blessed by the enterprise rabbi."
Ben Horowitz, Andreessen Horowitz


Our goal is to be #1 Spark contributor and adopter

Focal point for IBM investment in Spark
● Code contributions to Apache Spark project
● Build industry solutions using Spark
● Evangelize Spark technology inside/outside IBM

Agile engagement across IBM divisions
● Systems: contribute enhancements to Spark core, and optimized infrastructure (hardware/software) for Spark
● Analytics: IBM Analytics software will exploit Spark processing
● Research: build innovations above (solutions that use Spark), inside (improvements to Spark core), and below (improve systems that execute Spark) the Spark stack


Our Use of Spark at IBM

IBM Cloudant on Bluemix

IBM Dataworks on Bluemix

IBM Commerce Dynamic Pricing

IBM Swift Object Storage

IBM Open Platform on Softlayer

IBM BigInsights on Bluemix

IBM Insights for Twitter Service

IBM Twitter CDE

IBM IoT on Bluemix

Journey Analytics

Mark Down Optimization

Nimbus ETL

Omni Channel Pricing

Apache Spark on Bluemix

IBM SPSS Analytics Server

IBM SPSS Modeler


Want to learn more?

Free on-line courses at BigDataUniversity.com


Thank you for your attention