TRANSCRIPT
1st Big Data Developers Lisbon Meetup
Luis Gregório, Analytics CTP || Willem Hendrix, Big Data Analyst
Lisbon, 16 June 2016
All About Spark
Agenda
● Welcome
● Why SPARK matters
● A technical introduction to SPARK
● Scalable Machine Learning with Apache SystemML
Welcome
Big Data Developers Lisbon
● Objective:
– Discuss Big Data technologies and best practices
● Sponsored by:
– IBM
– … but open to the active participation of other people and organizations
● Community:
– 291 registered members
– We want your input for the next sessions.
[Bar chart: audience poll of Spark expertise, rated 0 (none) to 5 (expert)]
Luis Gregório
Analytics CTP, IBM Portugal
[email protected]
http://pt.linkedin.com/in/lgregorio/pt
Willem Hendrix
Big Data Analyst, IBM Netherlands
[email protected]
http://nl.linkedin.com/in/willemhendriks/pt
Today Presenting...
Why Spark is Important
Spark Background
● Started as a research project in 2009, open-sourced in 2010
– General-purpose cluster computing system
– Generalizes MapReduce
– Batch-oriented processing
– Main concept: Resilient Distributed Datasets (RDDs)
● Apache Incubator project in June 2013
– Apache top-level project on Feb 27, 2014
● Current version 1.6.0 (January 4, 2016)
– Requires Scala 2.10.x, Maven
– Languages supported: Java, Scala, Python, R (Java 7+, Python 2.6+, R 3.1+)
– May need additional libraries for Python, e.g. NumPy
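The slide names RDDs as Spark's main concept without unpacking it. The core idea is that an RDD is an immutable dataset whose transformations (map, filter) are lazy, only recording a lineage, while actions (collect, reduce) actually run the computation. A minimal single-machine sketch of that idea in plain Python, with no Spark dependency, might look like this (the `MiniRDD` class and its single-partition design are illustrative, not Spark's actual API):

```python
# Sketch of the RDD idea: transformations are lazy and only extend a
# lineage of recorded operations; an action replays the lineage over
# the source data to materialize a result. Illustrative only.

class MiniRDD:
    def __init__(self, data, lineage=None):
        self._data = list(data)          # source data (one partition here)
        self._lineage = lineage or []    # recorded transformations, not yet run

    # --- transformations: return a new MiniRDD, compute nothing yet ---
    def map(self, f):
        return MiniRDD(self._data, self._lineage + [("map", f)])

    def filter(self, pred):
        return MiniRDD(self._data, self._lineage + [("filter", pred)])

    # --- actions: replay the lineage over the data ---
    def collect(self):
        items = self._data
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:  # "filter"
                items = [x for x in items if fn(x)]
        return items

    def reduce(self, f):
        items = self.collect()
        acc = items[0]
        for x in items[1:]:
            acc = f(acc, x)
        return acc


rdd = MiniRDD(range(1, 6))
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())                    # [4, 16]
print(squares_of_evens.reduce(lambda a, b: a + b))   # 20
```

Because nothing runs until an action, the lineage can in principle be re-executed on a lost partition, which is where the "Resilient" in the name comes from.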
Key reasons for interest in Spark
● Performant
– In-memory architecture greatly reduces disk I/O
– Anywhere from 20–100x faster for common tasks
● Productive
– Concise and expressive syntax, especially compared to prior approaches
– Single programming model across a range of use cases and steps in the data lifecycle
– Integrated with common programming languages: Java, Python, Scala
– New tools continually reduce the skill barrier for access (e.g. SQL for analysts)
● Leverages existing investments
– Works well within the existing Hadoop ecosystem
● Improves with age
– Large and growing community of contributors continuously improving the full analytics stack and extending its capabilities
Beware of the hype!
Motivation for Apache Spark
● Traditional approach: using MapReduce jobs for complex workloads, interactive queries, and online event-hub processing involves lots of (slow) disk I/O
[Diagram: each MapReduce iteration reads its input from HDFS and writes its result back to HDFS before the next iteration starts]
● Solution: keep more data in memory with a new distributed execution engine
[Diagram: after an initial HDFS read, Spark keeps each iteration's working data in memory, which is faster than network and disk; MapReduce instead chains each job's output into the next job's input through HDFS writes and reads, creating a disk read/write bottleneck that Spark avoids]
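The difference the diagram shows can be made concrete with a toy simulation: count how many "disk reads" an iterative job performs when every iteration re-reads its input, versus when the input is cached in memory after the first read. This is purely illustrative (no real I/O, no Spark); the function names and counter are invented for the sketch:

```python
# Toy model of the diagram above: an iterative job either re-reads its
# input from "disk" every iteration (MapReduce-style) or reads it once
# and serves later iterations from an in-memory cache (Spark-style).

DISK = list(range(10))   # stands in for a dataset stored on HDFS
disk_reads = 0           # counts simulated HDFS reads

def read_from_disk():
    global disk_reads
    disk_reads += 1
    return list(DISK)

def run_iterations(iterations, cached):
    global disk_reads
    disk_reads = 0
    memory = None
    total = 0
    for _ in range(iterations):
        if cached and memory is not None:
            data = memory                 # served from memory: no disk read
        else:
            data = read_from_disk()       # this iteration hits "disk"
            if cached:
                memory = data             # keep it in memory for next time
        total += sum(data)                # the per-iteration computation
    return total, disk_reads

result, reads = run_iterations(5, cached=False)
print(reads)   # 5 -- one simulated HDFS read per iteration
result, reads = run_iterations(5, cached=True)
print(reads)   # 1 -- read once, then served from memory
```

Both runs compute the same result; only the number of (slow) reads changes, which is the whole argument of the slide.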
Spark Common Use Cases
IBM is all-in on Spark
● Contribute to the Core
– Launch the Spark Technology Cluster (STC), 300 engineers
– Open-source SystemML
– Partner with Databricks
● Foster Community
– Educate 1M+ data scientists and engineers via online courses
– Sponsor AMPLab, creators and evangelists of Spark
● Infuse the Portfolio
– Integrate Spark throughout the portfolio
– 3,500 employees working on Spark-related topics
– Spark however customers want it: standalone, platform, or products

"It's like Spark just got blessed by the enterprise rabbi."
Ben Horowitz, Andreessen Horowitz
Our goal is to be #1 Spark contributor and adopter
● Focal point for IBM investment in Spark
– Code contributions to the Apache Spark project
– Build industry solutions using Spark
– Evangelize Spark technology inside and outside IBM
● Agile engagement across IBM divisions
– Systems: contribute enhancements to Spark core and optimized infrastructure (hardware/software) for Spark
– Analytics: IBM Analytics software will exploit Spark processing
– Research: build innovations above (solutions that use Spark), inside (improvements to Spark core), and below (improved systems that execute Spark) the Spark stack
Our Use of Spark at IBM
IBM Cloudant on Bluemix
IBM Dataworks on Bluemix
IBM Commerce Dynamic Pricing
IBM Swift Object Storage
IBM Open Platform on Softlayer
IBM BigInsights on Bluemix
IBM Insights for Twitter Service
IBM Twitter CDE
IBM IoT on Bluemix
Journey Analytics
Mark Down Optimization
Nimbus ETL
Omni Channel Pricing
Apache Spark on Bluemix
IBM SPSS Analytics Server
IBM SPSS Modeler
Want to learn more?
Free on-line courses at BigDataUniversity.com
Thank you for your attention