Couchbase Server and Spark Machine Learning Meetup

Post on 15-Apr-2017

Machine Learning Demo
Spark & Couchbase
Will Gardella, Product Manager
@WillGardella
will.gardella@couchbase.com

Agenda
- Technologies: Spark & Couchbase
- ML Use Cases
- Demo

Get Couchbase 4.5 Enterprise Edition: http://www.couchbase.com/nosql-databases/downloads
Sample code on GitHub (Word2VecExample.scala): https://github.com/couchbaselabs/couchbase-spark-samples

© 2016 Couchbase Inc.

Spark

Spark not slowing yet
Source: http://stackoverflow.com/research/developer-survey-2016

Yep, Spark is popular

Hello, Spark!
Fast, general engine for big data processing with libraries for advanced analytics

Spark Core: task scheduling, memory management, fault recovery, interacting with storage systems

Lots of code reuse compared to Hadoop w/ MR, where projects are essentially stovepiped

Develop on a few nodes

Separates operations concerns from development concerns

Spark SQL: easy JSON handling and querying; tight N1QL integration (planned before GA)

Spark Streaming: persisting DStreams; DCP source (planned before GA)

Spark
- Fast: 100x better than MR when in-memory, 10x on disk
- Sophisticated: powerful primitives, not just MR; advanced algorithms, graph, machine learning
- Developer convenience: well-designed APIs in Java, Scala, Python, R; supports SQL, DataFrames, Datasets and many other formats; interactive shell (REPL), standalone mode, batch & streaming

Spark is much better for highly iterative algorithms and interactive queries, which is important given that the majority of jobs work on 100GB or less of data (small big data).

There's not much that Hadoop MR is better at performance-wise, except possibly when the data set is orders of magnitude larger than cluster RAM, but that's speculation.

Couchbase

Couchbase addresses the needs of Digital Economy businesses

Combines the flexibility of JSON, power of SQL, and scalability of NoSQL

Develop with Agility
- Flexible JSON data model
- Dynamic schema support
- Powerful query language (N1QL) extends SQL to JSON

Operate at Any Scale
- Sub-millisecond latency at scale
- Elastic scaling on commodity servers
- High availability

Couchbase Server: the operational DBMS for web, mobile & IoT

Achieving scale & availability with Couchbase
- Scale cluster online with growing application needs, on demand
- Build always available apps with replication & failover
- Remove programming complexity by pushing sharding to the database

KEY POINT: COUCHBASE PROVIDES CORE FEATURES INCLUDING BUILT-IN REPLICATION AND DISTRIBUTED DATA MANAGEMENT THAT ALLOW YOU TO BUILD HIGHLY SCALABLE AND AVAILABLE APPLICATIONS.

When it comes to delivering scalability and availability with Couchbase, a number of architectural features come into play.

Built-in replication: Every server takes care of some active data and some replica data.

Cluster Map: The app server tier at the top includes the Couchbase Client Library -- which is similar to an SDK or JDBC driver in the relational world and includes the Cluster Map.

The Cluster Map is important because it makes the location of the data transparent to the application. The database takes care of how data gets distributed and where to access any specific piece of data. So all that complexity is removed from the app programming and pushed to the database.

So here for example, a data request from App Server 2 comes in, and the Cluster Map knows that data lives on Couchbase node 1 (and by the way is replicated to node N), so it retrieves the data on server 1 and returns it to the app -- completely transparent to the application and the developer. Couchbase takes care of all that magic, as well as the availability which is the replication aspect.
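The lookup described in these notes can be sketched in a few lines. This is a minimal illustration, not the Couchbase client's actual algorithm: the partition count, the CRC32-modulo hash, the round-robin placement, and the names `vbucket_for`, `build_cluster_map` and `locate` are all assumptions made for the example.

```python
import zlib

NUM_PARTITIONS = 1024  # assumption: illustrative partition count


def vbucket_for(key, num_partitions=NUM_PARTITIONS):
    """Hash a document key to a partition number (simplified hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_cluster_map(nodes, num_partitions=NUM_PARTITIONS):
    """Give every partition one active node and one replica node.

    Round-robin placement (an assumption) means each node ends up holding
    some active data and some replica data. Needs at least two nodes.
    """
    cluster_map = []
    for partition in range(num_partitions):
        active = nodes[partition % len(nodes)]
        replica = nodes[(partition + 1) % len(nodes)]
        cluster_map.append({"active": active, "replica": replica})
    return cluster_map


def locate(key, cluster_map):
    """Client-side lookup: which node serves this key, and where is its copy?"""
    return cluster_map[vbucket_for(key, len(cluster_map))]


nodes = ["node1", "node2", "node3"]  # hypothetical node names
cluster_map = build_cluster_map(nodes)
placement = locate("user::1001", cluster_map)
print(placement)
```

The application only ever calls `locate` (or rather, the client library does so on its behalf); adding a node means rebuilding the map, not changing application code.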

So these capabilities allow you to build highly scalable and available applications and remove a lot of the complexity that you face with relational databases.

Achieve Global Data Distribution and HA/DR
Built-in Cross Data Center Replication (XDCR)

KEY POINT: COUCHBASE PROVIDES BUILT-IN REPLICATION ACROSS CLUSTERS AND DATA CENTERS, TO MEET YOUR REQUIREMENTS FOR DISASTER RECOVERY AND DATA LOCALITY.

One of the key strengths of Couchbase is its ability to replicate data not just within the cluster but across clusters.

Many enterprises are looking to build applications with built-in disaster recovery and data locality, so they can place data closer to their customers to improve the customer experience.

With a couple of clicks, you can set up your Couchbase database to replicate data seamlessly across geographically distributed clusters and data centers, giving you both Disaster Recovery and data locality capabilities.

This diagram shows a topology with bi-directional replication between 3 data centers, but with Couchbase you can easily configure your data replication for virtually any topology and you can change it on the fly as your requirements change.

N1QL access to JSON

N1QL - Next generation, NoSQL query language

SELECT, FROM, JOIN, WHERE, LIKE, GROUP BY, etc.

Powerful extensions for JSON:
- (Un)folding of nested structures with NEST, UNNEST
- Array handling with EVERY / ANY ... IN array SATISFIES ...
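To give a feel for what these JSON extensions do, here is a rough Python emulation of the semantics of UNNEST and of ANY / EVERY ... SATISFIES. It is a sketch only: the order documents, field names, and helper functions are hypothetical, and it models the behavior on in-memory dicts rather than running N1QL.

```python
# Hypothetical order documents with a nested "items" array.
orders = [
    {"id": "order::1", "customer": "alice",
     "items": [{"sku": "hat-red", "qty": 2}, {"sku": "hat-blue", "qty": 1}]},
    {"id": "order::2", "customer": "bob",
     "items": [{"sku": "scarf", "qty": 5}]},
    {"id": "order::3", "customer": "carol", "items": []},
]


def unnest(docs, array_field):
    """Yield one (parent, element) row per array element, like an inner UNNEST.

    Documents whose array is empty or missing produce no rows.
    """
    for doc in docs:
        for element in doc.get(array_field, []):
            yield doc, element


def any_satisfies(array, predicate):
    """ANY x IN array SATISFIES predicate(x) END"""
    return any(predicate(x) for x in array)


def every_satisfies(array, predicate):
    """EVERY x IN array SATISFIES predicate(x) END

    Note: Python's all() is True for an empty list.
    """
    return all(predicate(x) for x in array)


# Unfold the nested items into flat (order id, sku) rows.
rows = [(doc["id"], item["sku"]) for doc, item in unnest(orders, "items")]

# Orders containing at least one line item with qty >= 5.
bulk_orders = [d["id"] for d in orders
               if any_satisfies(d["items"], lambda i: i["qty"] >= 5)]

# Orders where every line item has qty <= 2.
small_orders = [d["id"] for d in orders
                if every_satisfies(d["items"], lambda i: i["qty"] <= 2)]
```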

KEY POINT: N1QL is a marriage of the strengths of the JSON flexible schema with the power and familiarity of SQL.

Couchbase & Spark

Damn it Jim, I'm a big data processing engine, not a database!

NoSQL + Spark use cases

Operations | Analysis
- Recommendations
- Next gen data warehousing
- Predictive analytics
- Fraud detection
- Catalog
- Customer 360 + IoT
- Personalization
- Mobile applications

- Fast (memory centric)
- Flexible
- Scalable

Big Data at a Glance

                      Couchbase                   Spark
Use cases             Operational, Web / Mobile   Analytics, Machine Learning
Processing mode       Online, Ad Hoc              Ad Hoc, Batch
Low latency =         < 1ms ops                   Seconds
Performance           Highly predictable          Variable
Users are typically   Millions of customers       100s of analysts or data scientists
                      Memory-centric              Memory-centric
Big data =            10s of Terabytes            Petabytes (?)

ANALYTICAL

OPERATIONAL

Use cases are totally different: Spark is an execution engine, not a database. A prime use case for Hadoop is as a low-cost data warehouse, which is not a good use case for Couchbase or Spark.

Latency: everyone says "real time", but what do they mean? For an operational system, this means:
- Extremely fast (in-memory) reads
- Extremely fast (log append) writes

For Couchbase, this means completing millions of ops / second (these are gets / sets) at latencies of under 1 ms. Compare the LinkedIn figures from Jerry Franz's session, tuned to LinkedIn's specific workload:
- 75% writes (sets + incr) / 25% reads
- 13 byte values, 25 byte keys on average
- 2.5 billion items (+ 1 replica)
- 600 Gbytes of RAM / 3 Tbytes of disk in use on average
- Average store latency ~ 0.4 milliseconds; 99th percentile store latency ~ 2.5 milliseconds
- Average get latency ~ 0.8 milliseconds; 99th percentile get latency ~ 8 milliseconds
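The latency figures quote both an average and a 99th percentile because the two answer different questions. The sketch below shows how such summary statistics can be computed and why the average alone hides the slow tail. The sample latencies are made up for illustration (they are not LinkedIn's data), and the nearest-rank percentile definition is one common choice among several.

```python
import math


def mean(samples):
    """Arithmetic mean of the samples."""
    return sum(samples) / len(samples)


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Returns the smallest sample such that at least pct percent of all
    samples are less than or equal to it.
    """
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical store latencies in milliseconds: 98 fast operations, 2 slow ones.
latencies_ms = [0.3] * 98 + [2.5] * 2

avg_ms = mean(latencies_ms)            # dominated by the fast majority
p50_ms = percentile(latencies_ms, 50)  # the typical operation
p99_ms = percentile(latencies_ms, 99)  # exposes the slow tail

print(f"avg={avg_ms:.3f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

For an operational system serving millions of users, the tail matters: at millions of ops per second, one percent of requests is still a very large number of slow operations.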

In general, Spark is just better at Hadoop's core use cases than Hadoop (note, I'm not talking about HDFS).

Spark scales less well than MapReduce-based solutions on Hadoop, but that's OK: "[T]he majority of real-world analytics jobs process less than 100GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing." Source: http://www.msr-waypoint.com/pubs/204499/a20-appuswamy.pdf

This is especially convenient for people with a development background who like to run "stuff" (ad hoc queries) on data in Hadoop/HDFS. It removes the need to know about the underlying Hadoop layer, so they can just think of it as data.

Couchbase Spark Connector

Features
- Automatic cluster & resource management
- Create RDDs from KV, N1QL, Views
- Create DStreams from DCP feeds
- Persist RDDs and DStreams
- Support for Datasets, DataFrames and SparkSQL

Couchbase & Spark for Machine Learning

- Data scientists train machine learning models
- Load results into Couchbase so end users can interact with them online
- Examples include recommendations for content and products, flagging fraud or spam

Diagram labels: Hadoop, Machine Learning Models, Data Warehouse, Historical Data
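The pattern on this slide (train offline, then load the results into an operational store for online lookups) can be sketched as follows. Everything here is a stand-in chosen for illustration: the co-occurrence counting replaces a real model such as the Word2Vec example used in the demo, `KeyValueStore` is an in-memory substitute for a database bucket, and the basket data and `rec::` key prefix are hypothetical.

```python
import json
from collections import Counter, defaultdict
from itertools import permutations


def train_recommendations(baskets, top_n=2):
    """Offline step: count products bought together, keep the top_n partners."""
    co_counts = defaultdict(Counter)
    for basket in baskets:
        for a, b in permutations(sorted(set(basket)), 2):
            co_counts[a][b] += 1
    model = {}
    for product, partners in co_counts.items():
        # Highest count first; ties broken alphabetically for stable output.
        ranked = sorted(partners.items(), key=lambda kv: (-kv[1], kv[0]))
        model[product] = [name for name, _ in ranked[:top_n]]
    return model


class KeyValueStore:
    """In-memory stand-in for an operational key-value database."""

    def __init__(self):
        self._docs = {}

    def upsert(self, key, doc):
        self._docs[key] = json.dumps(doc)  # documents are stored as JSON

    def get(self, key):
        raw = self._docs.get(key)
        return json.loads(raw) if raw is not None else None


# Hypothetical historical purchase data (the training observations).
baskets = [
    ["fedora", "scarf"],
    ["fedora", "scarf", "gloves"],
    ["fedora", "beanie"],
    ["beanie", "gloves"],
]

# Train offline, then load one small JSON document per product.
store = KeyValueStore()
for product, recommended in train_recommendations(baskets).items():
    store.upsert(f"rec::{product}", {"product": product, "recommended": recommended})

# Online step: a single key lookup per page view, no model evaluation needed.
print(store.get("rec::fedora"))
```

The design choice is that the expensive work happens in batch, while the online path is a plain get by key, which is where an operational database is fastest.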

DEMO TIME!

Hedgehats.com: Personalized Recommendations

node (e.g.)

Predictions

Training Data (Observations)

Model

Learn More - Couchbase Spark Connector
- Couchbase Spark Connector - Source: https://github.com/couchbase/couchbase-spark-connector
- Code Samples: https://github.com/couchbaselabs/couchbase-spark-samples
- Talk: Spark with Couchbase to Electrify your Data Processing: https://youtu.be/sBnAf7gAfLc
- Market Basket Analysis Sample App (Avalon): https://github.com/Avalon-Consulting-LLC/couchbase-spark-mba

Questions?
Will Gardella, Product Manager
@WillGardella
will.gardella@couchbase.com