automating big data technologies for faster time-to-value

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

November 1, 2017 | 11:00 AM PT

Automating Big Data Technologies for Faster Time-to-Value



Today’s Presenters

David Potes, Solutions Architect, Amazon Web Services

Minesh Patel, Technical Director, Qubole

Seth Myers, Senior Data Scientist, Demandbase


Today’s Agenda

1. An overview of AWS and AWS Marketplace, with an emphasis on AWS data lake solutions and Qubole

2. Overview of the Qubole solutions featured in our story

3. Challenges faced by Demandbase

4. The Demandbase success story with AWS and Qubole

5. Q&A/Discussion


Learning Objectives

1. How to dramatically reduce management complexities for analytics operations

2. How to reduce the costs of processing and analyzing data in a data lake on AWS

3. How to operate at the scale and efficiency of a large enterprise, with a small data team


Introduction to Data Lake Concepts


Unlocking Data

Most companies and organizations are embarking on ambitious innovation initiatives to unlock their data.

The data already exists but goes unused or is locked away from complementary data sets in isolated data silos.


Enter Data Lake Architectures

A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes and heterogeneous types of data.


Benefits of a Data Lake – All Data in One Place

Store and analyze all of your data, from all of your sources, in one centralized location.

“Why is the data distributed in many locations? Where is the single source of truth?”


Benefits of a Data Lake – Quick Ingest

Quickly ingest data without needing to force it into a pre-defined schema.

“How can I collect data quickly from various sources and store it efficiently?”


Benefits of a Data Lake – Storage vs Compute

Separating your storage and compute allows you to scale each component as required.

“How can I scale up with the volume of data being generated?”


Benefits of a Data Lake – Schema on Read

“Is there a way I can apply multiple analytics and processing frameworks to the same data?”

A data lake enables ad-hoc analysis by applying schemas on read, not on write.
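To make schema on read concrete, here is a minimal sketch in Python. It assumes raw events are stored as JSON lines; the field names (`account`, `page`, `ms`) and both schemas are hypothetical and only illustrate how two consumers can read the same stored data through different schemas.

```python
import json

# Raw events stored exactly as they arrived; no schema was enforced at write time.
# The records and field names are invented for illustration.
RAW_LINES = [
    '{"account": "acme", "page": "/pricing", "ms": 5400}',
    '{"account": "globex", "page": "/docs", "ms": 1200, "referrer": "search"}',
    '{"account": "acme", "page": "/docs"}',
]


def read_with_schema(lines, schema):
    """Project each raw JSON record onto `schema` at read time.

    `schema` maps a field name to a (cast, default) pair. Fields missing from
    a record take the default; fields not in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# Two consumers apply different schemas to the same raw data.
PAGE_VIEW_SCHEMA = {"account": (str, ""), "page": (str, "")}
DWELL_TIME_SCHEMA = {"account": (str, ""), "ms": (int, 0)}

page_views = list(read_with_schema(RAW_LINES, PAGE_VIEW_SCHEMA))
dwell_times = list(read_with_schema(RAW_LINES, DWELL_TIME_SCHEMA))
```

Neither schema required rewriting the stored records, which is the property the slide describes.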


AWS Approach to Data Lake


Amazon S3 is the Data Lake


Designed Benefits of an Amazon S3 Data Lake

Fixed Cluster Data Lake

• Limited to only the single tool contained on the cluster (e.g., Hadoop, a data warehouse, or Cassandra), while use cases and ecosystem tools change rapidly

• Expensive to add nodes to add storage capacity

• Expensive to replicate data against node loss

• Complexity in scaling local storage capacity

• Long refresh cycles to add additional storage equipment

Amazon S3 Data Lake

• Decouple storage and compute by making Amazon S3 object-based storage, not a fixed tool cluster, the data lake

• Flexibility to use any and all tools in the ecosystem: the right tool for the job

• Future-proof your architecture: as new use cases and new tools emerge, you can plug and play the current best of breed


Why Amazon S3 for Data Lake?

Durable

• Designed for 11 9s of durability

Available

• Designed for 99.99% availability

High performance

• Multiple upload

• Range GET

Scalable

• Store as much as you need

• Scale storage and compute independently

• No minimum usage commitments

Integrated

• Amazon EMR

• Amazon Redshift

• Amazon DynamoDB

Easy to use

• Simple REST API

• AWS SDKs

• Read-after-create consistency

• Event notification

• Lifecycle policies
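As an illustration of how Range GET supports parallel reads, the sketch below computes the HTTP `Range` header values a client could send to fetch one object in fixed-size parts. The function name and part size are hypothetical; only the `bytes=start-end` header syntax, with an inclusive end offset, comes from the HTTP range-request standard. No request is sent here.

```python
def byte_ranges(object_size, part_size):
    """Return HTTP Range header values that together cover an object.

    Each value has the form "bytes=start-end", where `end` is inclusive,
    as defined for HTTP range requests.
    """
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, object_size, part_size):
        end = min(start + part_size, object_size) - 1  # inclusive end offset
        ranges.append(f"bytes={start}-{end}")
    return ranges


# A 10-byte object read in 4-byte parts needs three requests.
example = byte_ranges(10, 4)
```

Each header value could be issued by a separate worker, so the parts of a large object are downloaded concurrently.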


Automating Complex Tasks

Qubole makes Big Data technologies swift and simple


About Qubole

One of the largest cloud-agnostic Big Data as a Service companies

Founded by the pioneers of “big data” at Facebook and the creators of Apache Hive


Poll Question #1

What is the status of your big data initiative?


The Vision


Qubole Data Service

Amazon


Autonomous Data Management


Qubole Cloud Agents


Total Cost Savings Among Qubole Customers in 2016 and 2017

• Cluster Life Cycle Management: $150M

• Workload-aware Autoscaling: $121M

• Spot Shopper: $40M

Cluster Life Cycle Management savings: amount saved by automatically terminating a cluster when inactive

Workload-aware Autoscaling savings: amount saved by predictively adjusting the number of nodes to meet demand

Spot Shopper savings: amount saved by using Spot Instances
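The cluster life cycle idea, terminating a cluster once it has been inactive, can be sketched as a simple idle-timeout check. This is a hypothetical illustration in Python, not Qubole's implementation: the function name, its parameters, and the 20-minute timeout are all assumptions.

```python
from datetime import datetime, timedelta


def should_terminate(last_activity, now, idle_timeout, running_jobs):
    """Decide whether an idle cluster should be shut down.

    Hypothetical policy: never terminate while jobs are running; otherwise
    terminate once the cluster has been idle for at least `idle_timeout`.
    """
    if running_jobs > 0:
        return False
    return now - last_activity >= idle_timeout


# Example: the last activity was 30 minutes ago and the assumed timeout is 20 minutes.
last_seen = datetime(2017, 11, 1, 11, 0)
decision = should_terminate(
    last_activity=last_seen,
    now=last_seen + timedelta(minutes=30),
    idle_timeout=timedelta(minutes=20),
    running_jobs=0,
)
```

A scheduler would run a check like this periodically, so compute is billed only while work is actually being done.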


Architectural Diagram


Poll Question #2

What big data technology are you using or evaluating?


Why Qubole?


Demandbase Automates With Qubole

Demandbase provides more value for their B2B marketing customers by automating Big Data and Machine Learning operations.


Who is Demandbase?

Demandbase is a B2B marketing automation company that leverages artificial intelligence to automate all aspects of the advertising, selling, and marketing process.


The Challenge

• Many factors determine which accounts a business should target

  • Do they have a need/budget for the product?

  • Are they currently in-market for the product?

  • Do they have decision makers ready to buy?

• These insights must come from many different types of big datasets

• Demandbase’s previous account identification tool took multiple days to run

• Our clients could not iterate or modify their strategies with such slow turn-around


The Data Used to Identify Accounts

• To determine an account’s need for the product

  • We have firmographic information on 14 million accounts

  • We’ve built a knowledge graph of all accounts using NLP technology that crawls 350 TB of web pages a month

• To determine if an account is in-market

  • We track 700 billion web interactions a year, each one mapped to employees across all accounts

• To identify decision makers

  • We are currently tracking over 100 million employees across all accounts


[Figure: account-scoring overview. All 14M accounts are scored, with the top 5K available to the user; keywords are extracted from 700B web interactions; buyers at each account are identified from 100M+ contacts.]


The Solution

• The user requests a new list of accounts with a button-press

• 60 EC2 servers are spun up

• A machine learning algorithm is built using Spark and MLlib

• For each of 14 million accounts

  • Information about relevant web interactions, buyers, online content, etc. is fed into the machine learning model

  • The model scores each account

• Top 5K accounts are pushed to the web app, along with relevant info

• From button-press to new account list: 20 minutes
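The score-then-rank step above can be sketched in a few lines of Python. The slides describe a Spark and MLlib model; this sketch swaps in a hand-written weighted sum purely for illustration. The feature names, weights, and sample accounts are hypothetical, and `k=2` stands in for the top 5K.

```python
import heapq


def score(account):
    """Toy stand-in for the trained model: a weighted sum of made-up features."""
    return 2 * account["need"] + account["interactions"] + 3 * account["buyers"]


def top_accounts(accounts, k):
    """Return the k highest-scoring accounts, best first."""
    return heapq.nlargest(k, accounts, key=score)


# Invented sample data; the real pipeline scores 14 million accounts.
ACCOUNTS = [
    {"name": "A", "need": 3, "interactions": 10, "buyers": 1},  # score 19
    {"name": "B", "need": 5, "interactions": 2, "buyers": 0},   # score 12
    {"name": "C", "need": 1, "interactions": 20, "buyers": 2},  # score 28
    {"name": "D", "need": 0, "interactions": 0, "buyers": 0},   # score 0
]

best = [account["name"] for account in top_accounts(ACCOUNTS, k=2)]
```

`heapq.nlargest` avoids sorting the whole list, which matters when only a small top slice of a very large set is needed.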


Qubole Makes This Possible

• Qubole manages all of our EC2 instances

• So far, we’ve successfully tested 20 different concurrent models (20 × 60 EC2 servers)

• Qubole keeps our costs down through dynamic bidding and heterogeneous server clusters

• Our web app calls Qubole’s easy-to-implement Play API, which spins up the EC2 instances and deploys our Spark job

• With Qubole taking care of the infrastructure, we could focus on developing the machine learning

• Qubole allowed us to build a self-serve machine-learning-as-a-service solution


Next Steps and Further Information

• Try a pre-configured production-ready Qubole deployment on AWS Data Lake:

• https://aws.amazon.com/quickstart/architecture/qubole-on-data-lake-foundation/

• Buy on AWS Marketplace:

• https://aws.amazon.com/marketplace/pp/B06XX76R24

• Learn more about Qubole:

• https://www.qubole.com/products/qds-for-aws/

• Learn more about Demandbase:

• https://www.demandbase.com/technology/

• Try AWS:

• https://aws.amazon.com/



Q & A


Thank you!
