automating big data technologies for faster time-to-value
TRANSCRIPT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
November 1, 2017 | 11:00 AM PT
Automating Big Data Technologies for Faster Time-to-Value
Today’s Presenters
David Potes, Solutions Architect, Amazon Web Services
Minesh Patel, Technical Director, Qubole
Seth Myers, Senior Data Scientist, Demandbase
Today’s Agenda
1. An overview of AWS and AWS Marketplace, with an emphasis on
AWS data lake solutions and Qubole
2. Overview of the Qubole solutions featured in our story
3. Challenges faced by Demandbase
4. The Demandbase success story with AWS and Qubole
5. Q&A/Discussion
Learning Objectives
1. How to dramatically reduce management complexities for analytics
operations
2. How to reduce the costs of processing and analyzing data in a data
lake on AWS
3. How to operate at the scale and efficiency of a large enterprise,
with a small data team
Introduction to Data Lake
Concepts
Unlocking Data
Most companies and organizations are embarking on ambitious innovation initiatives to unlock their data.
The data already exists but goes unused or is locked away from complementary data sets in isolated data silos.
Enter Data Lake Architectures
A data lake is a new and increasingly
popular architecture for storing and analyzing
massive volumes and heterogeneous
types of data.
Benefits of a Data Lake – All Data in One Place
Store and analyze all of your data,
from all of your sources, in one
centralized location.
“Why is the data distributed in
many locations? Where is the
single source of truth?”
Benefits of a Data Lake – Quick Ingest
Quickly ingest data
without needing to force it into a
pre-defined schema.
“How can I collect data quickly
from various sources and store
it efficiently?”
Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute
allows you to scale each component as
required
“How can I scale up with the
volume of data being generated?”
Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple
analytics and processing frameworks
to the same data?”
A Data Lake enables ad-hoc
analysis by applying schemas
on read, not write.
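As a rough illustration of schema on read (a minimal sketch, not an AWS or Qubole API): the raw events below are stored untouched, and each analysis applies its own schema only when it reads them. The sample events, the field names, and the read_with_schema helper are all hypothetical.

```python
import json

# Raw events are kept exactly as they arrived; nothing is enforced at write time.
# (Hypothetical sample data for illustration only.)
RAW_EVENTS = [
    '{"user": "a", "page": "/pricing", "ms": 1200}',
    '{"user": "b", "page": "/docs", "ms": 300, "referrer": "search"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> type converter) while reading.

    Fields missing from a record become None; fields not named in the
    schema are ignored for this particular read.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        })
    return rows


# Two different "views" over the same stored data.
page_views = read_with_schema(RAW_EVENTS, {"user": str, "page": str})
latency = read_with_schema(RAW_EVENTS, {"page": str, "ms": float, "referrer": str})
```

Because the schema lives with the reader rather than the stored data, a new analysis only needs a new schema dictionary; the stored events do not have to be rewritten.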
AWS Approach to Data Lake
Amazon S3 is the Data Lake
Designed Benefits of an Amazon S3 Data Lake
Fixed Cluster Data Lake
• Limited to only the single tool contained on the cluster (e.g. Hadoop, a data warehouse, Cassandra, etc.). Use cases & ecosystem tools change rapidly
• Expensive to add nodes to add storage capacity
• Expensive to replicate data against node loss
• Complexity in scaling local storage capacity
• Long refresh cycles to add additional storage equipment

Amazon S3 Data Lake
• Decouple storage and compute by making Amazon S3 object-based storage, not a fixed tool cluster, the data lake
• Flexibility to use any and all tools in the ecosystem. The right tool for the job
• Future-proof your architecture. As new use cases and new tools emerge, you can plug and play the current best of breed
Why Amazon S3 for Data Lake?
• Durable: designed for 11 9s of durability
• Available: designed for 99.99% availability
• High performance: multiple upload, Range GET
• Scalable: store as much as you need; scale storage and compute independently; no minimum usage commitments
• Integrated: Amazon EMR, Amazon Redshift, Amazon DynamoDB
• Easy to use: simple REST API, AWS SDKs, read-after-create consistency, event notification, lifecycle policies
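
To put the durability and availability figures in concrete terms, the arithmetic can be sketched as follows. This assumes the design targets are read as average annual rates; the function names and the 10,000,000-object example are illustrative assumptions, not part of the slide.

```python
from fractions import Fraction


def annual_loss_fraction(nines: int) -> Fraction:
    """Expected fraction of objects lost per year at `nines` 9s of durability.

    11 nines means 99.999999999% durability, i.e. a loss fraction of 1e-11.
    """
    return Fraction(1, 10 ** nines)


def years_per_lost_object(object_count: int, nines: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / (object_count * annual_loss_fraction(nines))


def downtime_minutes_per_year(availability_percent: str) -> Fraction:
    """Minutes per (365-day) year outside the availability target."""
    unavailable = 1 - Fraction(availability_percent) / 100
    return unavailable * 365 * 24 * 60


# Assumed example: ten million stored objects at 11 nines of durability.
print(years_per_lost_object(10_000_000, 11))        # 10000
print(float(downtime_minutes_per_year("99.99")))    # 52.56
```

Exact fractions are used instead of floats because subtracting 0.99999999999 from 1 in floating point loses precision at this scale.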
Automating Complex Tasks
Qubole makes Big Data technologies swift and simple
About Qubole
One of the largest cloud-
agnostic Big Data as a Service
companies
Founded by the pioneers of “big
data” @ Facebook and the
creators of Apache Hive
Poll Question #1
What is the status of your big data initiative?
The Vision
Qubole Data Service
Autonomous Data Management
Qubole Cloud Agents
Total Cost Savings Among Qubole Customers in 2016 and 2017
• Cluster Life Cycle Management: $150M
  – Amount saved by automatically terminating a cluster when inactive
• Workload-aware Autoscaling: $121M
  – Amount saved by predictively adjusting the number of nodes to meet demand
• Spot Shopper: $40M
  – Amount saved by utilizing Spot instances
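
The first two savings categories can be pictured with a small sketch. This is not Qubole's implementation: the idle timeout, the tasks-per-node figure, and both function names are hypothetical, and the sizing rule here reacts to the current backlog rather than predicting demand as the slide describes.

```python
import math
from datetime import datetime, timedelta


def should_terminate(last_activity: datetime, now: datetime,
                     idle_timeout: timedelta) -> bool:
    """Cluster life cycle management: shut down once idle for long enough."""
    return now - last_activity >= idle_timeout


def target_node_count(pending_tasks: int, tasks_per_node: int,
                      min_nodes: int, max_nodes: int) -> int:
    """Workload-aware sizing: enough nodes for the backlog, within bounds."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Hypothetical cluster that last ran a command an hour ago.
last_seen = datetime(2017, 11, 1, 10, 0)
print(should_terminate(last_seen, datetime(2017, 11, 1, 11, 0),
                       timedelta(minutes=30)))
print(target_node_count(pending_tasks=250, tasks_per_node=10,
                        min_nodes=2, max_nodes=60))
```

Both functions are pure decisions over observed state, so a scheduler loop could call them periodically and act on the result.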
Architectural Diagram
Poll Question #2
What big data technology are you using or evaluating?
Why Qubole?
Demandbase Automates With Qubole
Demandbase provides more value for their B2B marketing customers
by automating Big Data and Machine Learning operations.
Who is Demandbase?
Demandbase is a B2B marketing automation company that leverages
artificial intelligence to automate all aspects of the advertising, selling,
and marketing process.
The Challenge
• Many factors determine which accounts a business should target
• Do they have a need/budget for the product?
• Are they currently in-market for the product?
• Do they have decision makers ready to buy?
• These insights must come from many different types of big datasets
• Demandbase’s previous account identification tool took multiple days to
run
• Our clients could not iterate or modify their strategies with such slow
turn-around
The Data Used to Identify Accounts
• To determine an account’s need for the product
• We have firmographic information on 14 Million accounts
• We’ve built a knowledge graph of all accounts using NLP
technology that crawls 350 TB of web pages a month
• To determine if an account is in-market
• We track 700 Billion web interactions a year, each one mapped
to employees across all accounts
• To identify decision makers
• We are currently tracking over 100 Million employees across
all accounts
All 14M accounts are scored, top 5K available to user
Keywords extracted from 700B web interactions
Buyers at each account identified from 100M+ contacts
[Figure labels: Company 2, Company 3]
The Solution
• The user requests a new list of accounts with a button-press
• 60 EC2 servers are spun up
• A machine learning algorithm is built using Spark and MLlib
• For each of 14 Million accounts
  • Information about relevant web interactions, buyers, online content, etc. is fed into the
  machine learning model
  • The model scores each account
• Top 5K accounts are pushed to the web app, along with
relevant info
• From button-press to new account list – 20 minutes
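
The score-everything-then-keep-the-top-K step can be sketched in plain Python. This is only a single-machine stand-in for the Spark and MLlib job described above: the linear scoring function, the feature names, the weights, and the three sample accounts are hypothetical, and the real system scores 14 Million accounts and keeps 5K.

```python
import heapq


def score_account(features: dict, weights: dict) -> float:
    """Stand-in for the trained model: a weighted sum of account features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def top_accounts(accounts: dict, weights: dict, k: int) -> list:
    """Score every account and return the ids of the k highest-scoring ones."""
    scored = (
        (score_account(features, weights), account_id)
        for account_id, features in accounts.items()
    )
    # nlargest keeps only k candidates in memory while scanning all scores.
    return [account_id for _, account_id in heapq.nlargest(k, scored)]


# Hypothetical accounts and weights, for illustration only.
accounts = {
    "acme": {"web_interactions": 120.0, "buyers": 3.0},
    "globex": {"web_interactions": 40.0, "buyers": 9.0},
    "initech": {"web_interactions": 5.0, "buyers": 1.0},
}
weights = {"web_interactions": 0.01, "buyers": 0.5}

print(top_accounts(accounts, weights, k=2))  # ['globex', 'acme']
```

Keeping only the top K while scanning avoids sorting the full scored list, which matters more as the account count grows.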
Qubole Makes This Possible
• Qubole manages all of our EC2 instances
• So far, we’ve tested 20 different concurrent models (20 X 60
EC2 servers) successfully
• Qubole keeps our costs down through dynamic bidding and
heterogeneous server clusters
• Our web app calls Qubole’s easy-to-implement Play API, which
spins up the EC2 instances and deploys our Spark job
• With Qubole taking care of the infrastructure, we could focus on
developing the machine learning
• Qubole allowed us to build a self-serve machine-learning-as-a-service
solution
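
A web app handing a Spark job to a REST API might look roughly like the sketch below. The slide does not show the actual call, so everything here is an assumption: the base URL is a placeholder, and the /commands path, the X-AUTH-TOKEN header, and the command_type and cmdline payload fields should be checked against Qubole's API documentation before use. The request is only built, never sent.

```python
import json
import urllib.request


def build_spark_job_request(api_base_url: str, auth_token: str,
                            spark_submit_cmdline: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits a Spark job.

    Path, header name, and payload fields are assumptions, not confirmed
    by the presentation.
    """
    payload = {
        "command_type": "SparkCommand",   # assumed field name and value
        "cmdline": spark_submit_cmdline,  # assumed field name
    }
    return urllib.request.Request(
        url=api_base_url.rstrip("/") + "/commands",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-AUTH-TOKEN": auth_token,   # assumed header name
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


# Placeholder values; a real caller would pass its own endpoint and token.
request = build_spark_job_request(
    "https://api.example.invalid/v1",
    "TOKEN",
    "spark-submit score_accounts.py",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a call to urllib.request.urlopen(request); it is left out so the sketch has no network dependency.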
Next Steps and Further Information
• Try a pre-configured production-ready Qubole deployment on AWS Data Lake:
• https://aws.amazon.com/quickstart/architecture/qubole-on-data-lake-foundation/
• Buy on AWS Marketplace:
• https://aws.amazon.com/marketplace/pp/B06XX76R24
• Learn more about Qubole:
• https://www.qubole.com/products/qds-for-aws/
• Learn more about Demandbase:
• https://www.demandbase.com/technology/
• Try AWS:
• https://aws.amazon.com/
Q & A
Thank you!