Big Data: Mejores prácticas en AWS

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Javier Ros, Solution Architect. June 2016. Big Data: Mejores prácticas en AWS

Upload: amazon-web-services

Post on 15-Apr-2017


Page 1: Big Data: Mejores prácticas en AWS

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Javier Ros, Solution Architect

Jun, 2016

Big Data: Mejores prácticas en AWS

Page 2: Big Data: Mejores prácticas en AWS

Agenda

Big Data challenges

Design Patterns on AWS

RavenPack. Big Data for Financial Applications

Shopping cart

Page 3: Big Data: Mejores prácticas en AWS

Big Data Challenges

Volume

Velocity

Variety

Page 4: Big Data: Mejores prácticas en AWS

Simplify Big Data Processing

data answers

Time to Answer (Latency)

Throughput

Cost

ingest / collect

store

process / analyze

consume / visualize

Page 5: Big Data: Mejores prácticas en AWS

On-Demand Big Data Analytics

Young Huang. Director, Big Data Analytics.

“We were able to save about 90% over the EC2 On-Demand cost”

Page 6: Big Data: Mejores prácticas en AWS

Clickstream Analysis

Suneel Sajnani. Senior VP of Enterprise Technology

Kinesis and Spark to process more than 30TB per day

Page 7: Big Data: Mejores prácticas en AWS

Event-driven Extract, Transform, Load (ETL)

Brian Filppu. Director of Business Intelligence

Kinesis, Lambda and EMR for 16 million events per day

Page 8: Big Data: Mejores prácticas en AWS

Smart Applications

Joe Emison. Founder & Chief Technology Officer

“Amazon Machine Learning democratizes the process of building predictive models. It's easy and fast to use, and has machine-learning best practices encapsulated in the product, which lets us deliver results significantly faster than in the past.”

Page 9: Big Data: Mejores prácticas en AWS

June 2, 2016

RavenPack

AWS Summit Madrid

Mapping the World’s Big Data for Financial Applications

Jose Luis Cruz ‒ Operations [email protected]

Page 10: Big Data: Mejores prácticas en AWS

● What is RavenPack?

● Current Use Cases in the Cloud

● What’s Next?

Page 11: Big Data: Mejores prácticas en AWS

ravenpack.com | [email protected] | AMERICAS Tel: (646) 277-7339 | EMEA-APAC Tel: +34 952 90 73 90

RavenPack at a Glance

• RavenPack delivers big data analytics to financial professionals

• 80% of big data is unstructured

• Only 29% of decisions are based on big data

Page 12: Big Data: Mejores prácticas en AWS

RavenPack at a Glance

Top hedge funds and investment banks use RavenPack for trading and risk management

RavenPack processes hundreds of thousands of documents each day

We produce machine-readable analytics for each document in real time (<250 ms)

Archive of 300+ million documents spanning 20+ years

Page 13: Big Data: Mejores prácticas en AWS

Use Case: Realtime Classification

Classic Model

• 6 Servers, 19 KVM virtual machines

• Limited Storage - Expensive to Upgrade

• Multiple Points of Failure

[Diagram components: RT Feed, Collectors, Classifier, RDBMS, Files, Snapshots]

Page 14: Big Data: Mejores prácticas en AWS

Use Case: Realtime Classification

Cloud Model using AWS

• CloudFormation to model the Stack

• Unlimited, Distributed Storage

• Easy redundancy, failover and backup

[Diagram components: RT Feed, Collectors, Classifiers, Snapshots; Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon CloudSearch, Amazon Redshift, Amazon Kinesis]

Page 15: Big Data: Mejores prácticas en AWS

Use Case: History Classification

Classic Model

• Same Limited Set of Servers, Same RDBMS

• Can affect Realtime System, Backups

• Full archive, 4-6 Classifiers → 6 weeks!

[Diagram components: RDBMS, Files, Classifiers]

Page 16: Big Data: Mejores prácticas en AWS

Use Case: History Classification

Cloud Model using AWS

• Servers on Demand, Distributed Storage

• Independent of Realtime System

• Full archive, 100 Classifiers → from 6 weeks to 3 days!

[Diagram components: Coordinator, Classifiers, two Availability Zones; Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon Redshift]
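The "6 weeks to 3 days" figure can be put in perspective with a little arithmetic. The sketch below assumes that "6 weeks" means 42 days of continuous processing and that all classifiers are comparable; neither is stated on the slides, and `parallel_efficiency` is a hypothetical helper name.

```python
def parallel_efficiency(old_days, new_days, old_workers, new_workers):
    """Observed speed-up divided by the ideal (linear) speed-up."""
    observed_speedup = old_days / new_days
    ideal_speedup = new_workers / old_workers
    return observed_speedup / ideal_speedup


OLD_DAYS = 6 * 7      # assumption: "6 weeks" is 42 days of processing
NEW_DAYS = 3
NEW_WORKERS = 100

speedup = OLD_DAYS / NEW_DAYS

# The slides give a range of 4-6 classifiers for the classic model,
# so the efficiency is computed for both ends of that range.
eff_from_4 = parallel_efficiency(OLD_DAYS, NEW_DAYS, 4, NEW_WORKERS)
eff_from_6 = parallel_efficiency(OLD_DAYS, NEW_DAYS, 6, NEW_WORKERS)

print(f"speed-up: {speedup:.0f}x")
print(f"efficiency vs. 4 classifiers: {eff_from_4:.0%}")
print(f"efficiency vs. 6 classifiers: {eff_from_6:.0%}")
```

Under these assumptions the move to 100 classifiers is a 14x speed-up, somewhere between 56% and 84% of what perfectly linear scaling would give.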

Page 17: Big Data: Mejores prácticas en AWS

Future: Incorporating Structured Data

• Structured BIG DATA available:

    Consensus and estimates

    Online purchases

    Bank and credit card transactions

    Satellite photographic information

• Can improve current analytics or create new ones

• Challenges

    Amount of data available

    Mapping all those different datasets

• Solution: Kinesis + Redshift + EMR

[Diagram components: Amazon S3, Amazon Redshift, Amazon EC2, Amazon EMR, Amazon Kinesis]

Page 18: Big Data: Mejores prácticas en AWS

Future: Self-Service Data

Download a Custom “Slice” of Analytics Data

• Provide a Web-API and Web Service

• Let client specify parameters

    Data Set and Time Range

    Entities and Events

    Filters

• Leverage Amazon Redshift and S3

• Compression and Multiple Output Formats

[Diagram components: Amazon S3, Amazon Redshift, Amazon API Gateway, Amazon EC2, AWS Lambda]
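One way such a slice request could be turned into a compressed export is Redshift's UNLOAD command, which writes a query result to S3. The sketch below is only an illustration of that idea, not RavenPack's design: the `SliceRequest` shape, the `event_time` and `entity_id` column names, and the bucket and role values in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SliceRequest:
    """Client-supplied parameters for a custom slice (hypothetical shape)."""
    dataset: str                      # hypothetical table name
    start: str                        # inclusive start of the time range
    end: str                          # exclusive end of the time range
    entities: list = field(default_factory=list)
    compress: bool = True


def _quote(value):
    """Render a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_unload(req, s3_prefix, iam_role):
    """Return a Redshift UNLOAD statement that exports the slice to S3."""
    if not req.dataset.isidentifier():
        raise ValueError("dataset must be a plain identifier")
    # event_time and entity_id are assumed column names.
    where = [f"event_time >= {_quote(req.start)}",
             f"event_time < {_quote(req.end)}"]
    if req.entities:
        ids = ", ".join(_quote(e) for e in req.entities)
        where.append(f"entity_id IN ({ids})")
    select = f"SELECT * FROM {req.dataset} WHERE " + " AND ".join(where)
    options = ["DELIMITER ','"]
    if req.compress:
        options.append("GZIP")
    # UNLOAD takes the query as a quoted string, so it is quoted once more.
    return (f"UNLOAD ({_quote(select)}) TO {_quote(s3_prefix)} "
            f"IAM_ROLE {_quote(iam_role)} " + " ".join(options) + ";")
```

For example, `build_unload(SliceRequest("analytics", "2016-01-01", "2016-02-01"), "s3://example-bucket/slices/", "arn:aws:iam::123456789012:role/example")` yields a statement that a web service could submit to Redshift before handing the resulting S3 objects to the client.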

Page 19: Big Data: Mejores prácticas en AWS

Future: The RavenPack Cloud

• Let Clients upload Proprietary Content to the Amazon Virtual Private Cloud (VPC)

    Internal documents / research

    Email, Instant Messaging

    CRM, bug tracking system

    Client support call transcriptions

    ...

• Provision Computing and Storage Resources on a Per Project Basis

• View Private Analytics in Isolation or Alongside Standard RavenPack Analytic DataSets

• Everything Goes Away when Project Completes

[Diagram components: Amazon DynamoDB, Amazon RDS, Amazon S3, Amazon Redshift, Amazon EC2, AWS CloudFormation, Amazon CloudSearch]

Page 20: Big Data: Mejores prácticas en AWS

RavenPack International

Thanks for listening!

ravenpack.com | [email protected] | AMERICAS Tel: (646) 277-7339 | EMEA-APAC Tel: +44 (0) 782 783 8282

Jose Luis Cruz: [email protected]

Page 21: Big Data: Mejores prácticas en AWS

Shopping Cart

http://amzn.to/BigDataSummit

Page 22: Big Data: Mejores prácticas en AWS

Shopping cart

Page 23: Big Data: Mejores prácticas en AWS

Business Metrics

Time to buy

Time to cancel

Number of sales

Sales per country

% buy

Page 24: Big Data: Mejores prácticas en AWS

Architecture

[Diagram components: client, mobile client, API Server, cart events, Amazon Kinesis, Amazon S3 (twice), Amazon EMR, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight]

Page 25: Big Data: Mejores prácticas en AWS

Customer events

{
    "type": "productAdded",
    "timestamp": 1462465948,
    "customer": 5,
    "cart": 203438,
    "product": 937293
}

{
    "type": "productRemoved",
    "timestamp": 1462465948,
    "customer": 5,
    "cart": 203438,
    "product": 937293
}

{
    "type": "cartBuy",
    "timestamp": 1462465948,
    "customer": 5,
    "cart": 203438,
    "productlist": [34, 253]
}

{
    "type": "cartDiscard",
    "timestamp": 1462465948,
    "customer": 5,
    "cart": 203438,
    "productlist": [2353, 1355, 1234]
}

Page 26: Big Data: Mejores prácticas en AWS

Amazon Kinesis Firehose

Page 27: Big Data: Mejores prácticas en AWS

Amazon Kinesis Firehose
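As a rough illustration of how an API server might push the cart events shown earlier into Kinesis Firehose, the sketch below serializes each event as one JSON line and hands it to the boto3 `put_record` call. The helper names, the required-field check, and the stream name in the usage note are assumptions made for this example; the slides do not show the producer code.

```python
import json

# Fields that every sample event on the "Customer events" slide carries.
REQUIRED_FIELDS = ("type", "timestamp", "customer", "cart")


def encode_event(event):
    """Serialize one cart event as a newline-terminated JSON record."""
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        raise ValueError(f"event is missing fields: {missing}")
    # One JSON object per line keeps the delivered S3 objects easy to parse.
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def send_event(firehose, stream_name, event):
    """Send one event through a boto3 Firehose client passed in by the caller."""
    return firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_event(event)},
    )
```

With AWS credentials configured, a call would look like `send_event(boto3.client("firehose"), "shoppingcart-events", event)`, where `shoppingcart-events` is a made-up delivery stream name. Passing the client in keeps the serialization testable without AWS access.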

Page 28: Big Data: Mejores prácticas en AWS

Architecture

[Diagram components: client, mobile client, API Server, cart events, Amazon Kinesis, Amazon S3 (twice), Amazon EMR, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight, AWS Data Pipeline]

Page 29: Big Data: Mejores prácticas en AWS

AWS Data Pipeline

Page 30: Big Data: Mejores prácticas en AWS

Amazon Elastic MapReduce (EMR)

Page 31: Big Data: Mejores prácticas en AWS

Pig script

-- Load the raw JSON cart events for the given date.
DATA = LOAD 's3://shoppingcart-summit/streams/$inputdate/*' USING JsonLoader('type:chararray,timestamp:int,customer:int,cart:long,product:chararray, productlist: chararray');

-- Drop records that have no event type.
DATA2 = FILTER DATA BY type is not null;

-- One group per shopping cart.
CARTS = GROUP DATA2 BY cart;

CARTDATA = FOREACH CARTS {
    LOGIN = FILTER DATA2 BY type == 'login';
    ADDED = FILTER DATA2 BY type == 'productAdded';
    REMOVED = FILTER DATA2 BY type == 'productRemoved';
    BUY = FILTER DATA2 BY type == 'cartBuy';
    -- Note: IsEmpty(BUY) is true when the cart has no cartBuy event.
    GENERATE MAX(DATA2.customer) AS customer, group AS cart,
        MAX(DATA2.timestamp)-MIN(DATA2.timestamp) AS duration, IsEmpty(BUY) AS buy,
        COUNT_STAR(ADDED) AS added, COUNT_STAR(REMOVED) AS removed,
        MAX(DATA2.timestamp)-MAX(ADDED.timestamp) AS thinking,
        MIN(LOGIN.timestamp) AS timestamp, '\"\"';
};

-- Write one comma-separated row per cart under the redshift/ prefix.
STORE CARTDATA INTO 's3://shoppingcart-summit/redshift/$inputdate/' USING PigStorage(',');
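For readers less familiar with Pig, the per-cart aggregation in the script can be approximated in plain Python. This is a simplified sketch rather than a translation: it works on an in-memory list of event dictionaries, omits the login timestamp and the trailing constant column, and reports `bought` as true when a `cartBuy` event is present (the script's `buy` column holds `IsEmpty(BUY)` instead). The function name is hypothetical.

```python
from collections import defaultdict


def summarize_carts(events):
    """Group events by cart and compute per-cart features."""
    by_cart = defaultdict(list)
    for event in events:
        if event.get("type") is not None:   # mirrors the FILTER on type
            by_cart[event["cart"]].append(event)

    summaries = []
    for cart, cart_events in by_cart.items():
        times = [e["timestamp"] for e in cart_events]
        added = [e for e in cart_events if e["type"] == "productAdded"]
        removed = [e for e in cart_events if e["type"] == "productRemoved"]
        bought = any(e["type"] == "cartBuy" for e in cart_events)
        last_add = max((e["timestamp"] for e in added), default=None)
        summaries.append({
            "customer": max(e["customer"] for e in cart_events),
            "cart": cart,
            "duration": max(times) - min(times),
            "bought": bought,
            "added": len(added),
            "removed": len(removed),
            # Seconds between the last added product and the last event;
            # None when nothing was added to the cart.
            "thinking": None if last_add is None else max(times) - last_add,
        })
    return summaries
```

Each returned dictionary corresponds to one output row of the script, which makes it easy to check the expected values for a handful of sample events before running the job on EMR.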

Page 32: Big Data: Mejores prácticas en AWS

Amazon QuickSight

Page 33: Big Data: Mejores prácticas en AWS

Architecture

[Diagram components: client, mobile client, API Server, cart events, Amazon Kinesis, Amazon S3 (twice), Amazon EMR, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight, AWS Data Pipeline]

Page 34: Big Data: Mejores prácticas en AWS

Machine learning and smart applications

Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available

Your data + machine learning = smart applications

Page 35: Big Data: Mejores prácticas en AWS

Introducing Amazon Machine Learning

Easy to use, managed machine learning service built for developers

Robust, powerful machine learning technology based on Amazon’s internal systems

Create models using your data already stored in the AWS cloud

Deploy models to production in seconds

Page 36: Big Data: Mejores prácticas en AWS

Building smart applications with Amazon ML

Steps: 1. Train model · 2. Evaluate and optimize · 3. Retrieve predictions

- Create a Datasource object pointing to the shopping cart processed data

- Explore and understand your data

- Transform data and train your model
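The training steps above could be scripted with boto3 roughly as follows. This is a sketch under several assumptions that the slides do not confirm: the attribute types, the choice of `buy` as the target, the name `padding` for the trailing constant column written by the Pig script, and the `ds-`/`ml-` identifier scheme are all made up for the example.

```python
import json

# Columns in the order the Pig script writes them. Types, target and the
# "padding" name for the trailing constant column are assumptions.
COLUMNS = [
    ("customer", "CATEGORICAL"),
    ("cart", "CATEGORICAL"),
    ("duration", "NUMERIC"),
    ("buy", "BINARY"),
    ("added", "NUMERIC"),
    ("removed", "NUMERIC"),
    ("thinking", "NUMERIC"),
    ("timestamp", "NUMERIC"),
    ("padding", "TEXT"),
]


def build_schema(columns, target):
    """Return an Amazon ML data schema (JSON string) for a headerless CSV."""
    return json.dumps({
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": False,
        "attributes": [
            {"attributeName": name, "attributeType": kind}
            for name, kind in columns
        ],
    })


def train(ml, s3_uri, suffix):
    """Create a datasource and a binary model through a boto3 client."""
    ml.create_data_source_from_s3(
        DataSourceId=f"ds-{suffix}",
        DataSourceName=f"carts {suffix}",
        DataSpec={
            "DataLocationS3": s3_uri,
            "DataSchema": build_schema(COLUMNS, "buy"),
        },
        ComputeStatistics=True,  # required for datasources used in training
    )
    return ml.create_ml_model(
        MLModelId=f"ml-{suffix}",
        MLModelName=f"cart model {suffix}",
        MLModelType="BINARY",
        TrainingDataSourceId=f"ds-{suffix}",
    )
```

A call such as `train(boto3.client("machinelearning"), s3_uri, "20160601")` would point the datasource at one day of processed cart data; both calls return immediately while Amazon ML builds the datasource and model asynchronously.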

Page 37: Big Data: Mejores prácticas en AWS

Building smart applications with Amazon ML

Steps: 1. Train model · 2. Evaluate and optimize · 3. Retrieve predictions

- Understand model quality

- Adjust model interpretation

Page 38: Big Data: Mejores prácticas en AWS

Explore model quality

Page 39: Big Data: Mejores prácticas en AWS

Fine-tune model interpretation

Page 40: Big Data: Mejores prácticas en AWS

Building smart applications with Amazon ML

Steps: 1. Train model · 2. Evaluate and optimize · 3. Retrieve predictions

- Batch predictions

- Real-time predictions

Page 41: Big Data: Mejores prácticas en AWS

Real-time predictions for interactive applications

Your application

Query for predictions with Amazon ML real-time API

import boto3

ml = boto3.client('machinelearning')

prediction = ml.predict(
    MLModelId='ml-dZxbrDXAstA',
    Record={
        'customer': '4634',
        'cart': '13661535770434', …
    },
    PredictEndpoint='https://realtime.machinelearning….'
)
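Once `predict` returns, the application still has to interpret the result. The sketch below assumes the response shape documented for binary models, where `Prediction.predictedScores` carries a single raw score, and applies a client-side cut-off to it. The helper names and the 0.5 default are assumptions for this example; the model's own score threshold is normally adjusted on the model itself.

```python
def raw_score(response):
    """Extract the single raw score from a binary-model predict() response."""
    scores = response["Prediction"]["predictedScores"]
    if len(scores) != 1:
        raise ValueError("expected exactly one score for a binary model")
    return next(iter(scores.values()))


def is_positive(response, threshold=0.5):
    """Return True when the raw score reaches the chosen threshold."""
    return raw_score(response) >= threshold


# Example response in the assumed shape (values are illustrative only).
sample = {
    "Prediction": {
        "predictedLabel": "1",
        "predictedScores": {"1": 0.87},
        "details": {"PredictiveModelType": "BINARY"},
    }
}

print(is_positive(sample))
print(is_positive(sample, threshold=0.9))
```

Raising the threshold trades recall for precision, which is the same lever the "Fine-tune model interpretation" slide refers to.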

Page 42: Big Data: Mejores prácticas en AWS

Architecture

[Diagram components: client, mobile client, API Server, cart events, Amazon Kinesis, Amazon S3 (twice), Amazon EMR, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight, AWS Data Pipeline]

http://amzn.to/BigDataSummit

Page 43: Big Data: Mejores prácticas en AWS