low latency polyglot model scoring using apache apex · challenges today data processing patterns...

28
Challenges today Data processing patterns Model Scoring patterns Q&A YOW DATA 2017 Ananth Gundabattula Low latency polyglot model scoring using Apache Apex

Upload: others

Post on 20-May-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Challenges today Data processing patterns

Model Scoring patterns Q&A

YOW DATA 2017

Ananth Gundabattula

Low latency polyglot model scoring using Apache Apex

Challenges today Data Processing patterns

Model scoring patterns Q&A

Yin Yang world

• Machine learning enabled applications are quickly becoming the norm

• Low latency applications put additional need to build applications using sound data engineering principles

• A perfect application needs these two communities to work in tandem

Challenges today Data Processing patterns

Model scoring patterns Q&A

Current Approach

• Machine learning libraries are not developed for • low latency execution environments • Data pipeline complexities • Some of them might not even be distributed compute engines.

• A common practise is to “migrate” the model into the production application

• Involves coding the model in the target application framework

Challenges today Data Processing patterns

Model scoring patterns Q&A

Agenda

• Apache Apex introduction • Data processing patterns • Model scoring patterns • Q&A

Challenges today Data Processing patterns

Model scoring patterns Q&A

Low latency Distributed Streaming

Enterprise grade features - Highly customisable DAG - Checkpointing - End to End Exactly once - Hadoop Security

compatible - YARN enabled

1. 2. 3. 4.

Apache Apex Introduction

Challenges today Data Processing patterns

Model scoring patterns Q&A

Streaming low latency engine

<100 ms Fraud vs No Fraud

Time dimension

Logical unit 1 Logical unit 2 Logical unit 3

VS

Time dimension

Logical unit 1 Logical unit 2 Logical unit 3

Challenges today Data Processing patterns

Model scoring patterns Q&A

Distributed Execution Engine

Scale up & Scale out

Time dimension

Logical unit 1a Logical unit 2a Logical unit 3

Time dimension

Logical unit 1b Logical unit 2b Logical unit 3

Time dimension

Logical unit 1c Logical unit 2c Logical unit 3

•YARN enabled • Resource Managed

• MESOS support on the roadmap

Challenges today Data Processing patterns

Model scoring patterns Q&A

Apex application layout

REST API

Bare Metal/ Cloud/Virtual

Hadoop - HDFS + YARN All major distributions supported

Apex Streaming Runtime High performance,Fault Tolerant,In-memory

App nApp 2App 1…

APEX STACK APEX-CLI

Command line clientYARN RESOURCE MANAGER

YARN NM

M

C

C

YARN NM

C

C

M

YARN NM

C

M

C

Submit Job/ Manipulate

Highly Available App Master (M) - Apex AM

Highly available compute units

HDFS for Persistent Storage

Capita

lise o

n

exist

ing H

adoo

p

inve

stmen

ts

Varied compute profiles

Colocate YARN traditional MR along with Apex Applications - No Disruption

Challenges today Data Processing patterns

Model scoring patterns Q&A

Apex application development model•Stream is a sequence of data tuples • An Operator consumes one or more input streams ,

processes tuples using custom business logic and emits to one or more output streams

• DAG is made up of operators and streams •Rich collection of operators available from Apache Malhar

• NOSQL - Cassandra, Geode .. • Relational - JDBC, Kudu,.. • Messaging - Kafka, JMS , Solace • File Systems - HDFS , S3, NFS • Nifi •….

Kafka Input

Kafka Input

Kafka Input

Kudu Output

Kudu output

Cassandra Enrich

Cassandra Enrich

Cassandra Enrich

Model

Scoring

Model

Scoring

Model

Scoring

Model

Scoring

Model

Scoring

Model

Scoring

KI CE MS KO

Logical model

Challenges today Data Processing patterns

Model scoring patterns Q&A

Apex application deployment model

• Non-intrusive model to meet overall SLAs • Different operators can be configured independently to meet SLA needs. Examples :

•Model scoring needs more compute resources and Kafka needs topic partition aligned throughput & offset management

• Extend Malhar Kafka Input operator to emit a business domain model from a byte[] • Custom stream codecs enable configurable tuple routing patterns

KI

KI

KI

KO

KO

CE

CE

CE

MS

MS

MS

MS

MS

MS

Each operator needs its own scaling to

meet SLAs

Challenges today Data Processing patterns

Model scoring patterns Q&A

Unifiers

Functionality specific scaling is causing backpressure on

downstream operatorsKI CE MS KO

CE

CE

CE

CE

CE

MS

MS

U

CE

CE

CE

CE

CE

MS

MS

U

U

Logical Plan

Scaled up unifiers

Bottleneck @unifier

Challenges today Data Processing patterns

Model scoring patterns Q&A

Parallel partitioning

I want to avoid shuffles for the lowest cross operator latencies

KI CE MS KOCO

Logical Plan

CEKI MS

CEKI MS

CEKI MS

CO KO

Parallel Partitioning

Challenges today Data Processing patterns

Model scoring patterns Q&A

Dynamic partitioning

• Utilise hardware for nightly batch compute needs

But most of the streaming feeds are only

during day time KI CE MS KOCO

Logical Plan

KI

CEKI MS

KI

CO KO

Dynamic Partitioning

CEKI MS

CEKI MS

CEKI MS

CO KO

Daytime topology Nighttime topology

Challenges today Data Processing patterns

Model scoring patterns Q&A

In-memory pub sub for recoverability

• High performant in-memory pub-sub messaging • Provides ordering & idempotency for failure scenarios • Buffers tuples in memory until the tuples are committed • Spills to disk in back-pressure scenarios

I want a loosely coupled operator binding for throughput handling,

recovery … T T T T T T

Cassandra Operator JVMT T T T T T

Model scoring Operator

CEKIMS

MSXX

CEKIMS

MS

Recoverability in parts of DAG

Challenges today Data Processing patterns

Model scoring patterns Q&A

Checkpoints

• Non-intrusive streaming window markers in the stream • In-memory processing of data & checkpointing at streaming window boundaries • Configurable checkpoint store - HDFS backed store replicated & highly available • One or multiple streaming windows ( configurable ) make a checkpoint boundary

• In example above, CP = 2 windows ( and is a multiple of application window time) • Persist non-transient & Operator specific checkpointing data structure - Ex: Kafka : (C,T,P,O)

But machines fail / Application

needs upgrade

T T T T T T T T T T T T T T TB B B B BE E E E ECP CP

R R

Streaming Window

CP Checkpoint state

B Begin Streaming Window E End Streaming Window

Challenges today Data Processing patterns

Model scoring patterns Q&A

Processing guaranteesBut machines fail /

Application needs upgrade

B T T T T T T T T T T TB B BE E ECP

R

X

At Least Once

T T T T T T T T T T TB B BE E ECP

R

X

At most Once

T T T BE

To downstream Operator

To downstream Operator

Exactly Once = At least once + Idempotency ( Pub-Sub ) + Operator logic

Challenges today Data Processing patterns

Model scoring patterns Q&A

Example pattern for end to end exactly once

I want exactly once semantics but NoSQL

stores do not support transactions

Exactly Once = At least once + Idempotency ( Pub-Sub ) + Operator logic

T T T T T T T T T T TB B BE E ECP

R

X

Exactly Once - Upstream window processing view

T T TBET T T

T

Safe Mode Window/s

Reconciling Mode

Normal Mode

Automatically Skipped

Business Logic callback to Exactly detect already written records

Challenges today Data Processing patterns

Model scoring patterns Q&A

Apex ecosystem support for Model scoring

There are so many options to build a model..

• R • Python ( Support coming soon )

• Scikit , numpy • H20 •SAMOA • DL4J

Challenges today Data Processing patterns

Model scoring patterns Q&A

Apex R Operator

R is my favourite ML framework as it is …. Apex R Operator JVM

R SCRIPT

•REngine instance within the same memory process of the operator JVM

• Data values in Java are pushed to the R via JRI bridge

• R Script is located and loaded into memory at startup time of the operator

• REngine instance is reused for each tuple

REngine

Challenges today Data Processing patterns

Model scoring patterns Q&A

Apex Python Operator ( Pending JIRA integration )

Python is my choice as it is …. •Python interpreter

embedded into the JVM • All CPython dependencies

automatically pulled into the execution loop

• First class support for Numpy • Same memory reference

location across JVM and Python for Numpy Datastructures!

• Better GIL management • Just drop the serialized

model in the class path

Python Apex Operator JVM

PYTHON SCRIPT

JEP ( Java Embedded Python )NumPy

Challenges today Data Processing patterns

Model scoring patterns Q&A

Apex H20 support

H20 is my choice as it is ….

•H2o POJO&MOJO approaches work out of the box

• No extra implementation required.

• H20 models can be scored inline in other operators if need be for low latencies

Generic Operator JVM

Challenges today Data Processing patterns

Model scoring patterns Q&A

Canary model deployment pattern

Want to deploy a model as a canary before enabling it

Kafka Input

Kafka Input

Kafka Input

Cassandra Enrich

Cassandra Enrich

Cassandra Enrich

Model

Scoring

Model

Scoring

Model

Scoring

Model

Scoring

Model

Scoring

Model

Scoring

Canary

Model

Kafka Input Cassandra Enrich

Model

Scoring

LOGICAL MODEL

•CE - Cassandra Enrich emits to Canary model if from Kafka partition 0 else emits to current model

• Sink DAG not shown above • Customised scaling of canary

PHYSICAL MODEL

Challenges today Data Processing patterns

Model scoring patterns Q&A

Production is your staging environment - Dormant models

Want to deploy a model that scores but does not form

part of the response

• Send tuples to both active and dormant • Make use of Cassandra output operator that supports

• Dynamic columns - NoSQL patterns for non-collision of columns • TTL out dormant data after n days if required

• Real time response is being handled on the TCP socket operator

Canary

Model

Kafka Input Cassandra Enrich

Model

Scoring

LOGICAL MODEL

TCP Socket

Cassandra Persist

Challenges today Data Processing patterns

Model scoring patterns Q&A

Ensemble of models

Want to go with an ensemble to get maximum uplift in my

application.

• Logical ensembles can be easily overlaid on a DAG • Each model of the ensemble can itself be from different machine

learning frameworks

Canary

Model

Kafka Input Cassandra Enrich

Model

Scoring

LOGICAL MODEL

TCP Socket response

Ensemble

Of 2 models

Challenges today Data Processing patterns

Model scoring patterns Q&A

Apex command line client

Want a hot deploy option to fine tune my model thresholds/add a

model already packaged

APEX-CLI YARN RESOURCE MANAGER

YARN NM

M

C

C

YARN NM

C

C

M

YARN NM

C

M

C

Submit Job/

• Change thresholds of a model when the app is still running • Change the DAG

• Add a dormant model into the DAG

Challenges today Data Processing patterns

Model scoring patterns Q&A

Customization example - Model Scoring CI/CD pipelines

Want to refresh my model parameters independent of

deployment team involvement

• Each Operator in the DAG is • Extendable • Customisable

• Python scoring operator can be extended to • Automatically fetch the

latest pickled model from a git repo every 1 hour

• Provides some independence to the data scientists.

Git aware python

operatorKafka Input Cassandra

Enrich

Python Operator

Challenges today Data Processing patterns

Model scoring patterns Q&A

Some production references

• GE prefix platform processes IOT streaming data for analytics at sub-millisecond time frames • Capitol One

• 99.999 % uptime 24x7 • Single digit millisecond end to end latencies

• Threatmetrix data pipelines for visualising fraud patterns were processed at single digit millisecond processing latencies • These times exclude the latencies to write to a Cassandra cluster

• A leading global financial institution ( non-AUS) • Demonstrate AML compliance • Integrate with Teradata, Vertica and Hadoop

Challenges today Data processing patterns

Model Scoring Patterns Q&A

?• Apex Community http://

apex.apache.org/community.html • Docs http://apex.apache.org/docs.html •Powered by Apache Apex http://

apex.apache.org/powered-by-apex.html • REST-API Server  https://github.com/

atrato/atrato-server • Twitter handle https://twitter.com/

apacheapex • Examples https://github.com/apache/

apex-malhar/tree/master/examples https://www.linkedin.com/in/ananth-kalyan-chakravarthy-ph-d-7a46156/

@_ananth_g