Capital One's Next-Generation Decisioning in Less Than 2 ms
TRANSCRIPT
12
Hard Metric | Goal
Latency | < 40 ms; ideally < 16 ms
Throughput | 2,000 events/second
Durability | No loss; every message gets exactly one response
Availability | 99.5% uptime (1.83 days of downtime/year); ideally 99.999% uptime (5.26 minutes of downtime/year)
Scalability | Can add resources and still meet latency requirements
Integration | Transparently connected to existing systems – hardware, messaging, HDFS

Soft Metric | Goal
Open Source | All components licensed as open source
Extensibility | Rules can be updated; the model is regularly refreshed
26
Performance
• Avg. 0.25 ms at 70k records/sec with 600 GB RAM
• Thread local on ~54M events – percentiles (in ms):

Throughput | Count      | Avg (ms) | 90% | 95% | 99% | 99.9% | 4 9's | 5 9's | 6 9's
70k/sec    | 54,126,122 | 0.19     | 1   | 1   | 1   | 2     | 2     | 5     | 6
27
Durability
• Two physically independent pipelines on the same cluster processing identical data
• For the same tuple, we take the best-case time between the two pipelines (see the sketch below)
– 39 records out of 5.2M exceeded 16 ms
– 173 out of 5.2M exceeded 16 ms in one pipeline but succeeded in the other
• 99.99925% success rate – "five nines"
• Average latency of 0.0981 ms
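The "five nines" figure follows from taking, for each tuple, the better of the two pipelines' response times. A minimal sketch of that reduction, with hypothetical map and field names (per-pipeline latencies keyed by tuple ID, a 16 ms budget):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-tuple merge across two redundant pipelines: a tuple misses
// the SLA only if BOTH pipelines exceeded the latency budget.
public class DualPipelineLatency {
  static final double BUDGET_MS = 16.0;

  static long countSlaMisses(Map<String, Double> pipelineA, Map<String, Double> pipelineB) {
    long misses = 0;
    for (Map.Entry<String, Double> e : pipelineA.entrySet()) {
      double a = e.getValue();
      double b = pipelineB.getOrDefault(e.getKey(), Double.MAX_VALUE);
      double bestCase = Math.min(a, b);   // best-case time between the two pipelines
      if (bestCase > BUDGET_MS) {
        misses++;
      }
    }
    return misses;
  }

  public static void main(String[] args) {
    Map<String, Double> a = new HashMap<>();
    Map<String, Double> b = new HashMap<>();
    a.put("tuple-1", 0.8);  b.put("tuple-1", 22.0);  // slow in one pipeline only: still a success
    a.put("tuple-2", 18.5); b.put("tuple-2", 19.1);  // slow in both pipelines: an SLA miss
    System.out.println(countSlaMisses(a, b));        // prints 1
  }
}
```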
31
Streaming Technologies Evaluated
• Spark Streaming
• Samza
• Storm
• Feedzai
• Infosphere Streams
• Flink
• Ignite
• VoltDB
• Cassandra
• Apex

• Of all evaluated technologies, Apache Apex is the only one ready to bring the decision-making solution to production, based on:
– Maturity
– Fault tolerance
– Enterprise readiness
– Performance
• Focus on open source
• Drive the roadmap
• Competitive advantage for C1
32
Stream Processing – Apache Storm
• An open-source, distributed, real-time computation system
– Logical operators (spouts and bolts) form statically parallelizable topologies
– Very high throughput of messages with very low latency
– Can provide <10 ms end-to-end latency under normal operation
• Basic abstractions provide an at-least-once processing guarantee

Limitations
• Nimbus is a single point of failure
– Rectified by Hortonworks, but not yet available to the public (no timeline for release)
• An upstream bolt/spout failure triggers a re-compute of the entire tree
– Parallel independent streams can only be created as separate redundant topologies
• Bolts/spouts share a JVM, making them hard to debug
• Failed tuples cannot be replayed quicker than 1 s
• No dynamic topologies
• Cannot add or remove applications without service interruption
33
Stream Processing – Apache Flink
• An open-source, distributed, real-time computation system
– Logical operators are compiled into a DAG of tasks executed by Task Managers
– Supports streaming, micro-batch, and batch compute
– Supports aggregate operations on streams (reduce, join, groupBy)
– Capable of <10 ms end-to-end latency with streaming under normal operation
• Can provide exactly-once processing guarantees

Limitations
• Failures trigger a reset of ALL operators to the last checkpoint
– Depends on an upstream message broker to track state
• Operators share a JVM
– A failure in one brings down all tasks sharing that JVM
– Hard to debug
• No dynamic topologies
• Young community, young product
34
Stream Processing – Apache Apex
• An open-source, distributed, real-time computation system on YARN
• Apex is the core system powering DataTorrent, released under the ASF
• Demonstrated high throughput with low latency running a next-generation C1 model (avg. 0.25 ms, max 2 ms, at 70k records/sec) with 600 GB RAM
• True YARN application, developed on the principles of Hadoop and YARN at Yahoo!
• Mature product
– Core principles of Apex are derived from a proven solution in Yahoo! Finance and Yahoo! Hadoop
– Operability is a first-class citizen in Apex, with a focus on enterprise capabilities
• DataTorrent (Apex) is executing on production clusters at Fortune 100 companies
35
Stream Processing – Apache Apex
Maturity
• Designed to process and manage global data for Yahoo! Finance
– Primary focus is on stability, fault tolerance, and data management
– The only OSS streaming technology considered that was designed explicitly for the financial world
• Data or computation could never be lost or replicated
• The architecture had to never go down
• The goal was to make it rock-solid and enterprise-ready before worrying about performance
• Data flows across countries – perfect for a use case that requires cross-cluster interaction

Enterprise Readiness
• Advanced support for:
– Encryption, authentication, compression, administration, and monitoring
– Deployment at scale in the cloud and on-prem – AWS, Google Cloud, Azure
• Integrates with a huge set of existing tools:
– HDFS, Kafka, Cassandra, MongoDB, Redis, ElasticSearch, CouchDB, Splunk, etc.
36
Apex Platform – Summary
• Apex architecture
– Networks of physically independent, parallelizable operators that scale dynamically
– Dynamic topology modification and deployment
– Self-healing, fault-tolerant, and recoverable
• Durable messaging queues between operators, checkpointed in memory and on disk
• The resource manager is a replicated YARN process that monitors and restarts downed operators
– No single point of failure; highly modular design
– Can specify locality of execution (avoids network and inter-process latency)
• Guarantees at-least-once, at-most-once, or exactly-once processing (a minimal DAG sketch follows the diagram below)
Directed Acyclic Graph (DAG)
[Diagram: a DAG of operators connected by tuple streams – filtered streams, enriched streams, and an output stream flowing between operators]
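To make the operator/stream model above concrete, here is a minimal sketch of an Apex application that wires a source, a filter, and an enrichment operator using the standard StreamingApplication/DAG Java API; the operator classes and their trivial logic are hypothetical stand-ins for the real pipeline stages:

```java
import com.datatorrent.api.DAG;
import com.datatorrent.api.DAG.Locality;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;
import org.apache.hadoop.conf.Configuration;

@ApplicationAnnotation(name = "DecisioningPipelineSketch")
public class DecisioningPipelineSketch implements StreamingApplication {

  // Emits synthetic events; an InputOperator is the entry point of the DAG.
  public static class EventSource extends BaseOperator implements InputOperator {
    public final transient DefaultOutputPort<String> out = new DefaultOutputPort<>();

    @Override
    public void emitTuples() {
      out.emit("event");
    }
  }

  // Drops tuples that fail a simple predicate ("Filtered Stream").
  public static class FilterOperator extends BaseOperator {
    public final transient DefaultOutputPort<String> out = new DefaultOutputPort<>();
    public final transient DefaultInputPort<String> in = new DefaultInputPort<String>() {
      @Override
      public void process(String tuple) {
        if (!tuple.isEmpty()) {
          out.emit(tuple);
        }
      }
    };
  }

  // Adds context to each tuple ("Enriched Stream").
  public static class EnrichOperator extends BaseOperator {
    public final transient DefaultOutputPort<String> out = new DefaultOutputPort<>();
    public final transient DefaultInputPort<String> in = new DefaultInputPort<String>() {
      @Override
      public void process(String tuple) {
        out.emit(tuple + "|enriched");
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    EventSource source = dag.addOperator("source", new EventSource());
    FilterOperator filter = dag.addOperator("filter", new FilterOperator());
    EnrichOperator enrich = dag.addOperator("enrich", new EnrichOperator());

    // THREAD_LOCAL keeps both ends of a stream in one thread, avoiding the
    // network and inter-process latency called out in the summary above.
    dag.addStream("events", source.out, filter.in).setLocality(Locality.CONTAINER_LOCAL);
    dag.addStream("filtered", filter.out, enrich.in).setLocality(Locality.THREAD_LOCAL);
  }
}
```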
39
Apex Platform – Cluster View
[Diagram: cluster view – a Hadoop edge node runs the DT RTS Management Server, exposed via a CLI and a REST API; the RTS App Master runs in a YARN container on a Hadoop node; streaming containers in YARN containers on other Hadoop nodes run operators (Op1, Op2, Op3) across threads (Thread1 … Thread-N). Part of the Community Edition.]
40
Apex Platform – Operators
• Operators can be dynamically scaled
• Flexible stream configuration
• Parallel Redis / HDHT DAGs
• Separate visualization DAG
• Parallel partitioning
• Durability of data
• Scalability
• Organization for the in-memory store
• Unifiers
– Combine statistics from physical partitions (a unifier sketch follows this list)
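A minimal sketch of the "unifiers" item above, assuming the standard Apex Unifier interface; the class and port names are hypothetical. Each physical partition emits a partial statistic per streaming window, and the unifier merges them into a single value:

```java
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.Operator.Unifier;
import com.datatorrent.common.util.BaseOperator;

// Hypothetical unifier: each physical partition emits a partial count per
// streaming window; this operator merges them into one cluster-wide total.
public class CountUnifier extends BaseOperator implements Unifier<Long> {
  public final transient DefaultOutputPort<Long> merged = new DefaultOutputPort<>();
  private long total;

  @Override
  public void beginWindow(long windowId) {
    total = 0;                      // reset at each window boundary
  }

  @Override
  public void process(Long partialCount) {
    total += partialCount;          // one tuple per upstream partition
  }

  @Override
  public void endWindow() {
    merged.emit(total);             // single combined statistic downstream
  }
}
```

An operator typically advertises a unifier like this by overriding getUnifier() on its output port, so the platform can insert it automatically whenever the operator is physically partitioned.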
41
Dynamic Topology Modification
• Can deploy new operators and models at run time
• Can reconfigure settings on the fly (see the property sketch below)
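A minimal sketch of what a runtime-reconfigurable setting might look like: operator properties exposed through bean-style getters and setters can generally be inspected and updated through the management tooling while the application runs. The operator, property, and threshold values here are hypothetical:

```java
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Hypothetical scoring gate whose threshold can be changed while the
// application is running, instead of redeploying the whole topology.
public class ScoreGateOperator extends BaseOperator {
  private double approvalThreshold = 0.5;   // bean property, adjustable at run time

  public final transient DefaultOutputPort<Double> approved = new DefaultOutputPort<>();
  public final transient DefaultInputPort<Double> scores = new DefaultInputPort<Double>() {
    @Override
    public void process(Double score) {
      if (score >= approvalThreshold) {
        approved.emit(score);
      }
    }
  };

  // The getter/setter pair makes the property visible to the platform's
  // configuration and management interfaces.
  public double getApprovalThreshold() {
    return approvalThreshold;
  }

  public void setApprovalThreshold(double approvalThreshold) {
    this.approvalThreshold = approvalThreshold;
  }
}
```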
42
Apex Platform – Failure Recovery
• Physical independence of partitions is critical
• Redundant STRAMs
• Configurable window size and heartbeat for low-latency recovery (a configuration sketch follows this list)
• Downstream failures do not affect upstream components
– Snapshotting only depends on the previous operator, not all previous operators
– Can deploy parallel DAGs with the same point of origin (simpler from a hardware and deployment perspective)
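The configurable window size and checkpoint interval mentioned above are ordinary DAG and operator attributes; a minimal tuning sketch using the standard Context attribute names (the 100 ms window and 10-window checkpoint interval are illustrative values, not the figures used in this system):

```java
import com.datatorrent.api.Context.DAGContext;
import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.BaseOperator;
import org.apache.hadoop.conf.Configuration;

// Sketch: a smaller streaming window plus frequent checkpoints means a failed
// operator replays less data when the platform restarts it.
public class RecoveryTuningSketch implements StreamingApplication {
  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    // Size of one streaming window, the unit of replay after a failure.
    dag.setAttribute(DAGContext.STREAMING_WINDOW_SIZE_MILLIS, 100);

    // Placeholder operator standing in for a real pipeline stage.
    BaseOperator stage = dag.addOperator("stage", new BaseOperator());

    // Checkpoint this operator every 10 streaming windows: lower values mean
    // faster recovery, higher values mean less checkpointing overhead.
    dag.setAttribute(stage, OperatorContext.CHECKPOINT_WINDOW_COUNT, 10);
  }
}
```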
43
Apex Platform – Windowing
• Sliding window and tumbling window
• Window based on checkpoint
• No artificial latency
• Used for stats measurement (a tumbling-window sketch follows)
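A minimal sketch of the tumbling-window pattern used for stats measurement: the platform calls beginWindow/endWindow on every operator at each streaming-window boundary, so per-window statistics come from those callbacks without adding artificial latency to the data path. The operator, port, and metric names are hypothetical:

```java
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

// Hypothetical per-window latency stats: counts tuples and tracks the maximum
// observed latency within each tumbling (non-overlapping) streaming window.
public class WindowStatsOperator extends BaseOperator {
  public final transient DefaultOutputPort<String> stats = new DefaultOutputPort<>();

  private long count;
  private double maxLatencyMs;

  public final transient DefaultInputPort<Double> latencies = new DefaultInputPort<Double>() {
    @Override
    public void process(Double latencyMs) {
      count++;                                        // tuples pass straight through
      maxLatencyMs = Math.max(maxLatencyMs, latencyMs);
    }
  };

  @Override
  public void beginWindow(long windowId) {
    count = 0;                                        // tumbling: state resets every window
    maxLatencyMs = 0;
  }

  @Override
  public void endWindow() {
    stats.emit("count=" + count + " maxLatencyMs=" + maxLatencyMs);
  }
}
```

A sliding window would instead retain the last N windows' partial aggregates and combine them in endWindow rather than resetting state at each window boundary.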
44
Enterprise Readiness
• Apex
– Great UI to monitor, debug, and control system performance
– Fault tolerance and recovery out of the box – no additional setup or improvement needed
• YARN is still a single point of failure; a NameNode failure can still impact the system
– Built-in support for dynamic and automatic scaling to handle larger throughputs
– Native integration with Hadoop, YARN, and Kafka – the next-gen standard at C1
– Mature product
• Principles derived from years at Yahoo! Finance and Yahoo! Hadoop
• Built and planned by deep Hadoop and streaming experts
– Proven performance in production at Fortune 100 companies
45
Enterprise Readiness
• Storm
– Widely used, but abandoned by its creators at Twitter in favor of Heron in production
– Debuggability: topology components are bundled in one process
– Resource demands
• Needs dedicated hardware
• Can't scale on demand or share usage
– Topology creation/tear-down is expensive, and topologies can't share cluster resources
• Have to manually isolate and decommission machines
– Performance in failure scenarios is insufficient for this use case
• Flink
– Operational performance has not been proven
• Only one company (ResearchGate) officially uses Flink in production
– Architecture shares fundamental limitations of Storm with regard to dynamically scaling operators and topologies, and debuggability
– Performance in failure scenarios is insufficient for this use case
46
Performance
• Storm
– Meets latency and throughput requirements only when no failures occur
– Resilience to failures is only possible by running fully independent clusters
– Difficult to debug and operationalize complex systems (due to the shared JVM and poor resource management)
• Flink
– Broader toolset than Storm or Apex – ML, batch processing, and SQL-like queries
– Meets latency and throughput requirements only when no failures occur
– Failures reset ALL operators back to the source – resilience is only possible across clusters
– Difficult to debug and operationalize complex systems (due to the shared JVM)
• Apex
– Supports redundant parallel pipelines within the same cluster
– Outstanding latency and throughput even in failure scenarios
– Self-healing independent operators (simple to isolate failures)
– Only framework to provide fine-grained control over data and compute locality
47
Roadmap – Storm
• Commercial support from Hortonworks, but limited code contributions
• Twitter – Storm's largest user – has completely abandoned Storm for Heron
• Business continuity
– Enhance Storm's enterprise readiness with high availability (HA) and failover to standby clusters
– Eliminate Nimbus as a single point of failure
• Operations
– Apache Ambari support for Nimbus HA node setup
– Elastic topologies via YARN and Apache Slider
– Incremental improvements to the Storm UI to easily deploy, manage, and monitor topologies
• Enterprise readiness
– Declarative writing of spouts, bolts, and data sources into topologies
48
Roadmap – Flink
• Fine-grained fault tolerance (avoid rollback to data source) – Q2 2015
• SQL on Flink – Q3/Q4 2015
• Integrate with distributed memory storage – no ECD (estimated completion date)
• Use off-heap memory – Q1 2015
• Integration with Samoa, Tez, Mahout DSL – no ECD
49
Roadmap – Apex
• Roadmap for the next 6 months
• Support creation of reusable pluggable modules (topologies)
• Add additional operators to connect to existing technology
– Databases
– Messaging
– Modeling systems
• Add additional SQL-like operations
– Join
– Filter
– GroupBy
– Caching
• Add the ability to create cycles in the graph
– Allows re-use of data for ML algorithms (similar to Spark's caching)
50
Roadmap Comparison
• Storm
– The roadmap is intended to bring Storm to enterprise readiness; Storm is not enterprise-ready today according to Hortonworks
• Flink
– The roadmap brings Flink up to par with Spark and Apex; it does not create new capabilities relative to either
– Spark is more mature for batch processing and micro-batch, and Apex is more mature from a streaming standpoint
• Apex
– No need to improve the core architecture; the focus is instead on adding functionality
• Better support for ML
• Better support for a wide variety of business use cases
• Better integration with existing tools
– Stated commitment to letting the community dictate direction. From the incubator proposal:
• "DataTorrent plans to develop new functionality in an open, community-driven way"
51
Community
• Vendor and community involvement drive roadmap and project growth
• Storm
– Limited improvements to core components of Storm in recent months
– Few focused and active committers
– Actively promoted and supported in public by Hortonworks
• Flink
– Some adoption in Europe, growing response in the U.S.
– 11 active committers, 10 of whom are from Data Artisans (the company behind Flink)
– The community is very young, but there is substantial interest
• Apex
– Wide support network around Apex due to its evolution alongside Hadoop and YARN
– Young but actively growing community: http://incubator.apache.org/projects/apex.html
– Opportunity for C1 to drive growth and define the direction of this product
52
Streaming Solutions Comparison
• Apex
– Ideal for this use case; meets all performance requirements and is ready for out-of-the-box enterprise deployment
– Committer status from C1 allows us to collaboratively drive the roadmap and product evolution to fit our business need
• Storm
– Great for many streaming use cases but not the right fit for this effort
– Performance in failure scenarios does not meet our requirements
– Community involvement is waning and there is a limited roadmap for substantial product growth
• Flink
– Poised to compete with Spark in the future based on community activity and roadmap
– Not ready for enterprise deployment:
• Technical limitations around fault tolerance and failure recovery
• Lack of broad community involvement
• Roadmap only brings it up to par with existing frameworks
53
New Capabilities Provided by Proposed Architecture
• Millisecond-Level Streaming Solution
• Fault Tolerant & Highly Available
• Parallel Model Scoring for an Arbitrary Number of Models
• Quick Model Generation & Execution
• Dynamic Scalability Based on Latency or Throughput
• Live Model Refresh
• A/B Testing of Models in Production
• System Is Self-Healing upon Failure of Components (**)
54
Decisioning System Architecture - Strengths
• Internal
– Capital One software, running on Capital One hardware, designed by Capital One
• Open source
– Internally maintainable code
• Living model
– Can be re-trained on current data and updated in minutes, not years
– Offline models can be expanded, re-developed, and deployed to production at will
• Extensible
– Modular architecture with swappable components
• A/B model testing in production
• Dynamic deployment / refresh of models
55
Hardware
MDC Hardware Specifications
• Server quantity – 15
• Server model – Supermicro
• CPU – Intel Xeon E5-2695v2, 2.4 GHz, 12 cores
• Memory – 256 GB
• HDD – (5) 4 TB Seagate SATA
• Network switch – Cisco Nexus 6001, 10 Gb
• NIC – 2-port SFP+ 10 GbE

MDC Software Specifications
• Hadoop – v2.6.0
• YARN – v2.6.0
• Apache Apex – v3.0
• Linux OS – RHEL v6.7
• Linux OS kernel – 2.6.32-573.7.1.el6.x86_64
56
Performance Comparison - Redis vs. Apex-HDHT
Apex-HDHT, thread local on ~2M events – percentiles (in ms):
Throughput | Count      | Avg (ms) | 90% | 95% | 99% | 99.9% | 4 9's | 5 9's | 6 9's
70k/sec    | 1,807,283  | 0.253    | 1   | 1   | 1   | 2     | 2     | 2     | 2

Apex-HDHT, thread local on ~54M events – percentiles (in ms):
Throughput | Count      | Avg (ms) | 90% | 95% | 99% | 99.9% | 4 9's | 5 9's | 6 9's
70k/sec    | 54,126,122 | 0.19     | 1   | 1   | 1   | 2     | 2     | 5     | 6

Apex-HDHT, no locality on ~2M events – percentiles (in ms):
Throughput | Count      | Avg (ms) | 90% | 95% | 99%  | 99.9% | 4 9's | 5 9's | 6 9's
40k/sec    | 2,214,777  | 51.651   | 98  | 126 | 381  | 489   | 494   | 495   | 495

Redis, thread local on ~2M events – percentiles (in ms):
Throughput | Count      | Avg (ms) | 90% | 95% | 99% | 99.9% | 4 9's | 5 9's | 6 9's
8.5k/sec   | 2,018,057  | 13.654   | 16  | 18  | 20  | 21    | 22    | 22    | 22