Webinar: Faster Big Data Analytics with MongoDB


Upload: mongodb

Post on 12-Apr-2017


CIGNEX Datamatics Confidential www.cignex.com

Webinar:

Faster Big Data Analytics with MongoDB Case Study: Building Large Scale Data Processing and Data Analysis Platform using MongoDB

Date: 06th April 2016

Speakers:

Buzz Moschetti, Enterprise Architecture and Special Programs, MongoDB

Anurag Seth, VP, Big Data Analytics & IoT Practice, CIGNEX Datamatics


Buzz Moschetti,

Enterprise Architecture and Special Programs

MongoDB

Buzz works with F1000 companies to help them design next-generation solutions and develop strategies for overall technology transformation. He is also the CTO of the partner program at MongoDB and a liaison to the Engineering, Product Management, and Marketing groups.

– 25+ years of experience in the field, mostly in financial services, as CAO of the Investment Bank at JPMorganChase and, before that, Bear Stearns

Anurag Seth,

VP, Big Data Analytics & Internet of Things (IoT) Practice, CIGNEX Datamatics

Anurag has a unique blend of technology expertise, ranging from deep-tech VLSI chip design and complex high-performance algorithmic software development in EDA (Electronic Design Automation) to embedded system design, predictive modelling, and Big Data Analytics deployment for compelling use cases (including IoT).

– 25 years of experience in technology development & delivery, in products as well as services, across VLSI/EDA, Healthcare, Enterprise Big Data Implementations & IoT

– Has served on the board of the VLSI Lab at IIT Kharagpur, was general chair of the International Conference on VLSI Design & Embedded Systems (2009), and continues to serve on the conference's steering committee


Who are we?


• Big Data Analytics: Opportunity & Challenges

• Case Study: Building Large Scale Data Processing and Analysis Platform using MongoDB

– Business Needs

– Our Approach

– Solution Architecture

– MongoDB - A Great Fit for Data Processing and Analytics

– MongoDB Performance Tuning - Our Holistic Approach

– Recommended Best Practices

• Why MongoDB?

• Why CIGNEX Datamatics?

Topics


Over 88% of data sources and types are not being analyzed.


Big Data Analytics: Business Opportunities

[Figure: data sources feeding Big Data Analytics. Sources: Transactional & Application Data, Machine Data, Enterprise Content, Social Data, Sensor Data. Labels: Volume, Velocity, Variety; Structured, Semi-structured, Un-structured. Outcomes: Reduce Operational Costs, Improved Risk Management, and many more.]


Organizations that use Big Data Analytics to integrate, process and analyze these data sources are up to 25x more likely to outperform their competitors.


Big Data Analytics: Business Opportunities

Improve Process Efficiency

(Sales, Marketing, Finance, Operations)

Product/Service Innovation

Monetize Information

Improved Collaboration

Improve customer experience

Reduce Operational Costs

Improved Risk Management


Big Data Analytics - Implementation to Production Challenges

Technical

• Getting the right data & Infra architecture for performance & scalability

• Leverage investments in existing technologies

• Integrating multi-channel & variety of data sources at the modern volume

• Data quality & accuracy challenges

• Big data technologies are evolving too quickly to adapt

• Scarcity of skills and capabilities

Business

• Hard ROI from Big Data?

– Identify & monetize existing & new Data Streams

• Turn-around time for big data (predictive modelling) deployments

• Difficult to make big data fit-for-purpose (uncertainty), assess the level of trust, and ensure security & privacy

• Lack of domain centricity


Case Study: Building Large Scale Data Processing and Analysis Platform using MongoDB


• SaaS based sales analytics platform that acquires, processes and enriches accessible public data to deliver data-driven customer and business insights that:

– Enhance efficacy of customer acquisition

– Improve operational efficiency

– Identify competitive & complementary selling opportunities

– Determine buying propensity, influencers & decision makers


Business Need

PUBLIC DATA ACQUISITION | SOCIAL LISTENING | CUSTOMER/BUSINESS INSIGHTS


Our Approach

1. DATA ACQUISITION: Define data sources that could influence the outcome.

2. DATA PREPARATION: Segment data by the most influential characteristics (the best variables to use), centred on the use case.

3. MODELING: Evaluate and combine multiple models or techniques that lead to higher efficiency.

4. ANALYTICS: Dashboard for Big Data Analytics.

• Social listening that integrates 20+ open public data sources using REST APIs. Store and manage the 1 billion+ objects expected to be ingested and processed by leveraging the elastic scalability of AWS cloud compute.

• Extensive multi-step rule-based ETL process which involves de-duplication, geo-coding, smart-filtering over a huge dataset, etc.

• Semantic associations: Leverage the power of semantic associations (NLP for Entity Extraction, Entity Associations) to process millions of entities & implement complex business rules for data enrichment and refinement.

• Machine Learning: Augment with ML algorithms in the longer run.

• Front-end application with intuitive search/mining and a dashboard with graphical visualization of thousands of records with faster response time.


Solution Architecture (High Level)

[Architecture diagram. Columns: Data Processing, Data Visualization. Hosting: Amazon Elastic Compute Cloud (EC2).

Components shown: Social Data; Market Data; External Data; Location Data; Customer Data; Data Enrichment; Data Processing Cluster (customized core Java based ETLs and Java scripts); Third Party ETL Cluster (one of these); two MongoDB Clusters, each with one MongoDB Primary and two MongoDB Secondaries; Full Text Search Engine (one of these); Front-End Application; Front End Application Framework; Jasper / Tableau / C3 / D3.js Visualization.]

MongoDB - A Great Fit for Data Processing and Analytics

Requirement: Support multiple data processing pipelines

– Via ETL Tool

– Via Custom Code

– Via Custom Scripts

MongoDB Features:

• Integration with leading data integration tools – Alteryx, Talend, Pentaho

• Java Driver to create custom business logic

• Support for server-side JavaScript to trigger custom business logic

Requirement: Sustain write throughput with increasing data volumes

MongoDB Features:

• Sharding to scale out horizontally and distribute load

• WiredTiger storage engine (version 3.0 and later), with features such as document-level concurrency that facilitate excellent write performance, optimal memory usage, and data compression for faster data access and efficient storage

Requirement: Provide low latency; support a large number of concurrent users and sustain response times

MongoDB Features:

• Sharding to route/distribute read requests to separate nodes

• Data & index compression features in the WiredTiger storage engine facilitate better performance

• Store indexes on separate mounts to improve read throughput
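To illustrate how sharding distributes load, here is a deliberately simplified Python sketch of hash-based routing. It is not how MongoDB routes requests internally (MongoDB assigns ranges of hashed shard-key values to chunks rather than taking a modulus); the function names and the "customer-N" keys are hypothetical and only show why a hashed key spreads documents evenly across nodes.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard-key value to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then pick a shard.
    return int.from_bytes(digest[:8], "big") % num_shards


def distribution(keys, num_shards: int) -> Counter:
    """Count how many keys land on each shard."""
    return Counter(shard_for(k, num_shards) for k in keys)


if __name__ == "__main__":
    # Hypothetical shard-key values; real keys come from the data model.
    sample_keys = [f"customer-{i}" for i in range(10_000)]
    for shard, count in sorted(distribution(sample_keys, 3).items()):
        print(f"shard {shard}: {count} documents")
```

Because the hash is stable, the same key always lands on the same shard, while different keys spread roughly evenly.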

Implementation Challenges

Challenge: Unifying different data processing components (ETL, custom code) & overall ETL efficiency

Solution:

• Created a custom, configurable orchestration engine which allows full or partial execution of data processing steps

• Created a dashboard which monitors the execution steps and allows a restart from anywhere in the multi-step ETL process

Challenge: Performance tuning of data processing & analysis frameworks

Solution:

• Holistic approach to performance tuning (covered in detail later)

Challenge: Serve different data analysis use cases (full text search, sub-second response times, persistent data storage)

Solution:

• Utilize complementary technologies

– MongoDB for persistent storage, horizontal scalability, analytics

– Elasticsearch or Solr for full text search use cases

Challenge: Data quality

Solution:

• We initially underestimated the extent of quality issues with the data (more so since most of the data was public). During execution, we budgeted for and hired a dedicated, experienced BA who assumed responsibility for data quality & clean-up
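The slides do not show the orchestration engine itself. The following is a minimal Python sketch of the idea (full or partial execution, per-step status for a dashboard, restart from any step). The class, the step names and the status strings are all hypothetical, not the presenters' implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, Callable[[dict], None]]


class Orchestrator:
    """Runs named ETL steps in order and tracks a status per step."""

    def __init__(self, steps: List[Step]):
        self.steps = steps
        self.status: Dict[str, str] = {name: "pending" for name, _ in steps}

    def run(self, context: dict, start_at: Optional[str] = None,
            stop_after: Optional[str] = None) -> Dict[str, str]:
        names = [name for name, _ in self.steps]
        first = names.index(start_at) if start_at else 0
        last = names.index(stop_after) if stop_after else len(names) - 1
        for name, func in self.steps[first:last + 1]:
            self.status[name] = "running"
            try:
                func(context)
            except Exception:
                # Stop here; a later run(start_at=name) resumes from this step.
                self.status[name] = "failed"
                break
            self.status[name] = "done"
        return dict(self.status)


# Hypothetical steps standing in for acquisition, de-duplication and load.
def acquire(ctx: dict) -> None:
    ctx["rows"] = ["a", "b", "a"]


def dedupe(ctx: dict) -> None:
    ctx["rows"] = sorted(set(ctx["rows"]))


def load(ctx: dict) -> None:
    ctx["loaded"] = len(ctx["rows"])


if __name__ == "__main__":
    engine = Orchestrator([("acquire", acquire), ("dedupe", dedupe), ("load", load)])
    print(engine.run({}))                                     # full run
    print(engine.run({"rows": ["x", "x"]}, start_at="dedupe"))  # partial run
```

The status dictionary returned by run() is the kind of data a monitoring dashboard could display.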


Best Practices

To be successful, you must address your overall design and technology stack, not just schema design.


A Holistic Approach to MongoDB Performance Tuning

[Diagram: layers of the stack and the tuning focus areas.

Layers: Infrastructure Layer; Storage Engine; Data Model; Query Language; Application Layer.

Focus areas: Cluster Sizing & Configuration (Right Size, Optimum Price benefit); Replica set sizing, Sharding; Map to use case, R/W Heaviness; Access pattern based Schema; Indexes, Query Tuning; MongoDB Drivers, Architecture & Design.]


• Infrastructure Sizing:

– SSDs provide a VERY SIGNIFICANT performance boost, especially for write-heavy workloads

– Investment in a CPU with more cores often delivers more benefit than investing in a faster CPU

– Ensure that your working set fits in RAM (use the db.serverStatus() command to view an estimate of the current working set size)

– Evaluate thoroughly whether journaling is needed. Remember that, with journaling turned on, MongoDB ends up using double the RAM.

• Cloud Infrastructure Capacity Planning:

– Leverage cloud platform with the right instance type by evaluating access patterns, workloads & storage requirements.
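The "working set fits in RAM" check can be sketched as simple arithmetic. In the Python sketch below, treating the working set as frequently accessed data plus indexes, the 20% headroom reserved for the OS and other processes, and all the sizes are illustrative assumptions, not figures from the case study.

```python
GIB = 1024 ** 3


def working_set_bytes(hot_data_bytes: int, index_bytes: int) -> int:
    """Rough working set: frequently accessed data plus indexes."""
    return hot_data_bytes + index_bytes


def fits_in_ram(working_set: int, ram_bytes: int, headroom: float = 0.2) -> bool:
    """True if the working set fits after reserving a fraction of RAM."""
    usable = ram_bytes * (1.0 - headroom)
    return working_set <= usable


if __name__ == "__main__":
    # Hypothetical sizes: 20 GiB of hot data and 6 GiB of indexes.
    ws = working_set_bytes(20 * GIB, 6 * GIB)
    for ram_gib in (32, 64):
        print(f"{ram_gib} GiB RAM: fits = {fits_in_ram(ws, ram_gib * GIB)}")
```

A check like this is only a starting point for comparing instance types; measured access patterns should drive the final sizing.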

A Holistic Approach to MongoDB Performance Tuning

[Diagram: Infrastructure Sizing & Capacity Planning; OS & Storage Optimisation; Design Approach, Schema Design; Query Tuning; Future Scalability]

• Storage Optimization:

– Recommend use of WiredTiger as storage engine

• OS Optimization:

– Disable NUMA (non-uniform memory access), which is not good for an operational database (configure a memory interleave policy)

– Don’t use Huge Pages virtual memory pages; mongo performs better with normal virtual memory pages

– Readahead size should be set to 32 (use blockdev --setra <value>)

– Increase ulimit (>20,000)

– Turn off atime for the storage volume containing database files


• Schema Design:

– Always invest time in schema design; a dynamic schema only means additional flexibility!

– Don’t store empty fields in documents

– Create indexes very carefully. More indexes != more performance. Indexes that do not fit in RAM are often counterproductive for performance

– No index creation on the fly

– Create indexes in a designated "maintenance window"

– Use Bulk API feature whenever possible. We have often witnessed significant gains in the write throughput

– Use index optimizations available in the WiredTiger storage engine
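Two of the schema-design points above (no empty fields, bulk writes) can be sketched in Python without a live database. Both helper names are hypothetical. The batches produced by the second helper would typically be passed to a driver's bulk call, for example PyMongo's insert_many, which appears only in a comment here because no server is assumed.

```python
from typing import Iterable, Iterator, List


def strip_empty(doc: dict) -> dict:
    """Return a copy of doc without None, "", [] or {} values.

    Nested dictionaries are cleaned recursively; lists are left as they are.
    Values such as 0 and False are kept because they carry information.
    """
    cleaned = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            value = strip_empty(value)
        if value is None or value == "" or value == [] or value == {}:
            continue
        cleaned[key] = value
    return cleaned


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group documents into lists of at most `size` for bulk writes."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    raw = {"name": "Acme", "fax": "", "tags": [], "employees": 0,
           "address": {"city": "NY", "zip": None}}
    print(strip_empty(raw))
    for group in batches((strip_empty({"n": i, "note": ""}) for i in range(5)), 2):
        # With PyMongo one might call: collection.insert_many(group, ordered=False)
        print(len(group), "documents in this batch")
```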


• Scalability:

– Horizontal scaling through sharding

– Use MongoDB aggregation framework

– Always keep the non-functional requirements (NFRs) in focus, from design through implementation.
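As a small illustration of the aggregation framework point, the Python sketch below builds a pipeline as plain data, using the standard $match, $group, $sort and $limit stages. The helper name and the "country" and "industry" field names are hypothetical. With PyMongo, such a list would be passed to collection.aggregate(pipeline).

```python
def top_values_pipeline(match: dict, group_field: str, limit: int = 10) -> list:
    """Build a pipeline that counts documents per value of group_field."""
    return [
        # Filter first so that later stages process fewer documents.
        {"$match": match},
        # Count documents for each distinct value of the grouping field.
        {"$group": {"_id": f"${group_field}", "count": {"$sum": 1}}},
        # Largest groups first.
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


if __name__ == "__main__":
    pipeline = top_values_pipeline({"country": "US"}, "industry", limit=5)
    for stage in pipeline:
        print(stage)
```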

• Query Tuning:

– Effective use of indexes to support queries

– Avoid negation in queries & scatter-gather queries

– Reduce query result set size wherever possible using limit and projections

– Effective & frequent use of MongoDB query profiler & explain command

– Leverage each utility provided by MongoDB - mongoperf, mongosniff, mongostat, mongotop
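Two of the query-tuning points (avoid negation, shrink results with limit and projections) can be sketched as Python helpers that work on query documents only. The helper names and example fields are hypothetical. The keyword names filter, projection and limit match the arguments of PyMongo's Collection.find, so the result could be used as collection.find(**kwargs).

```python
NEGATION_OPERATORS = {"$ne", "$nin", "$not", "$nor"}


def negations_in(query) -> list:
    """Recursively collect negation operators used in a filter document."""
    found = []
    if isinstance(query, dict):
        for key, value in query.items():
            if key in NEGATION_OPERATORS:
                found.append(key)
            found.extend(negations_in(value))
    elif isinstance(query, list):
        for item in query:
            found.extend(negations_in(item))
    return found


def find_kwargs(query: dict, fields: list, limit: int) -> dict:
    """Build find() arguments that return only the named fields."""
    projection = {name: 1 for name in fields}
    if "_id" not in fields:
        projection["_id"] = 0  # _id is returned unless explicitly excluded
    return {"filter": query, "projection": projection, "limit": limit}


if __name__ == "__main__":
    risky = {"status": {"$ne": "closed"}}
    better = {"status": {"$in": ["open", "new"]}}
    print(negations_in(risky), negations_in(better))
    print(find_kwargs(better, ["name", "city"], 50))
```

A check like negations_in could run in code review or tests to flag filters worth rewriting as positive matches.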


• Simplified solution architecture with the right technologies for the use case

• Performance Tuning & scalability initiated from Day 1

– Holistic approach to performance tuning reduced response times from ~2-3 minutes to ~3-5 seconds

• Proprietary & Open Source can coexist

– Leverage existing investments in proprietary tools alongside Open Source technologies that reduce licensing costs

– Leverage open source JavaScript components for visualization

• Team composition played a critical role; complementary skills are needed:

– Solution Architecture | Dev-Ops | Business Analysis/Data Science

• Elastic compute storage

– Leverage AWS cloud features of elastic scalability to upsize/downsize compute power based on data processing workloads.


Benefits Delivered


MongoDB Vital Stats

500+ employees 2000+ customers

Over $311 million in funding

Offices in NY & Palo Alto, and across EMEA and APAC


The best way to run MongoDB

Automated.

Supported.

Secured.

Features beyond those in the community edition:

Enterprise-Grade Support

Commercial License

Ops Manager or Cloud Manager Premium

Encrypted & In-Memory Storage Engines

MongoDB Compass

BI Connector (SQL Bridge)

Advanced Security

Platform Certification

On-Demand Training

MongoDB Enterprise Edition


{
  _id: "123",
  title: "MongoDB: The Definitive Guide",
  authors: [
    { _id: "kchodorow", name: "Kristina Chodorow" },
    { _id: "mdirold", name: "Mike Dirolf" }
  ],
  published_date: ISODate("2010-09-24"),
  pages: 216,
  language: "English",
  thumbnail: BinData(0,"AREhMQ=="),
  publisher: {
    name: "O'Reilly Media",
    founded: 1980,
    locations: ["CA", "NY"]
  }
}

The Data Is The Schema
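One way to read "the data is the schema" is that the structure can be recovered from a document itself. The Python sketch below (helper name hypothetical) walks a nested document and reports a type for every dotted field path. The sample mirrors the book document above, with a Python datetime standing in for ISODate and the binary thumbnail omitted.

```python
from datetime import datetime


def field_types(doc: dict, prefix: str = "") -> dict:
    """Map every dotted field path in a nested document to a type name."""
    paths = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths[path] = "object"
            paths.update(field_types(value, prefix=path + "."))
        elif isinstance(value, list):
            paths[path] = "array"
            for item in value:
                if isinstance(item, dict):
                    # Fields of embedded documents inside arrays.
                    paths.update(field_types(item, prefix=path + "."))
        else:
            paths[path] = type(value).__name__
    return paths


if __name__ == "__main__":
    book = {
        "_id": "123",
        "title": "MongoDB: The Definitive Guide",
        "authors": [{"_id": "kchodorow", "name": "Kristina Chodorow"}],
        "published_date": datetime(2010, 9, 24),
        "pages": 216,
        "publisher": {"name": "O'Reilly Media", "founded": 1980,
                      "locations": ["CA", "NY"]},
    }
    for path, kind in field_types(book).items():
        print(f"{path}: {kind}")
```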


> db.authors.find()

{
  _id: "X12",
  name: { first: "Kristina", last: "Chodorow" },
  personalData: {
    favoritePets: [ "bird", "dog" ],
    awards: [ {name: "Hugo", when: 1983}, {name: "SSFX", when: 1992} ]
  }
}
{
  _id: "Y45",
  name: { first: "Mike", last: "Dirolf" },
  personalData: {
    dob: ISODate("1970-04-05")
  }
}

Treat Your Data More Like Objects
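The two author documents above have different shapes, yet application code can map both onto one object type. The Python sketch below shows one possible mapping. The Author class and its field names are hypothetical, and a Python date stands in for ISODate.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Author:
    id: str
    first: str
    last: str
    favorite_pets: List[str] = field(default_factory=list)
    awards: List[dict] = field(default_factory=list)
    dob: Optional[date] = None

    @classmethod
    def from_document(cls, doc: dict) -> "Author":
        """Build an Author, tolerating fields that a document omits."""
        personal = doc.get("personalData", {})
        return cls(
            id=doc["_id"],
            first=doc["name"]["first"],
            last=doc["name"]["last"],
            favorite_pets=list(personal.get("favoritePets", [])),
            awards=list(personal.get("awards", [])),
            dob=personal.get("dob"),
        )


if __name__ == "__main__":
    docs = [
        {"_id": "X12", "name": {"first": "Kristina", "last": "Chodorow"},
         "personalData": {"favoritePets": ["bird", "dog"],
                          "awards": [{"name": "Hugo", "when": 1983}]}},
        {"_id": "Y45", "name": {"first": "Mike", "last": "Dirolf"},
         "personalData": {"dob": date(1970, 4, 5)}},
    ]
    for author in map(Author.from_document, docs):
        print(author)
```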


7x-10x Performance, 50%-80% Less Storage

MongoDB 3.0 Set The Stage…

How: WiredTiger Storage Engine

• Same data model, query language, & ops

• 100% backwards compatible API

• Non-disruptive upgrade

• Storage savings driven by native compression

• Write performance gains driven by

– Document-level concurrency control

– More efficient use of HW threads

• Much better ability to scale vertically

[Chart: Performance, MongoDB 3.0 vs. MongoDB 2.6]


MongoDB Sweet Spot Use Cases

Use cases: Big Data; Product & Asset Catalogs; Security & Fraud; Internet of Things; Database-as-a-Service; Mobile Apps; Customer Data Management; Single View; Social & Collaboration; Content Management; Complex Data Management; Embedded / ISV

Example customers shown: Intelligence Agencies; Top Investment and Retail Banks; Top Global Shipping Company; Top Industrial Equipment Manufacturer; Top Media Company; Cushman & Wakefield


CIGNEX Datamatics - Established in 2000, USA

12+ Open Source Framework/ Components #1 Pure Play Open

Source Services Company

15 Open Source Books Authored

Global Offices 13+ Business Engagement Platforms 4+

Open Source Community Contributions 5000+ Open Source

Implementations 500+ Open Source Consultants 500+

Portals, Content & Collaboration Portals Enterprise Integration Identity Relationship Management

Enterprise Content Management Document Management Web Content Management Learning/Knowledge Management Imaging and Scanning - OCR/Digitization Enterprise Search Business Process Management

E-Commerce B2B e-Commerce B2C e-Commerce

Internet of Things (IoT) Big Data Analytics Data Integration Information Delivery Data Analysis

Open Source Solutions

Business Engagement Platforms


At a Glance – CIGNEX Datamatics Big Data Analytics & IoT Case Studies

Improve performance through real-time intelligence by efficient device management & issue identification

GPS Services Company | Networking Company

Increase customer satisfaction & revenue due to uninterrupted video experience anywhere, anytime, on any device

Modernization of legacy Quote Portal resulting in competitive advantage – Quote in 5 minutes

Insurance Company

First mover advantage with timely launch of Sentiment and Trending Analysis service

SaaS Start-up Company | B2B Market Intelligence Services

100% Increase in Conversion Rate with Single View of Business and Market Intelligence

E-Learning Community Portal

7x-10x Efficient User Data Management with improved application performance and data security


Questions?

Test Drive Big Data Analytics & IoT Engage us for Proof-of-Concept (PoC)

Website: www.cignex.com | Email: [email protected]