webinar: faster big data analytics with mongodb
TRANSCRIPT
CIGNEX Datamatics Confidential www.cignex.com
Webinar: Faster Big Data Analytics with MongoDB
Case Study: Building Large Scale Data Processing and Data Analysis Platform using MongoDB
Date: 6 April 2016
Speakers:
Buzz Moschetti, Enterprise Architecture and Special Programs, MongoDB
Anurag Seth, VP, Big Data Analytics & IoT Practice, CIGNEX Datamatics
Buzz Moschetti,
Enterprise Architecture and Special Programs
MongoDB
Buzz works with F1000 companies to help them design next-generation solutions and develop strategies for overall technology transformation. He is also the CTO of the partner program at MongoDB and a liaison to the Engineering, Product Management, and Marketing groups.
– 25+ years of experience in the field, mostly in financial services as CAO of the Investment Bank at JPMorganChase and Bear Stearns before that
Anurag Seth,
VP, Big Data Analytics & Internet of Things (IoT) Practice, CIGNEX Datamatics
Anurag has a unique blend of technology expertise, ranging from deep-tech VLSI chip design, to complex high-performance algorithmic software development in EDA (Electronic Design Automation), to embedded system design, to predictive modelling & Big Data Analytics deployment for compelling use cases (including IoT).
– 25 years of strong experience in technology development & delivery, in products as well as services, across VLSI/EDA, Healthcare, Enterprise Big Data Implementations & IoT
– Has served on the board of the VLSI Lab at IIT Kharagpur, was the general chair of the International Conference on VLSI Design & Embedded Systems (2009), and continues to serve on the steering committee of the conference
Who are we?
• Big Data Analytics: Opportunity & Challenges
• Case Study: Building Large Scale Data Processing and Analysis Platform using MongoDB
– Business Needs
– Our Approach
– Solution Architecture
– MongoDB - A Great Fit for Data Processing and Analytics
– MongoDB Performance Tuning - Our Holistic Approach
– Recommended Best Practices
• Why MongoDB?
• Why CIGNEX Datamatics?
Topics
Over 88% of data sources and types are not being analyzed.
Big Data Analytics: Business Opportunities
[Figure: big data sources and their characteristics. Data sources: Transactional & Application Data, Machine Data, Sensor Data, Enterprise Content, Social Data. Characteristics: Volume (structured), Velocity (semi-structured), Variety (unstructured). Business opportunities shown: Reduce Operational Costs, Improved Risk Management, and many more.]
Organizations that use Big Data Analytics to integrate, process and analyze these data sources are up to 25x more likely to outperform their competitors.
Big Data Analytics: Business Opportunities
Improve Process Efficiency (Sales, Marketing, Finance, Operations)
Product/Service Innovation
Monetize Information
Improved Collaboration
Improve customer experience
Reduce Operational Costs
Improved Risk Management
Big Data Analytics - Implementation to Production Challenges
Technical
• Getting the right data & infrastructure architecture for performance & scalability
• Leveraging investments in existing technologies
• Integrating multi-channel data and a variety of data sources at modern data volumes
• Data quality & accuracy challenges
• Big data technologies are evolving too quickly to keep up with
• Scarcity of skills and capabilities
Business
• Hard ROI from Big Data?
– Identify & monetize existing & new data streams
• Turn-around time for big data (predictive modelling) deployments
• Difficulty making big data fit for purpose (uncertainty), assessing the level of trust, and ensuring security & privacy
• Lack of domain centricity
Case Study: Building Large Scale Data Processing and Analysis Platform using MongoDB
Business Need
• SaaS-based sales analytics platform that acquires, processes and enriches accessible public data to deliver data-driven customer and business insights that:
– Enhance the efficacy of customer acquisition
– Improve operational efficiency
– Identify competitive & complementary selling opportunities
– Determine buying propensity, influencers & decision makers
Stages shown: Public Data Acquisition, Social Listening, Customer/Business Insights
Our Approach
1. DATA ACQUISITION: Define data sources that could influence the outcome.
2. DATA PREPARATION: Segment data by its most influential characteristics (the best variables to use), driven by the use case.
3. MODELING: Evaluate and combine multiple models or techniques that lead to higher efficiency.
4. ANALYTICS: Dashboard for Big Data Analytics.
Implementation highlights:
• Social listening that integrates 20+ open public data sources using REST APIs. More than 1 billion objects are expected to be ingested, processed, stored and managed by leveraging the elastic scalability of AWS cloud compute.
• Extensive multi-step, rule-based ETL process involving de-duplication, geo-coding, smart filtering over a huge dataset, etc.
• Semantic associations: leverage the power of semantic associations (NLP for entity extraction, entity associations) to process millions of entities & implement complex business rules for data enrichment and refinement.
• Machine learning: augment with ML algorithms in the longer run.
• Front-end application with intuitive search/mining and a dashboard with graphical visualization of thousands of records with fast response times.
Solution Architecture (High Level)
[Architecture diagram, hosted on Amazon Cloud (Elastic Compute Cloud, EC2).
Data labels shown: Social Data, Market Data, External Data, Location Data, Customer Data.
Data Processing: a Data Enrichment / Data Processing Cluster running customized core Java based ETLs and Java scripts, plus a third-party ETL cluster (one of several options).
Storage: two MongoDB clusters, each with one MongoDB primary and two MongoDB secondaries.
Data Visualization: a front-end application built on a front-end application framework, a full text search engine (one of several options), and Jasper / Tableau / C3 / D3.js visualization.]
MongoDB - A Great Fit for Data Processing and Analytics
Requirement: Support multiple data processing pipelines
– Via ETL tool
– Via custom code
– Via custom scripts
MongoDB features:
• Integration with leading data integration tools (Alteryx, Talend, Pentaho)
• Java driver to create custom business logic
• Support for server-side JavaScript to trigger custom business logic
Requirement: Sustain write throughput with increasing data volumes
MongoDB features:
• Sharding to scale out horizontally and distribute load
• WiredTiger storage engine (MongoDB 3.0 and later), with features such as document-level concurrency that facilitate excellent write performance, optimal memory usage, and data compression for faster data access and efficient storage
Requirement: Provide low latency; support a large number of concurrent users and sustain response times
MongoDB features:
• Sharding to route/distribute read requests to separate nodes
• Data & index compression in the WiredTiger storage engine, facilitating better performance
• Indexes stored on separate mounts to improve read throughput
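The sharding features listed above can be made a little more concrete with a toy router in plain JavaScript. This is only an illustration of the idea: the hash function, the shard count and the `customerId` shard key are all hypothetical, and MongoDB's real routing (chunk ranges tracked by the config servers and applied by `mongos`) is considerably more involved. The sketch shows why a query that includes the shard key can be sent to a single shard, while a query without it has to be broadcast to every shard.

```javascript
// Toy illustration of hash-based shard routing (NOT MongoDB's actual algorithm).
// All names here (SHARD_COUNT, customerId, toyHash) are hypothetical.
const SHARD_COUNT = 3;

// Simple deterministic string hash, kept as an unsigned 32-bit value.
function toyHash(key) {
  let h = 5381;
  for (const ch of String(key)) {
    h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  }
  return h;
}

// Pick the shard that owns a given shard-key value.
function shardFor(shardKeyValue) {
  return toyHash(shardKeyValue) % SHARD_COUNT;
}

// Decide which shards a query must visit.
// A filter containing the shard key is "targeted"; anything else is scatter-gather.
function shardsForQuery(filter) {
  if (Object.prototype.hasOwnProperty.call(filter, "customerId")) {
    return [shardFor(filter.customerId)];
  }
  return Array.from({ length: SHARD_COUNT }, (_, i) => i);
}

const targeted = shardsForQuery({ customerId: "C-1001" });
const broadcast = shardsForQuery({ city: "Pune" });
console.log("query with shard key visits shards:", targeted);
console.log("query without shard key visits shards:", broadcast);
```

The same distinction is why the choice of shard key matters: queries that carry it stay on one node, and everything else fans out.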
Implementation Challenges
Challenge: Unifying different data processing components (ETL, custom code) & overall ETL efficiency
Solution:
• Created a custom, configurable orchestration engine that allows full or partial execution of data processing steps
• Created a dashboard that monitors the execution steps and allows a restart from any point in the multi-step ETL process
Challenge: Performance tuning of data processing & analysis frameworks
Solution:
• Holistic approach to performance tuning (covered in detail later)
Challenge: Serving different data analysis use cases (full text search, sub-second response times, persistent data storage)
Solution:
• Utilize complementary technologies
– MongoDB for persistent storage, horizontal scalability, analytics
– Elasticsearch or Solr for full text search use cases
Challenge: Data quality
Solution:
• We initially underestimated the extent of quality issues with the data (more so since most of the data was public). During execution, we budgeted for and hired a dedicated, experienced BA who assumed responsibility for data quality & clean-up
Best Practices
To be successful, you must address your overall design and technology stack, not just schema design.
A Holistic Approach to MongoDB Performance Tuning
Infrastructure Layer
Storage Engine
Data Model
Query Language
Application Layer
Cluster Sizing & Configuration
• Right Size
• Optimum Price benefit
Replica set sizing, Sharding
Map to use case, R/W Heaviness
Access pattern based Schema
Indexes, Query Tuning
• MongoDB Drivers
• Architecture & Design
Tuning areas: Infrastructure Sizing & Capacity Planning; OS & Storage Optimisation; Design Approach, Schema Design; Query Tuning; Future Scalability
• Infrastructure Sizing:
– SSDs provide a very significant performance boost, especially for write-heavy workloads
– Investing in a CPU with more cores often delivers more benefit than investing in a faster CPU
– Ensure that your working set fits in RAM (in MongoDB 2.4 and 2.6, db.serverStatus({ workingSet: 1 }) returns an estimate of the current working set size; the estimator was removed in 3.0)
– Evaluate thoroughly whether journaling is needed. Remember that with journaling turned on under the MMAPv1 storage engine, MongoDB maps the data files twice, so its virtual memory footprint roughly doubles.
• Cloud Infrastructure Capacity Planning:
– Choose the right cloud instance type by evaluating access patterns, workloads & storage requirements.
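The working-set guideline above can be turned into a quick back-of-the-envelope check. The sketch below is plain JavaScript with made-up numbers. The `dataSize` and `indexSize` names mirror values that `db.stats()` reports, but `hotDataFraction` and `headroom` are assumptions that would have to be estimated for a real workload.

```javascript
// Back-of-the-envelope check: does the estimated working set fit in RAM?
// hotDataFraction and headroom are assumptions, not values MongoDB reports.
function estimateWorkingSetBytes({ dataSize, indexSize, hotDataFraction }) {
  // Assume all indexes plus the frequently accessed share of the data
  // should stay in memory.
  return indexSize + dataSize * hotDataFraction;
}

function fitsInRam(stats, ramBytes, headroom = 0.8) {
  // Leave part of the RAM for the OS, connections and other processes.
  return estimateWorkingSetBytes(stats) <= ramBytes * headroom;
}

const GiB = 1024 ** 3;
// Hypothetical deployment: 200 GiB of data, 20 GiB of indexes, 25% of data "hot".
const stats = { dataSize: 200 * GiB, indexSize: 20 * GiB, hotDataFraction: 0.25 };

const workingSetGiB = estimateWorkingSetBytes(stats) / GiB;
console.log(`estimated working set: ${workingSetGiB} GiB`);
console.log("fits in 128 GiB RAM:", fitsInRam(stats, 128 * GiB));
console.log("fits in 64 GiB RAM:", fitsInRam(stats, 64 * GiB));
```

A check like this is only a starting point for choosing an instance size; measured page faults and cache statistics from the running system should take precedence.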
• Storage Optimization:
– Use WiredTiger as the storage engine
• OS Optimization:
– Disable NUMA (non-uniform memory access), which is not well suited to operational databases (configure a memory interleave policy)
– Don't use huge pages (disable Transparent Huge Pages); MongoDB performs better with normal virtual memory pages
– Set the readahead size to 32 sectors (use blockdev --setra <value>)
– Increase ulimit values (> 20,000)
– Turn off atime (mount with noatime) for the storage volume containing the database files
• Schema Design:
– Always invest time in schema design; a dynamic schema only means additional flexibility, not that design can be skipped
– Don't store empty fields in documents
– Create indexes very carefully. More indexes != more performance. Indexes that do not fit in RAM are often counterproductive for performance
– Do not create indexes on the fly
– Create indexes in a designated "maintenance window"
– Use the Bulk API whenever possible. We have often witnessed significant gains in write throughput
– Use the index optimizations available in the WiredTiger storage engine (for example, index prefix compression)
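Two of the schema tips above, dropping empty fields and batching writes, can be sketched in plain JavaScript. The helper names and the batch size are illustrative assumptions. The `insertOne` operation shape matches what `bulkWrite()` accepts in the mongo shell and drivers from MongoDB 3.2 onward; the commented-out call at the end assumes a hypothetical `companies` collection.

```javascript
// Sketch: remove empty fields, then group documents into bulkWrite-style batches.
// Helper names and the batch size are illustrative assumptions.

// Treat null, undefined, "" and [] as "empty"; keep 0 and false, which carry meaning.
function isEmpty(value) {
  return value === null || value === undefined || value === "" ||
    (Array.isArray(value) && value.length === 0);
}

// Return a shallow copy of the document without its empty top-level fields.
function stripEmptyFields(doc) {
  return Object.fromEntries(Object.entries(doc).filter(([, v]) => !isEmpty(v)));
}

// Split cleaned documents into arrays of insertOne operations.
function toBulkBatches(docs, batchSize = 1000) {
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    const ops = docs.slice(i, i + batchSize)
      .map((doc) => ({ insertOne: { document: stripEmptyFields(doc) } }));
    batches.push(ops);
  }
  return batches;
}

const rawDocs = [
  { name: "Acme", phone: "", tags: [], employees: 0 },
  { name: "Globex", phone: "555-0100", tags: ["b2b"], website: null },
  { name: "Initech", phone: undefined, active: false },
];

const batches = toBulkBatches(rawDocs, 2);
console.log(JSON.stringify(batches, null, 2));

// With a live connection, each batch could be sent in one round trip, e.g.:
// batches.forEach((ops) => db.companies.bulkWrite(ops, { ordered: false }));
```

Unordered bulk writes let the server continue past individual failures, which tends to suit ETL-style loads; ordered writes are the safer choice when later operations depend on earlier ones.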
• Scalability:
– Scale horizontally through sharding
– Use the MongoDB aggregation framework
– Keep the non-functional requirements (NFRs) front and centre from design through implementation.
• Query Tuning:
– Use indexes effectively to support queries
– Avoid negation in queries, and avoid scatter-gather queries
– Reduce the query result set size wherever possible using limit and projections
– Make effective and frequent use of the MongoDB query profiler & the explain command
– Leverage the utilities provided by MongoDB: mongoperf, mongosniff, mongostat, mongotop
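As a rough illustration of the aggregation and result-size tips above, the sketch below builds an aggregation pipeline in plain JavaScript. It filters as early as possible, caps the number of results, and projects only the fields a dashboard would need. The `leads` collection and the `industry`, `company` and `score` fields are hypothetical; the stage operators themselves (`$match`, `$group`, `$sort`, `$limit`, `$project`) are standard aggregation stages.

```javascript
// Sketch: an aggregation pipeline that filters early and returns a small result.
// Collection and field names (leads, industry, company, score) are hypothetical.
function topCompaniesPipeline({ industry, minScore, limit }) {
  return [
    // Filter first so later stages see fewer documents and an index can be used.
    { $match: { industry: industry, score: { $gte: minScore } } },
    // Aggregate per company.
    { $group: { _id: "$company", avgScore: { $avg: "$score" }, leads: { $sum: 1 } } },
    // Order and cap the result set.
    { $sort: { avgScore: -1 } },
    { $limit: limit },
    // Return only the fields the caller needs.
    { $project: { _id: 0, company: "$_id", avgScore: 1, leads: 1 } },
  ];
}

const pipeline = topCompaniesPipeline({ industry: "retail", minScore: 70, limit: 10 });
console.log(JSON.stringify(pipeline, null, 2));

// In the mongo shell the pipeline could be run, or its plan inspected, like this:
// db.leads.aggregate(pipeline)
// db.leads.aggregate(pipeline, { explain: true })
```

Inspecting the plan shows whether the leading `$match` is served by an index, which is the first thing to confirm before tuning anything else.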
Benefits Delivered
• Simplified solution architecture with the right technologies for the use case
• Performance tuning & scalability initiated from Day 1
– Holistic approach to performance tuning reduced response times from ~2-3 minutes to ~3-5 seconds
• Proprietary & open source can coexist
– Leverage existing investments in proprietary tools alongside open source technologies that reduce licensing costs
– Leverage open source JavaScript components for visualization
• Team composition played a critical role; complementary skills were needed:
– Solution Architecture | Dev-Ops | Business Analysis/Data Science
• Elastic compute & storage
– Leverage the elastic scalability of the AWS cloud to upsize/downsize compute power based on data processing workloads.
MongoDB Vital Stats
500+ employees 2000+ customers
Over $311 million in funding
Offices in NY & Palo Alto and across EMEA and APAC
The best way to run MongoDB
Automated.
Supported.
Secured.
Features beyond those in the community edition:
Enterprise-Grade Support
Commercial License
Ops Manager or Cloud Manager Premium
Encrypted & In-Memory Storage Engines
MongoDB Compass
BI Connector (SQL Bridge)
Advanced Security
Platform Certification
On-Demand Training
MongoDB Enterprise Edition
{
  _id: "123",
  title: "MongoDB: The Definitive Guide",
  authors: [
    { _id: "kchodorow", name: "Kristina Chodorow" },
    { _id: "mdirold", name: "Mike Dirolf" }
  ],
  published_date: ISODate("2010-09-24"),
  pages: 216,
  language: "English",
  thumbnail: BinData(0, "AREhMQ=="),
  publisher: {
    name: "O'Reilly Media",
    founded: 1980,
    locations: ["CA", "NY"]
  }
}
The Data Is The Schema
> db.authors.find()
{
  _id: "X12",
  name: { first: "Kristina", last: "Chodorow" },
  personalData: {
    favoritePets: [ "bird", "dog" ],
    awards: [ {name: "Hugo", when: 1983}, {name: "SSFX", when: 1992} ]
  }
}
{
  _id: "Y45",
  name: { first: "Mike", last: "Dirolf" },
  personalData: {
    dob: ISODate("1970-04-05")
  }
}
Treat Your Data More Like Objects
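To illustrate the point of the two author documents above, the following plain JavaScript sketch treats them as ordinary objects whose optional parts may or may not be present. The helper names are hypothetical and the date is simplified to a string. The shell query in the final comment is the server-side counterpart, using dot notation to reach into the nested array.

```javascript
// Sketch: documents in one collection can have different shapes.
// The two objects mirror the authors shown above (the date is simplified to a string).
const authors = [
  {
    _id: "X12",
    name: { first: "Kristina", last: "Chodorow" },
    personalData: {
      favoritePets: ["bird", "dog"],
      awards: [{ name: "Hugo", when: 1983 }, { name: "SSFX", when: 1992 }],
    },
  },
  {
    _id: "Y45",
    name: { first: "Mike", last: "Dirolf" },
    personalData: { dob: "1970-04-05" },
  },
];

// Optional fields are simply absent; default to an empty list instead of failing.
function awardNames(author) {
  const awards = (author.personalData && author.personalData.awards) || [];
  return awards.map((award) => award.name);
}

// In-memory equivalent of matching on a field inside a nested array.
function authorsWithAward(docs, awardName) {
  return docs.filter((doc) => awardNames(doc).includes(awardName));
}

for (const author of authors) {
  console.log(author.name.last, awardNames(author));
}
console.log(authorsWithAward(authors, "Hugo").map((a) => a._id));

// Shell counterpart of the last lookup, using dot notation:
// db.authors.find({ "personalData.awards.name": "Hugo" })
```

Neither document needs placeholder values for the fields it lacks, which is the practical meaning of treating the data like objects.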
7x-10x Performance, 50%-80% Less Storage
MongoDB 3.0 Set The Stage…
How: WiredTiger Storage Engine
• Same data model, query language, & ops
• 100% backwards compatible API
• Non-disruptive upgrade
• Storage savings driven by native compression
• Write performance gains driven by
– Document-level concurrency control
– More efficient use of HW threads
• Much better ability to scale vertically
[Chart: performance comparison, MongoDB 3.0 vs. MongoDB 2.6]
MongoDB Sweet Spot Use Cases
Use cases: Big Data; Product & Asset Catalogs; Security & Fraud; Internet of Things; Database-as-a-Service; Mobile Apps; Customer Data Management; Single View; Social & Collaboration; Content Management; Complex Data Management; Embedded / ISV
Example adopters shown: Intelligence Agencies; Top Investment and Retail Banks (listed under several use cases); Top Global Shipping Company; Top Industrial Equipment Manufacturer; Top Media Company; Cushman & Wakefield
CIGNEX Datamatics - Established in 2000, USA
12+ Open Source Frameworks/Components
#1 Pure Play Open Source Services Company
15 Open Source Books Authored
Global Offices: 13+
Business Engagement Platforms: 4+
Open Source Community Contributions: 5000+
Open Source Implementations: 500+
Open Source Consultants: 500+
Open Source Solutions / Business Engagement Platforms
• Portals, Content & Collaboration: Portals; Enterprise Integration; Identity Relationship Management
• Enterprise Content Management: Document Management; Web Content Management; Learning/Knowledge Management; Imaging and Scanning - OCR/Digitization; Enterprise Search; Business Process Management
• E-Commerce: B2B e-Commerce; B2C e-Commerce
• Internet of Things (IoT); Big Data Analytics; Data Integration; Information Delivery; Data Analysis
At a Glance – CIGNEX Datamatics Big Data Analytics & IoT Case Studies
• GPS Services Company: Improve performance through real-time intelligence by efficient device management & issue identification
• Networking Company: Increase customer satisfaction & revenue due to uninterrupted video experience anywhere, anytime, on any device
• Insurance Company: Modernization of legacy Quote Portal resulting in competitive advantage (quote in 5 minutes)
• SaaS Start-up Company: First-mover advantage with timely launch of Sentiment and Trending Analysis service
• B2B Market Intelligence Services: 100% increase in conversion rate with Single View of Business and Market Intelligence
• E-Learning Community Portal: 7x-10x more efficient user data management with improved application performance and data security
Questions?
Test Drive Big Data Analytics & IoT
Engage us for a Proof-of-Concept (PoC)
Website: www.cignex.com | Email: [email protected]