Full Speed Ahead: The Briefing Room with John Myers and MapR

Grab some coffee and enjoy the preshow banter before the top of the hour!

Upload: inside-analysis

Post on 14-Apr-2017

Category: Technology


TRANSCRIPT

Grab some coffee and enjoy the pre-show banter before the top of the hour!

The Briefing Room

Full Speed Ahead: Hadoop and Spark for Big Data Applications

Twitter Tag: #briefr The Briefing Room

Welcome

Host: Eric Kavanagh

@eric_kavanagh

Mission

• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!

Topics

September: HADOOP 2.0

October: DATA MANAGEMENT

November: ANALYTICS

The Age of Big Data

Analyst: John Myers

John Myers is Managing Research Director at Enterprise Management Associates.

MapR

• MapR develops Apache Hadoop-related software
• Its Hadoop distribution boasts data protection, no single point of failure and industry-leading performance
• The MapR distribution also features the complete Apache Spark stack, including Spark SQL, Spark Streaming, MLlib and GraphX

Guest: Sameer Nori

Sameer Nori is the Senior Product Marketing Manager for MapR.

© 2015 MapR Technologies

Sameer Nori Sep 29, 2015

Agenda

1. Customer Requirements
2. Hadoop ecosystem and The MapR Data Platform
3. Evolution of SQL-on-Hadoop
4. Customer Examples

MapR Architected A Platform For The Age Of Big Data

[Figure: platform evolution across the 1980s, 2000s and 2010s. Layer labels: Apps, Databases, Operational App platform, Storage. Other labels: Monolithic, Web, Big data apps; RDBMS; UNIX, Linux; SAN/NAS, Scale out; Structured, Unstructured; Operational, Analytics.]

What MapR Customers Demand

1. Efficiency at scale
   – Multi-tenancy: Ability to support multiple teams/projects on one platform
   – Resource management: MUST support Hadoop and non-Hadoop workloads
2. Real-time: MUST support real-time and batch workloads on one cluster
3. Reliable – Business continuity – must meet SLAs
4. Secure – MUST integrate with existing security & data governance standards
5. Agile – MUST support governed and exploratory BI on one platform

Architecting for Production Success

[Timeline figure. Years shown: 2004, 2006, 2009, 2011, 2013, 2015. Milestones shown: Google publishes details of GFS; Hadoop developed at Yahoo!; MapR in stealth; MapR becomes Hadoop technology leader; MapR-DB – real-time, in-Hadoop DB; MapR 5.0 – Extending Real-time beyond Hadoop for Big Data Apps.]

Built for the enterprise. Built for today’s use cases. Built for as-it-happens, agile businesses.

The Power of the Open Source Community

High Availability (HA) Everywhere

No NameNode architecture
• Distributed metadata can self-heal
• No practical limit on # of files

MapReduce/YARN HA
• Jobs are not impacted by failures
• Meet your data processing SLAs

NFS HA
• High throughput and resilience for NFS-based data ingestion, import/export and multi-client access

Instant recovery
• Files and tables are accessible within seconds of a node failure or cluster restart

Rolling upgrades
• Upgrade the software with no downtime

HA is built in
• No special configuration to enable HA
• All MapR customers operate with HA

Disaster Recovery: Mirroring

• Flexible
   – Choose the volumes/directories to mirror
   – You don’t need to mirror the entire cluster
   – Any remote cluster can run active volumes mirrored to other clusters
   – Scheduled/incremental to set low RPO
   – Promotable mirrors to set low RTO
• Fast
   – No performance impact
   – Block-level (8KB) deltas
   – Automatic compression
• Safe
   – Point-in-time consistency
   – End-to-end checksums
• Easy
   – Graceful handling of network issues
   – No third-party software
   – Takes less than two minutes to configure!
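The "block-level (8KB) deltas", "automatic compression" and "end-to-end checksums" items can be illustrated with a small sketch. This is not MapR's implementation. It only assumes fixed 8 KB blocks (the size named on the slide), a SHA-256 digest per block to detect changes, and zlib compression of the changed blocks. The names `split_blocks`, `block_digests`, `compute_delta` and `apply_delta` are hypothetical.

```python
import hashlib
import zlib
from typing import Dict, List

BLOCK_SIZE = 8 * 1024  # 8 KB, the delta granularity named on the slide


def split_blocks(data: bytes) -> List[bytes]:
    """Split a byte string into fixed-size blocks (the last one may be short)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def block_digests(data: bytes) -> List[str]:
    """Checksum every block so the two sides can be compared cheaply."""
    return [hashlib.sha256(block).hexdigest() for block in split_blocks(data)]


def compute_delta(source: bytes, mirror_digests: List[str]) -> Dict:
    """Collect (compressed) only the blocks whose checksum differs on the mirror."""
    changed = {}
    for index, block in enumerate(split_blocks(source)):
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(mirror_digests) or mirror_digests[index] != digest:
            changed[index] = zlib.compress(block)
    return {"length": len(source), "blocks": changed}


def apply_delta(mirror: bytes, delta: Dict) -> bytes:
    """Patch the mirror copy with the changed blocks, then trim to the new length."""
    blocks = split_blocks(mirror)
    for index, payload in sorted(delta["blocks"].items()):
        block = zlib.decompress(payload)
        if index < len(blocks):
            blocks[index] = block
        else:
            blocks.append(block)
    return b"".join(blocks)[:delta["length"]]
```

In this sketch a one-byte change in a 32 KB file ships a single compressed 8 KB block rather than the whole file, which is the property that makes frequent incremental mirroring (and therefore a low RPO) affordable.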

[Diagram labels: Production; Production, Research; Datacenter 1; Datacenter 2; WAN; WAN, EC2]

Multi-tenancy

Isolation
• Tasks sandboxed so they don’t impact other tasks or system daemons
• System resources protected from runaway jobs
• Volume-based data placement
• Label-based job scheduling

Quotas
• Storage quotas by volume/user/group
• CPU and memory quotas by queue/user/group

Security and delegation
• Wire-level authentication and encryption (Kerberos not required)
• Fine-grained administration permissions including volume-level delegation
• Authenticate users to AD, LDAP and Kerberos via Linux PAM

Reporting
• Detailed reporting on resource usage (75+ different metrics)
• All reports are available via UI, CLI and REST API

Data Increasingly Stored in Non-Relational Datastores

Timeline: 1980, 1990, 2000, 2010, 2020

              RELATIONAL DATABASES                       NON-RELATIONAL DATASTORES
Volume        GBs-TBs                                    TBs-PBs
Database      Fixed schema                               Dynamic / Flexible schema
              DBA controls structure                     Application controls structure
Structure     Structured                                 Structured, semi-structured and unstructured
Development   Planned (release cycle = months-years)     Iterative (release cycle = days-weeks)

Drill’s Role in the Enterprise Data Architecture

Raw data
• JSON, CSV, ...

“Optimized” data
• Parquet, …

Centrally-structured data
• Schemas in Hive Metastore

Relational data
• Highly-structured data

[Diagram labels: Hive, Impala, Spark SQL; Oracle, Teradata; Exploration (known and unknown questions)]

Drill is Designed for a Wide Set of Use Cases

Raw Data Exploration JSON Analytics Data Hub Analytics …

Hive HBase Files Directories …

{JSON}, Parquet Text Files …
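As a rough illustration of what "raw data exploration" over JSON means in practice, the sketch below discovers field names and types while reading the records, with no schema declared up front. It is plain Python, not Drill and not a Drill API; the function name `discover_schema` and the sample records are hypothetical.

```python
import json
from collections import defaultdict


def discover_schema(lines):
    """Map each dotted field path to the set of Python type names seen for it."""
    schema = defaultdict(set)

    def walk(value, path):
        # Descend into nested objects; record the type of every leaf value.
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            schema[path].add(type(value).__name__)

    for line in lines:
        if line.strip():  # tolerate blank lines in newline-delimited JSON
            walk(json.loads(line), "")
    return dict(schema)


# Two records with different shapes, as often happens in raw event data.
sample = [
    '{"user": {"id": 1, "name": "a"}, "clicks": 3}',
    '{"user": {"id": "u-2"}, "tags": ["x"]}',
]
schema = discover_schema(sample)
```

Here `user.id` turns up as both an integer and a string, and `tags` exists in only one record. A fixed-schema load would reject or coerce such data, whereas a schema-on-read approach surfaces the variation for the analyst to handle.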

Cisco: 360° Customer View
Cisco uses integrated customer data to increase revenues

OBJECTIVES
• Create shared view of customer & operations across 75,000 employees
• Increase revenue opportunities with sales partners

CHALLENGES
• Customer information was siloed in different divisions
• Customer interactions were inconsistent and not satisfying
• Missed opportunities for upselling/cross-selling

SOLUTION
• Use MapR to collect customer information across touch points
• Integrate billing, support, manufacturing, social media, websites, dial-in data
• Generate new sales leads internally and for partners

[Figure: Architecture for Sales Partner Opportunities]

Business Impact
Cisco was able to analyze service sales opportunities in 1/10 the time, at 1/10 the cost, and generated $40 million in incremental service bookings in the first year.

Cisco Data Platforms Reference Architecture

“The entire market is starting to realize that data is everywhere and an agile ecosystem is paramount. The marketplace demands the flexibility to meet specific needs and decisions are being made based on how well the ecosystem players are integrated.”

Arvind Bedi, Director IT, Cisco Systems

[Architecture diagram. Section labels: Data Sources; Data Storage and Processing; Data Consumption; Network of Trust; MapR Data Platform.
Source labels: ERP; SFDC; DATABASES; DOCS, CASES, CONTENT, SOCIAL MEDIA, CLICKSTREAM; DATA SECURITY, INFRASTRUCTURE; CUSTOMER NETWORK, PRODUCT USAGE; INTERNET OF EVERYTHING (IoE); ALL Other Sources.
Platform labels: SAP HANA ON UCS; BIG DATA PLATFORM; MAPR DISTRIBUTION FOR HADOOP; Batch (MR, Spark, Hive, Pig, …); Interactive (Drill, Impala); Streaming (Spark Streaming, Storm); MapR-DB; MapR-FS; Data Bases.
Consumption labels: MISSION CRITICAL REPORTING; MISSION CRITICAL OPERATIONAL REPORTS; FINANCIAL REPORTING & EXTRACT; AGILE ANALYTICS; SELF SERVICE DASHBOARD; RAPID BUSINESS MODEL; DATA EXPLORATION; REAL TIME PREDICTIVE; DATA ANALYSIS, TEXT ANALYTICS; MACHINE LEARNING, STATISTICAL ANALYSIS; MACHINE DATA INSIGHTS; FINANCIALS STABLE CORE CONTROLLED CHANGE; (Mobile/ Browser/ Data Service).]

comScore: Internet Analytics and Ad Optimization
comScore delivers insights about online consumer behavior

OBJECTIVES
• Provide digital analytics services: syndicated and custom solutions in audience measurement, e-commerce, advertising, video & mobile

CHALLENGES
• Keeping up with data. In the past 5 years, comScore’s volume of new data/month has grown from 100 billion to 1.7 trillion records

SOLUTION
• comScore chose MapR for NFS, performance, operational efficiency
• MapR processes over 1.7 trillion Internet and mobile records/month, reaching more than 90% of the Internet population
• MapR streaming writes eliminated Cassandra staging cluster cost

Business Impact
“HDFS is great internally, but to get data in and out of Hadoop, you have to do some kind of HDFS export. With MapR, you can just mount [HDFS] as NFS and then use native tools whether they’re in Windows, Unix, Linux or whatever.” - Mike Brown, comScore CTO
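The point of the comScore quote about mounting the cluster over NFS is that ordinary file tools then work on cluster data without an export step. A minimal sketch, assuming the cluster is already mounted on the local machine: the function below uses only Python's standard file APIs, and `count_rows` and any directory passed to it are hypothetical, not part of a MapR API.

```python
import csv
from pathlib import Path


def count_rows(mounted_dir, pattern="*.csv"):
    """Count data rows per CSV file under a directory using plain file I/O."""
    counts = {}
    for path in sorted(Path(mounted_dir).glob(pattern)):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row, if any
            counts[path.name] = sum(1 for _ in reader)
    return counts
```

Called as `count_rows("/path/to/nfs/mount/exports")` (a placeholder path), the same code would run unchanged against a local directory or an NFS-mounted one, since nothing in it is Hadoop-specific.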

Getting Started with MapR

On-Demand Training: https://www.mapr.com/training

MapR Sandbox: https://www.mapr.com/sandbox

Perceptions & Questions

Analyst: John Myers

Importance of Low-Latency in Next Generation Data Management

Disparate Data Sources

© 2015 Enterprise Management Associates, Inc.

Empowering the Line of Business

Latency of Processing

Obstacles Implementing Analytics

Managing Processing Latency

Questions

Discussion Questions

• What sets Apache Drill above other SQL-on-Hadoop options? Several are either in “development” or available with standard distributions.

• How does Spark work with MapReduce to provide both the “high speed” and the “high capacity”? Many business users “want it all and they want it now”…

• Lacking a “structure”, or using variable, multi-structured data sets, causes issues for SQL toolsets. How does MapR approach the ingestion of those variable sources before they are “finalized” or during times of flux?

• Continuous data streams are becoming more important as a part of sensor and IoT use cases. How does MapR handle the truly real-time aspects of data ingestion as well as data query?

• EMA research is showing the growth of data democratization, that is, the penetration of data “work” and decision-making in organizations. How many users of MapR environments are business stakeholders vs. technologists?

Upcoming Topics

www.insideanalysis.com

September: HADOOP 2.0

October: DATA MANAGEMENT

November: ANALYTICS

THANK YOU for your ATTENTION!

Some images provided courtesy of Wikimedia Commons