TRANSCRIPT
INTRODUCING CLOUDERA DATA PLATFORM
Gergely Devenyi | Director of Engineering
Balazs Gaspar | Solutions Engineer
© 2019 Cloudera, Inc. All rights reserved.
• 85 Countries
Customers
3,000+
Employees
2,000+
Partners
3,000+
8/10 TOP GLOBAL BANKING
10/10 TOP GLOBAL TELCO
9/10 TOP GLOBAL PHARMA
40+ GOVERNMENT CUSTOMERS (PUBLIC)
8/10 TOP GLOBAL TECHNOLOGY
10/10 TOP GLOBAL AUTOMOTIVE
THE NEW CLOUDERA
OUR CUSTOMERS ARE ASKING FOR
Hybrid Deployments
• Move data and applications without rewriting and retraining
• Separate data management strategy from infrastructure strategy
• Manage all environments from a single pane of glass
Multi-Function & Open
• Deploy one platform to address current and future workload needs
• Connect disparate workload types to develop Edge2AI applications on one platform
• Open source and open APIs
Secure & Governed
• Manage data security and governance centrally
• Automate application security at all layers
• Reduce time to value with enterprise-grade productivity tools
Customer Experience
• Easy to use with self-serve capabilities
• Elasticity and agility to meet changing demands of workloads and company
• Simple to manage and maintain environments and applications
HOW TIMES HAVE CHANGED
2008: SCALE 1 JOB TO 1000s OF SERVERS
2019: SCALE 1 PLATFORM TO 1000s OF USERS
DATA TEAMS ARE HIGHLY SPECIALIZED
App Developers
Data Engineers
Compliance Officers
Data Architects
BI Analysts
Data Scientists
Infrastructure Managers
SPECIALIZATION CREATES A DIVERSITY OF NEEDS
App Developers
Data Engineers
Compliance Officers
Data Architects
BI Analysts
Data Scientists
Infrastructure Managers
Continuous availability, custom tooling
Capacity guarantees to enable consistent SLAs
Capacity on demand to support bursty workloads
Latest tools and hardware, ad-hoc resources
Seamlessly integrated data landscape
Fine-grained access controls, privacy, and verifiable audit
Reliability, cost and scale, fault tolerance
A DATA PLATFORM DESIGNED FOR MULTI-TENANCY
Cloudera Data Platform
SDX: Centralized Data, Security, Governance and Management
Data Architects, Compliance Mgrs., Infrastructure Mgrs.
App Developers, Data Engineers, BI Analysts, Data Scientists
CUSTOM ENVIRONMENTS (×4)
Confidential — Restricted 9
OUR APPROACH
CLOUDERA DATA PLATFORM
CDP HOME
A single login to access the full platform, documentation, and support - all controlled through corporate SSO
DATA HUB
A familiar and highly customizable cluster service optimized for the separation of storage and compute
Data Engineers
App Developers
DATA WAREHOUSE
A data warehousing service optimized for concurrency, caching, and isolation
BI Analysts
MACHINE LEARNING
A machine learning workspace service to connect teams of data scientists to enterprise data
Data Scientists
WORKLOAD MANAGER
A centralized management tool for analyzing and optimizing workloads within and across environments
Data Engineers, BI Analysts
REPLICATION MANAGER
A centralized management tool for replicating and migrating data, metadata, and policies between environments
Data Architects
DATA CATALOG
A centralized data stewardship tool for searching, organizing, securing, and governing data across environments
Compliance Officers
MANAGEMENT CONSOLE
A single pane of glass to manage 100s of clusters, all with different lifecycles, across multiple environments
Infrastructure Managers
INSIDE LOOK INTO CDP
CDP ENABLES INFRASTRUCTURE-AGNOSTIC DEPLOYMENTS
CDP Data Center (monocluster, bare metal, no containers)
• Spark, Hive, Impala, HBase, ...
• SDX (backed by HDFS)
CDP Private Cloud (separate storage / compute, containers)
• SDX (backed by HDFS / Ozone)
• DataHub (on VMs)
• CDW (on K8s)
• CML (on K8s)
CDP Public Cloud (separate storage / compute, containers)
• SDX (backed by S3 / ADLS / GCS)
• DataHub (on VMs)
• CDW (on K8s)
• CML (on K8s)
CDP Management Console
• Data Catalog
• Workload Manager
• Replication Manager
CDP HIGH LEVEL ARCHITECTURE
Management Console - A single pane of glass to manage one or more environments and the services that run within each environment
Environment - A logical encapsulation of a customer network and the services that run within that network (like an Azure virtual network)
Cluster - A distributed computing service that runs on VMs (Data Hub) or K8s (the experiences) and has access to the shared data lake
SDX - The data access control layer that sits on top of the backend object store and provides coherent data security and governance for all the applications running within the environment
Diagram labels: Management Console, Environment, SDX, Data Hub clusters, CDW (DW) clusters, CML (ML) clusters, Data Catalog, Workload Manager, Replication Manager
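The four definitions above describe a containment hierarchy, which can be sketched as a small object model. This is only an illustration of the relationships named on the slide, not CDP's API: the class names, fields, the `SUBSTRATE` table, and the example environment and cluster names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Where each cluster kind runs, per the slide: Data Hub on VMs,
# the "experiences" (CDW, CML) on K8s.
SUBSTRATE = {"DataHub": "VMs", "CDW": "K8s", "CML": "K8s"}


@dataclass
class Cluster:
    name: str
    kind: str  # one of the keys of SUBSTRATE

    @property
    def substrate(self) -> str:
        return SUBSTRATE[self.kind]


@dataclass
class Environment:
    """A customer network plus the services running inside it."""
    name: str
    object_store: str  # backend store that SDX sits on top of
    clusters: List[Cluster] = field(default_factory=list)

    def add_cluster(self, cluster: Cluster) -> Cluster:
        self.clusters.append(cluster)
        return cluster


@dataclass
class ManagementConsole:
    """Single pane of glass over one or more environments."""
    environments: List[Environment] = field(default_factory=list)

    def register(self, env: Environment) -> Environment:
        self.environments.append(env)
        return env

    def inventory(self) -> List[Tuple[str, str, str]]:
        # Flatten to (environment, cluster, substrate) rows.
        return [
            (env.name, c.name, c.substrate)
            for env in self.environments
            for c in env.clusters
        ]


console = ManagementConsole()
prod = console.register(Environment(name="prod", object_store="S3"))
prod.add_cluster(Cluster(name="etl", kind="DataHub"))
prod.add_cluster(Cluster(name="bi", kind="CDW"))
print(console.inventory())
```

Every cluster in an environment shares that environment's SDX layer, which is why the object store is modelled once per environment rather than once per cluster.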
CLOUDERA DATA PLATFORM
DATA MANAGEMENT FOR BUILDING TRUSTED DATA LAKES
DATA LINEAGE
Maintaining lineage requires integrating changes across disparate systems to determine data origin and track data throughout its lifecycle.
MASTER DATA MANAGEMENT
To effectively manage data, it is critical to define a common point of reference, a system of record, to ensure data quality, consistency, and integrity across all applications.
SEARCH AND INDEX
To explore data classification, audit information, and metadata, navigation paths need to be in place for running multiple queries across multiple data types.
DATA RETENTION AND ARCHIVAL
Archiving the least frequently used data keeps the data lake performant by reducing the volume of unused data.
DATA QUALITY AND PROFILING
There is a need to understand the data and ensure prescribed data quality rules are applied.
SECURITY AND ACCESS CONTROL
Security around the data supply chain process requires access control, agreed-upon tokenization or encryption standards, and monitoring and alerting systems.
METADATA MANAGEMENT
Metadata, the attributes gathered from data, has to be integrated into a repository and maintained.
AUDITING
Auditing needs to be carried out to account for the data and ensure users are compliant across multiple environments.
BUSINESS DEFINITIONS
A business glossary provides a common vocabulary and standardized data definitions, which facilitates communication across teams.
GOVERNANCE
The people involved in maintaining data need a framework to govern processes and workflows among various governance roles.
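The DATA LINEAGE item (determining data origin and tracking data through its lifecycle) can be pictured as a walk over a graph of "derived from" edges. The sketch below is a generic illustration under that assumption; it is not how Atlas or any CDP component stores lineage, and the `LineageGraph` class and dataset names are hypothetical.

```python
from collections import defaultdict


class LineageGraph:
    """Toy lineage store: each dataset maps to the datasets it was derived from."""

    def __init__(self):
        self._parents = defaultdict(set)

    def record(self, source, target):
        """Record that `target` was derived from `source`."""
        self._parents[target].add(source)

    def origins(self, dataset):
        """Return the root datasets that `dataset` ultimately derives from.

        A dataset with no recorded parents is its own origin.
        """
        roots, seen, stack = set(), set(), [dataset]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            parents = self._parents.get(node, set())
            if not parents:
                roots.add(node)
            stack.extend(parents)
        return roots


g = LineageGraph()
g.record("raw_orders", "clean_orders")
g.record("raw_customers", "clean_customers")
g.record("clean_orders", "sales_report")
g.record("clean_customers", "sales_report")
print(sorted(g.origins("sales_report")))
```

The hard part the slide points at is not the traversal but collecting the edges: each system that transforms data has to report its changes into one shared record.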
CDP DATA CENTER – POWERED BY CLOUDERA RUNTIME
New features for CDH 6 customers
Ranger
• Dynamic row filtering
• Dynamic column masking
• Attribute-based access control
• SparkSQL fine-grained access control
Atlas 2.0
• Advanced data discovery
• Improved performance and scalability
Hive 3
• Better fit for EDW Optimization use cases (large joins, analytical style workloads)
Knox
• Gateway-based SSO
Hive on Tez
• Better ETL performance
New features for HDP 3 customers
Cloudera Manager
• Virtual private clusters
• Automated wire encryption setup
• Fine-grained RBAC for administrators
• Streamlined maintenance workflows
Atlas 2.0
• Advanced data lineage
• Faceted search
Impala
• Better fit for Data Mart migration use cases (interactive, BI style queries)
Hue
• Built-in SQL editor
Kudu
• Better performance for fast changing / updateable data
Includes SDX and many other important capabilities
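To make the Ranger items "dynamic row filtering" and "dynamic column masking" concrete, here is a minimal conceptual sketch of what such policies do to a query result. It is plain Python and does not use or imitate Ranger's API or policy format; the `POLICIES` table, the group names, and the masking helpers are hypothetical.

```python
import hashlib


def mask_hash(value):
    """Replace a value with a short, stable hash (illustrative mask)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def mask_last4(value):
    """Show only the last four characters (illustrative mask)."""
    s = str(value)
    return "x" * max(len(s) - 4, 0) + s[-4:]


# Hypothetical policy table: user group -> row predicate and per-column masks.
POLICIES = {
    "analysts_eu": {
        "row_filter": lambda row: row["region"] == "EU",
        "column_masks": {"card_number": mask_last4, "email": mask_hash},
    },
    "auditors": {
        "row_filter": lambda row: True,  # sees every row
        "column_masks": {},              # sees every column unmasked
    },
}


def apply_policy(rows, group):
    """Return the rows `group` may see, with masked columns rewritten."""
    policy = POLICIES[group]
    visible = []
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row filtering: drop rows the group may not see
        masked = dict(row)  # copy so the source rows stay untouched
        for column, mask in policy["column_masks"].items():
            if column in masked:
                masked[column] = mask(masked[column])
        visible.append(masked)
    return visible


rows = [
    {"region": "EU", "card_number": "4111111111111111", "email": "a@example.com"},
    {"region": "US", "card_number": "5500000000000004", "email": "b@example.com"},
]
print(apply_policy(rows, "analysts_eu"))
```

The point of "dynamic" is that the same table yields different results per user or group at query time, with no need to maintain separate filtered or masked copies of the data.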
WHEN CAN I GET IT?
CDP PUBLIC CLOUD
• AWS (Q3) / EKS
• AZURE / AKS
• GCP / GKE
CDP DATA CENTER
• CONTINUITY
• “CDP BARE METAL”
• CLOUDERA RUNTIME
CDP PRIVATE CLOUD
• KUBERNETES-BASED
• “BATTERIES INCLUDED”
• 3rd PARTY K8s DISTROS
Timeline axis: Q3, Q4, 2020
HDP 2.x / 3.x
CDH 5.x / 6.x+
EDGE TO AI
CLOUDERA DATAFLOW DATA-IN-MOTION PLATFORM
CLOUDERA MACHINE LEARNING
Accelerate and simplify machine learning from research to production
ANALYZE DATA
• Explore data securely and share insights with the team
TRAIN MODELS
• Run, track, and compare reproducible experiments
DEPLOY APIs
• Deploy and monitor models as APIs to serve predictions
MANAGE SHARED RESOURCES
• Provide a secure, collaborative, self-service platform for your data science teams
WHAT INDUSTRIALIZED MACHINE LEARNING LOOKS LIKE
Predictive Services
BI Tools and SQL Editors
Data Products
DATA, METADATA, SECURITY, GOVERNANCE, WORKLOAD MANAGEMENT
MACHINE LEARNING
DATA ENGINEERING
DATA WAREHOUSE
OPERATIONAL DATABASE
Sensors/IoT Devices
DATA FLOW & STREAMING
DRIVING INNOVATION IN STREAMING
THANK YOU