Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
TRANSCRIPT
![Page 1: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/1.jpg)
Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling
Ermias Gebremeskel
www.hops.io @hopshadoop
![Page 2: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/2.jpg)
Marketing 101: Celebrity Endorsements
Hi! I’m Leslie Lamport* and even though you’re not using Paxos, I approve this product.
*Turing Award Winner 2014, Father of Distributed Systems
![Page 3: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/3.jpg)
Talk Overview
•Multi-tenancy in Hadoop
•Multi-tenancy in HopsWorks
•Free-Text Search of Hadoop Metadata in HopsWorks
•Zeppelin and Flink in HopsWorks
![Page 4: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/4.jpg)
Goal: Multi-Tenancy and Data Sharing
Project NSA
Project X
No Unauthorized Copying/Cross-Linking of Data
DataSet
owns
authorize access
![Page 5: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/5.jpg)
Access Control in Relational Databases
# How do we provide multi-tenancy for users alice and bob using two databases, db1 and db2?
grant all privileges on db1.* to 'alice'@'%';
grant all privileges on db2.* to 'bob'@'%';
# More fine-grained privileges
grant select on db2.sensitiveTable to 'alice'@'192.168.1.2';
What happens to the privileges if I call “drop table db2.sensitiveTable”?
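The question can be made concrete with a toy model. The sketch below is not MySQL; the `Catalog` class and its method names are hypothetical. It only shows what happens when grants are stored separately from the tables they name, with nothing linking their lifetimes. MySQL documents similar behaviour: privileges granted specifically for a table are not automatically dropped with the table.

```python
class Catalog:
    """Toy model: tables and grants live in separate, unlinked structures."""

    def __init__(self):
        self.tables = set()
        self.grants = set()  # entries are (user, privilege, table)

    def create_table(self, name):
        self.tables.add(name)

    def grant(self, user, privilege, table):
        self.grants.add((user, privilege, table))

    def drop_table(self, name):
        # Only the table goes away; nothing ties the grants to its lifetime.
        self.tables.discard(name)

    def dangling_grants(self):
        """Grants that refer to a table that no longer exists."""
        return sorted(g for g in self.grants if g[2] not in self.tables)


catalog = Catalog()
catalog.create_table("db2.sensitiveTable")
catalog.grant("alice", "SELECT", "db2.sensitiveTable")
catalog.drop_table("db2.sensitiveTable")
dangling = catalog.dangling_grants()
```

In this model the grant outlives the table, so a new table created later under the same name would silently inherit alice's access.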
![Page 6: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/6.jpg)
Access Control in Hadoop: Apache Sentry
How do you ensure the consistency of the policies and the data?
[Mujumdar’15]
![Page 7: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/7.jpg)
Policy Editor for Sentry
![Page 8: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/8.jpg)
Performance of Policy Enforcement Points (PEP)
*https://docs.wso2.com/display/IS500/XACML+Performance+in+the+Identity+Server
![Page 9: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/9.jpg)
PEPs + Hadoop = Horse-Drawn Sportscar
Policy Enforcement Engines ≈ O(2,000) ops/sec
HopsFS Distributed Filesystem ≈ O(100,000) ops/sec
Horse-Drawn Sportscar
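The arithmetic behind the metaphor, as a rough sketch: if every filesystem operation must first pass through a policy enforcement point, the slower stage sets the ceiling. The two figures are the order-of-magnitude numbers from this slide; the function and variable names are illustrative only.

```python
def pipeline_throughput(stage_ops_per_sec):
    """A request that must pass every stage is limited by the slowest stage."""
    return min(stage_ops_per_sec.values())


# Order-of-magnitude figures taken from the slide, not measurements.
stages = {"policy_enforcement_point": 2_000, "hopsfs": 100_000}

effective = pipeline_throughput(stages)
slowdown = stages["hopsfs"] / effective
unused_fraction = 1 - effective / stages["hopsfs"]
```

Under these assumptions the filesystem would run about 50 times below its capacity, leaving roughly 98% of it unused.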
![Page 10: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/10.jpg)
HopsWorks
![Page 11: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/11.jpg)
Users, DataSets, and Projects
In-Place Data Sharing - not Copying!
DataSet1 DataSet2 DataSet3
Project 1 Project 2 Project 3
![Page 12: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/12.jpg)
User
•Authentication Provider
- JDBC Realm
- 2-Factor Authentication
- LDAP
![Page 13: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/13.jpg)
Project
•Members
- Roles: Owner, Data Scientist
•DataSets
- Home project
- Can be shared
![Page 14: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/14.jpg)
Project Roles
•Owner Privileges
- Import/Export data
- Manage Membership
- Share DataSets
•Data Scientist Privileges
- Write code
- Run code
- Request access to DataSets
We delegate administration of privileges to users
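The two roles above can be read as a simple role-to-privilege table. The following is a minimal sketch, not HopsWorks code: the identifiers are made up for illustration, and it assumes (the slide does not say) that an Owner can also do everything a Data Scientist can.

```python
from enum import Enum


class Role(Enum):
    OWNER = "Owner"
    DATA_SCIENTIST = "Data Scientist"


# Privilege names are hypothetical labels for the bullets on the slide.
DATA_SCIENTIST_PRIVILEGES = frozenset(
    {"write_code", "run_code", "request_dataset_access"}
)
OWNER_ONLY_PRIVILEGES = frozenset(
    {"import_export_data", "manage_membership", "share_datasets"}
)

PRIVILEGES = {
    Role.DATA_SCIENTIST: DATA_SCIENTIST_PRIVILEGES,
    # Assumption: an Owner also holds the Data Scientist privileges.
    Role.OWNER: DATA_SCIENTIST_PRIVILEGES | OWNER_ONLY_PRIVILEGES,
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the role's privilege set contains the action."""
    return action in PRIVILEGES[role]
```

For example, `is_allowed(Role.DATA_SCIENTIST, "share_datasets")` is false in this sketch, which matches the slide: a Data Scientist can request access to a DataSet but cannot share one.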
![Page 15: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/15.jpg)
Sharing DataSets between Projects
The same as Sharing Folders in Dropbox
![Page 16: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/16.jpg)
Delegate Access Control to HDFS
•HDFS enforces access control
•Convention for directories
•Hadoop and HopsWorks use the same Users and Groups in a common DB
•UserId per Project
•GroupId per Project and DataSet
With Hadoop metadata in a DB, we guarantee policy integrity with Foreign Keys
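A minimal sketch of the foreign-key idea, using SQLite from the Python standard library as a stand-in for the database behind Hops. The table and column names are hypothetical and are not the Hops schema. A share row cannot refer to a DataSet that does not exist, and deleting the DataSet removes its shares with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
CREATE TABLE dataset (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE dataset_share (
    dataset_id INTEGER NOT NULL REFERENCES dataset(id) ON DELETE CASCADE,
    project    TEXT NOT NULL,
    PRIMARY KEY (dataset_id, project)
);
""")

conn.execute("INSERT INTO dataset (id, name) VALUES (1, 'DataSet1')")
conn.execute("INSERT INTO dataset_share VALUES (1, 'Project 2')")

# A share that names a non-existent DataSet is rejected outright.
try:
    conn.execute("INSERT INTO dataset_share VALUES (99, 'Project 3')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the DataSet removes its shares as part of the same statement.
conn.execute("DELETE FROM dataset WHERE id = 1")
remaining_shares = conn.execute(
    "SELECT COUNT(*) FROM dataset_share"
).fetchone()[0]
```

Because the policy rows and the metadata rows sit in the same database, there is no window in which a policy refers to data that has already been deleted.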
![Page 17: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/17.jpg)
Engine – HopsFS, HopsYARN
![Page 18: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/18.jpg)
HopsFS
Diagram labels: Stateless NameNodes, Leader, NDB, DataNodes, HopsWorks J2EE Servers (x2), Metadata & policies
![Page 19: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/19.jpg)
HopsYARN
Diagram labels: ResourceMgrs, Scheduler, Resource Trackers, NDB, NodeManagers, HopsWorks J2EE Servers (x2), Metadata & policies
![Page 20: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/20.jpg)
Data Abstraction Layer (DAL)
NameNode (Apache v2)
ResourceMgr (Apache v2)
DAL API (Apache v2)
NDB-DAL-Impl (GPL v2)
Other Impl (Other License)
hops-2.4.0.jar, dal-ndb-2.4.0-7.4.7.jar
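The layering on this slide can be sketched as an interface with swappable implementations. This is an illustration in Python rather than the Hops DAL itself: `InodeStore`, `InMemoryInodeStore`, and the `NameNode` methods shown here are hypothetical names. The point is that the NameNode depends only on the interface, so a storage implementation under a different license can ship as a separate artifact.

```python
from abc import ABC, abstractmethod
from typing import Optional


class InodeStore(ABC):
    """Hypothetical stand-in for the DAL API: the only storage surface the NameNode sees."""

    @abstractmethod
    def put(self, path: str, inode: dict) -> None:
        ...

    @abstractmethod
    def get(self, path: str) -> Optional[dict]:
        ...


class InMemoryInodeStore(InodeStore):
    """One possible implementation; an NDB-backed one would implement the same interface."""

    def __init__(self):
        self._rows = {}

    def put(self, path: str, inode: dict) -> None:
        self._rows[path] = inode

    def get(self, path: str) -> Optional[dict]:
        return self._rows.get(path)


class NameNode:
    """Depends on the InodeStore interface only, never on a concrete backend."""

    def __init__(self, store: InodeStore):
        self._store = store

    def mkdir(self, path: str) -> None:
        self._store.put(path, {"type": "dir"})

    def exists(self, path: str) -> bool:
        return self._store.get(path) is not None


namenode = NameNode(InMemoryInodeStore())
namenode.mkdir("/Projects/demo")
```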
![Page 21: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/21.jpg)
Hops Performance
![Page 22: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/22.jpg)
HopsFS Metadata Scaleout
Assuming a 256 MB block size and a 100 GB JVM heap for Apache Hadoop
![Page 23: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/23.jpg)
HopsFS Throughput (Real Workload)
Experiments performed on AWS EC2 with enhanced networking and c3.8xlarge instances
![Page 24: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/24.jpg)
What else can we do with metadata in a DB?
![Page 25: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/25.jpg)
How ACME Inc. handles Free-Text Search
HDFS
In Theory: Unified Search and Update API
In Practice: Inconsistent Metadata
![Page 26: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/26.jpg)
Global Search: Projects and DataSets
![Page 27: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/27.jpg)
Project Search: Files, Directories
![Page 28: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/28.jpg)
Design your own Extended Metadata
![Page 29: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/29.jpg)
MetaData Entry
![Page 30: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/30.jpg)
Free Text Search with Consistent Metadata
Free-Text Search
Distributed Database
ElasticSearch
The Distributed Database is the Single Source of Truth.
Foreign keys ensure the integrity of Metadata.
MetaData Designer
MetaData Entry
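A rough sketch of what "single source of truth" means for search: the index holds nothing that cannot be derived from database rows. Here SQLite stands in for the distributed database and a Python dictionary stands in for ElasticSearch. The table name, columns, and the full-rebuild strategy are assumptions made for illustration, not a description of how HopsWorks keeps the two in sync.

```python
import sqlite3
from collections import defaultdict


def build_index(conn):
    """Derive an inverted index (term -> set of file ids) purely from database rows."""
    index = defaultdict(set)
    rows = conn.execute("SELECT file_id, description FROM file_metadata")
    for file_id, description in rows:
        for term in description.lower().split():
            index[term].add(file_id)
    return index


def search(index, term):
    """Return the ids of files whose metadata contains the term."""
    return sorted(index.get(term.lower(), set()))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_metadata (file_id INTEGER PRIMARY KEY, description TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO file_metadata VALUES (?, ?)",
    [(1, "genomic samples 2015"), (2, "sensor logs 2015")],
)

hits_before = search(build_index(conn), "2015")

# The database changes first; the index is then re-derived from it.
conn.execute("DELETE FROM file_metadata WHERE file_id = 1")
hits_after = search(build_index(conn), "2015")
```

Since the index is always derived from the database, a stale or lost index can be discarded and rebuilt without losing any metadata.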
![Page 31: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/31.jpg)
Flink and Zeppelin in HopsWorks
![Page 32: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/32.jpg)
Batch Job Analytics
![Page 33: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/33.jpg)
Interactive Analytics: Flink on Zeppelin
![Page 34: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/34.jpg)
Other Features
•Audit Logs
•Erasure Coding Replication
•Online upgrade of Hops (and NDB)
•Automated Installation with Karamel
•Tinker friendly – easy to extend metadata!
![Page 35: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/35.jpg)
Conclusions
•Hops is a next-generation distribution of Hadoop.
•HopsWorks is a frontend to Hops that supports true multi-tenancy, free-text search, interactive analytics with Zeppelin/Flink/Spark, and batch jobs.
•Looking for contributors/committers
- Pick-me-up on GitHub
www.hops.io
![Page 36: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/36.jpg)
The Team
Academics: Jim Dowling, Seif Haridi
PostDocs: Gautier Berthou
PhDs: Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh
MSc Students: K. Srijeyanthan “Sri”, Evangelos Savvidis, Seçkin Savaşçı, Ermias Gebremeskel
Alumni: Steffen Grohsschmiedt, Theofilos Kakantousis, Stig Viaene, Andre Moré, Qi Qi, Alberto Lorente, Hooman Peiro, Jude D’Souza, Nikolaos Stanogias, Daniel Bali, Ioannis Kirkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
![Page 38: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/38.jpg)
HDFS v2 Architecture
Diagram labels: HDFS Client, NameNode, Standby NameNode, Snapshot Node, Journal Nodes, Zookeeper, DataNodes
Active-Standby Replication of NN Log
Agreement on the Active NameNode
Faster Recovery - Cut the NN Log
Doesn’t Scale Out
![Page 39: Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin](https://reader031.vdocuments.mx/reader031/viewer/2022030305/587138481a28abf0568b6311/html5/thumbnails/39.jpg)
YARN Architecture
NodeManagers
YARN Client
Zookeeper
ResourceMgr Standby
ResourceMgr
1. Master-Slave Replication of RM State
2. Agreement on the Active ResourceMgr