![Page 1: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/1.jpg)
End-to-End Security and Auditing in a Big-Data-as-a-Service (BDaaS) Deployment
Abhiraj Butala – BlueData
Nanda Vijaydev – BlueData
![Page 2: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/2.jpg)
Big-Data-as-a-Service (BDaaS)
“A mechanism for the delivery of statistical analysis tools and information that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.”
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
On-Demand, Self-Service, Elastic
Big Data Infrastructure, Applications, Analytics
![Page 3: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/3.jpg)
Multi-Tenant Big-Data-as-a-Service
[Diagram: tenants MARKETING (360 Customer View), R&D (Log Analysis), and MANUFACTURING (Predictive Maintenance) run clusters labeled Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, and Dev/Test 2.4 over a shared Data/Storage layer with a Data Lake and Staging]
Multiple compute services (Hadoop, BI, Spark)
There is a shared Data Lake (Shared HDFS)
![Page 4: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/4.jpg)
Why BDaaS? – Compute Side Of The Story
• Set of applications that interact with Hadoop keeps growing
• Various versions of the same app/distro run in parallel
• Enterprises need to scale compute up and down based on usage
• A model similar to AWS, with S3 as storage and applications on EC2
![Page 5: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/5.jpg)
Why BDaaS? – Data Side Of The Story
• Production cluster access takes time and is generally restricted
• Staging clusters may not have all the data
• Data exists on other storage systems such as NFS (Isilon is common)
• Users also want to upload arbitrary files for analysis
![Page 6: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/6.jpg)
Hadoop – A Collection Of Services
Hadoop is a collection of storage and compute services such as HDFS, HBase, Hive, YARN, Solr, and Kafka
![Page 7: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/7.jpg)
Security In Hadoop
• Authenticate users into the Hadoop ecosystem
– Each service has its own integration with LDAP/AD for authentication
• Authorize and limit their actions to selected services. Authorization is granted separately for each service. Example:
– Folder “/user/customer” in HDFS grants ‘r-x’ to user ‘alice’, and ‘-wx’ to user ‘bob’
– Enable column-level access to a Hive table: “Customer.Name” & “Customer.PhoneNumber” are only accessible by some users and groups
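The folder example above can be sketched in a few lines of Python. This is only an illustrative model, not how HDFS or Ranger evaluates access: the `ACL` table and the `is_allowed` helper are hypothetical names introduced here, and the only data used is the slide's own example (`alice` with `r-x`, `bob` with `-wx`).

```python
# Hypothetical in-memory model of the slide's example; not an HDFS API.
ACL = {
    "/user/customer": {"alice": "r-x", "bob": "-wx"},
}


def is_allowed(user: str, path: str, action: str) -> bool:
    """Return True if `user` holds the `action` bit ('r', 'w' or 'x') on `path`."""
    if action not in ("r", "w", "x"):
        raise ValueError(f"unknown action: {action!r}")
    # Users or paths without an entry get no permissions at all.
    perms = ACL.get(path, {}).get(user, "---")
    return action in perms


if __name__ == "__main__":
    for user in ("alice", "bob"):
        for action in ("r", "w", "x"):
            print(user, action, is_allowed(user, "/user/customer", action))
```

Because each service keeps its own table of this kind, the same user can end up with different rights in HDFS and in Hive unless the rules are managed in one place.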
![Page 8: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/8.jpg)
Ranger – A Pluggable Security Framework
• Ranger works with a common user DB (LDAP/AD) for authentication
• Provides a plug-in for individual Hadoop services to enable authorization
• Allows users to define policies in a central location, using the web UI or APIs
• Users can define their own plug-in for a custom service and manage them centrally via Ranger Admin
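As a rough illustration of defining a policy through the API rather than the web UI, the sketch below builds the JSON body for an HDFS path policy. The field names follow the policy model of Ranger's public v2 REST API as best understood here and should be checked against the Ranger version in use; the service name `datalake_hdfs`, the policy name, and the helper `build_hdfs_policy` are assumptions made for this example. Nothing is sent over the network.

```python
import json


def build_hdfs_policy(service, name, path, user, access_types):
    """Build a policy body granting `user` the given access types on `path`.

    Field names are an assumption based on Ranger's public v2 policy model.
    """
    return {
        "service": service,  # name of the HDFS service registered in Ranger Admin
        "name": name,
        "isEnabled": True,
        "resources": {"path": {"values": [path], "isRecursive": True}},
        "policyItems": [
            {
                "users": [user],
                "accesses": [{"type": t, "isAllowed": True} for t in access_types],
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical service and policy names; the path and user come from the slides.
    policy = build_hdfs_policy(
        "datalake_hdfs", "customer-read", "/user/customer", "alice", ["read", "execute"]
    )
    print(json.dumps(policy, indent=2))
```

In a real deployment this body would typically be posted to Ranger Admin with an authenticated HTTP client; the exact endpoint and credentials depend on the installation.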
![Page 9: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/9.jpg)
Defining HDFS Ranger Policies
HDFS Policy List
Marketing Policy Drill Down
![Page 10: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/10.jpg)
Security Considerations in BDaaS
[Diagram: the multi-tenant BDaaS layout shown earlier, annotated with three concerns: 1. User Identity – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer]
1. User identity within a Data Lake
2. User identity in application layer
3. Prevent data duplication & maintain user integrity across layers
![Page 11: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/11.jpg)
1. Securing The Data Lake
[Diagram: the multi-tenant BDaaS layout with LDAP and a KDC on the Data/Storage layer. Steps: 1. Authentication & Authorization – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer]
![Page 12: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/12.jpg)
2. Securing The App Layer
[Diagram: the multi-tenant BDaaS layout with the same three steps. Labels: LDAP, KDC (twice), Data/Storage, and users Alice, Bob, and Tom]
App containers are integrated with LDAP
![Page 13: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/13.jpg)
3. Identity Propagation to Data Layer
[Diagram: the multi-tenant BDaaS layout with the same three steps. Labels: LDAP, KDC (twice), Data/Storage, and users Alice, Bob, and Tom]
![Page 14: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/14.jpg)
User Identity Propagation
Two Ways
– Users connect directly to HDFS
• Simple Authentication
• Kerberos Authentication
– Users connect to HDFS via a Super-user (Impersonation)
![Page 15: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/15.jpg)
HDFS Direct Connections
[Diagram: users Alice, Bob, and Tom in the tenant clusters connect directly to the HDFS Data Lake; LDAP and KDC are shown alongside]
![Page 16: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/16.jpg)
HDFS Direct Connections..
– hdfs-audit.log
– Ranger policies are enforced for alice and bob as they are the effective users
![Page 17: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/17.jpg)
HDFS Direct Connections..
• Single Hadoop Setup
– Ideal
• Multi-tenant, Multi-application Setup
– Kerberized HDFS needs kerberized compute and services
– May not want to kerberize Dev/QA setups
– Hadoop versions should be compatible all across
– Data duplication
![Page 18: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/18.jpg)
HDFS Super-user Connections
• Super-users perform actions on behalf of other users (Impersonation/Proxying)
• Adding a new super-user is easy
– core-site.xml
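The slide does not show the configuration itself. As a hedged sketch, the Python below generates the two `hadoop.proxyuser.*` properties that Hadoop reads from core-site.xml to let a super-user impersonate others. The property name pattern is standard Hadoop; the super-user name `datatap`, the host, and the group are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(superuser, hosts, groups):
    """Return the core-site.xml properties that allow `superuser` to impersonate
    members of `groups` when connecting from `hosts`."""
    return {
        f"hadoop.proxyuser.{superuser}.hosts": ",".join(hosts),
        f"hadoop.proxyuser.{superuser}.groups": ",".join(groups),
    }


def to_core_site_xml(props):
    """Render the properties as a <configuration> fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical super-user, host, and group names.
    props = proxyuser_properties("datatap", ["datatap-host.example.com"], ["marketing"])
    print(to_core_site_xml(props))
```

Restricting both the hosts and the groups keeps the impersonation right narrow: the super-user can act only for the listed groups and only from the listed machines.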
![Page 19: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/19.jpg)
HDFS Super-user Connections..
[Diagram: users Alice, Bob, and Tom in the tenant clusters reach the HDFS Data Lake via the DataTap Caching Service, which connects as a super-user; LDAP and KDC are shown alongside]
![Page 20: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/20.jpg)
HDFS Super-user Connections..
– hdfs-audit.log
– Ranger Authorization policies still enforced, as alice and bob are effective users
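The audit log screenshots are not reproduced in this transcript. To show how the effective user can be told apart from the super-user, here is a minimal parser for the tab-separated `key=value` layout that HDFS audit entries commonly use. The two sample lines are constructed for illustration (they are not taken from the slides), and the `datatap` account name and file paths are assumptions.

```python
import re

# "alice (auth:SIMPLE)" or "alice (auth:PROXY) via datatap (auth:SIMPLE)"
UGI_RE = re.compile(
    r"^(?P<user>\S+) \(auth:(?P<auth>\w+)\)"
    r"(?: via (?P<real>\S+) \(auth:(?P<real_auth>\w+)\))?$"
)


def parse_audit_line(line):
    """Extract the effective user, the real (proxying) user, and the outcome."""
    _, _, payload = line.partition("FSNamesystem.audit: ")
    fields = dict(item.split("=", 1) for item in payload.strip().split("\t"))
    match = UGI_RE.match(fields["ugi"])
    if match is None:
        raise ValueError(f"unrecognised ugi field: {fields['ugi']!r}")
    return {
        "allowed": fields["allowed"] == "true",
        "effective_user": match["user"],
        "real_user": match["real"],  # None for a direct connection
        "cmd": fields["cmd"],
        "src": fields["src"],
    }


if __name__ == "__main__":
    # Illustrative lines only; timestamps, IPs, and paths are made up.
    direct = ("2016-06-28 10:15:02,123 INFO FSNamesystem.audit: allowed=true\t"
              "ugi=alice (auth:SIMPLE)\tip=/10.0.0.5\tcmd=open\t"
              "src=/user/customer/a.csv\tdst=null\tperm=null")
    proxied = ("2016-06-28 10:16:40,456 INFO FSNamesystem.audit: allowed=false\t"
               "ugi=bob (auth:PROXY) via datatap (auth:SIMPLE)\tip=/10.0.0.9\t"
               "cmd=open\tsrc=/user/secret/b.csv\tdst=null\tperm=null")
    print(parse_audit_line(direct))
    print(parse_audit_line(proxied))
```

In both cases the first name in the `ugi` field is the effective user, which is the identity that authorization policies are evaluated against.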
![Page 21: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/21.jpg)
HDFS Super-user Connections..
Multi-tenant, Multi-application Setup
– Works for applications which don’t support Kerberos (yet)
– Dev/Test setups need not be kerberized
– DataTap service can abstract version incompatibilities
– Can help avoid data duplication
– Need tight LDAP/AD integration though!
![Page 22: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/22.jpg)
Ranger in Action
Hue Example
![Page 23: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/23.jpg)
HDFS Permissions on Data Lake
• Set HDFS file access for ‘/user/secret’ to strict mode
• Set umask to ‘077’
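The effect of a ‘077’ umask can be worked out directly. The sketch below assumes the usual HDFS behaviour of starting from 777 for new directories and 666 for new files and then clearing the umask bits; the helper names `apply_umask` and `to_symbolic` are introduced here for illustration.

```python
def apply_umask(requested_mode: int, umask: int) -> int:
    """Clear every permission bit that is set in the umask."""
    return requested_mode & ~umask


def to_symbolic(mode: int) -> str:
    """Render a 9-bit permission mode as an 'rwxrwxrwx'-style string."""
    flags = "rwxrwxrwx"
    return "".join(
        flag if mode & (1 << (8 - i)) else "-" for i, flag in enumerate(flags)
    )


if __name__ == "__main__":
    umask = 0o077
    for label, requested in (("directory", 0o777), ("file", 0o666)):
        effective = apply_umask(requested, umask)
        print(f"{label}: {oct(effective)} {to_symbolic(effective)}")
```

With this umask, group and other receive no bits at all, so any access beyond the owner has to be granted explicitly, for example through a Ranger policy.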
![Page 24: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/24.jpg)
HDFS Ranger Policies
![Page 25: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/25.jpg)
DataTap Caching Service
![Page 26: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/26.jpg)
Create Table via Hue
![Page 27: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/27.jpg)
Query table via Hue - Success
![Page 28: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/28.jpg)
Query table via Hue - Failure
![Page 29: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/29.jpg)
Ranger Audit Logs
![Page 30: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/30.jpg)
Key Takeaways
• BDaaS is more than Hadoop-as-a-Service
– Includes BI / ETL / Analytics + Data Science tools
• Security is an important consideration in BDaaS
• Data duplication is not an option
• Global user authentication using a centralized DB like LDAP/AD is a must
• Apache Ranger helps in enforcing global policies, provided user identities are propagated correctly
![Page 31: Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016](https://reader035.vdocuments.mx/reader035/viewer/2022070603/586f77f51a28ab10258b69a3/html5/thumbnails/31.jpg)
Q & A
www.bluedata.com
Nanda Vijaydev @nandavijaydev
Abhiraj Butala @abhirajbutala