Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Summit 2016
TRANSCRIPT
End-to-End Security and Auditing in a Big-Data-as-a-Service (BDaaS) Deployment
Abhiraj Butala - BlueData
Nanda Vijaydev - BlueData
Big-Data-as-a-Service (BDaaS)
“A mechanism for the delivery of statistical analysis tools and information that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.”
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
On-Demand, Self-Service, Elastic
Big Data Infrastructure, Applications, Analytics
Multi-Tenant Big-Data-as-a-Service
[Diagram: tenants MARKETING, R&D, and MANUFACTURING, with use cases 360 Customer View, Log Analysis, and Predictive Maintenance, run clusters labeled Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, and Dev/Test 2.4 on top of a Data/Storage layer (Data Lake, Staging)]
Multiple compute services (Hadoop, BI, Spark)
There is a shared Data Lake (Shared HDFS)
Why BDaaS? – Compute Side Of The Story
• Set of applications that interact with Hadoop keeps growing
• Various versions of the same app/distro run in parallel
• Enterprises need to scale compute up and down based on usage
• A model similar to Amazon AWS with S3 as storage and applications on EC2
Why BDaaS? – Data Side Of The Story
• Production cluster access takes time and is generally restricted
• Staging clusters may not have all the data
• Data exists on other storage systems such as NFS; Isilon is common
• Users also want to upload arbitrary files for analysis
Hadoop – A Collection Of Services
Hadoop is a collection of storage and compute services such as HDFS, HBase, Hive, YARN, Solr, and Kafka
Security In Hadoop
• Authenticate the user into the Hadoop ecosystem
– Each service has its own integration with LDAP/AD for authentication
• Authorize users and limit their actions to selected services. Authorization is granted separately for each service. Example:
– Folder “/user/customer” in HDFS grants ‘r-x’ to user ‘alice’ and ‘-wx’ to user ‘bob’
– Enable column-level access to a Hive table: “Customer.Name” & “Customer.PhoneNumber” are only accessible by some users and groups
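The HDFS example above can be sketched as a small permission check. This is only an illustration of how 'r-x' and '-wx' translate into allowed actions; the GRANTS table and the function names are hypothetical and are not part of any Hadoop API.

```python
# Map each position in a permission string such as 'r-x' to an action.
ACTIONS = (("r", "read"), ("w", "write"), ("x", "execute"))

# Hypothetical per-user grants, taken from the slide's example.
GRANTS = {"/user/customer": {"alice": "r-x", "bob": "-wx"}}


def allowed_actions(triad):
    """Return the set of actions granted by a three-character permission string."""
    if len(triad) != 3:
        raise ValueError(f"expected three characters, got {triad!r}")
    granted = set()
    for flag, (letter, action) in zip(triad, ACTIONS):
        if flag == letter:
            granted.add(action)
        elif flag != "-":
            raise ValueError(f"unexpected flag {flag!r} in {triad!r}")
    return granted


def is_allowed(user, path, action):
    """Check an action for a user; unknown users or paths get no access."""
    triad = GRANTS.get(path, {}).get(user, "---")
    return action in allowed_actions(triad)


print(is_allowed("alice", "/user/customer", "read"))   # alice holds 'r'
print(is_allowed("bob", "/user/customer", "read"))     # bob does not
```

Under these grants alice can list and read the folder but not write to it, while bob can write into it but not read what is there.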
Ranger – A Pluggable Security Framework
• Ranger works with a common user DB (LDAP/AD) for authentication
• Provides a plug-in for individual Hadoop services to enable authorization
• Allows users to define policies in a central location, using the web UI or APIs
• Users can define their own plug-in for a custom service and manage it centrally via Ranger Admin
Defining HDFS Ranger Policies
HDFS Policy List
Marketing Policy Drill Down
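The slides show these policies as screenshots of the Ranger web UI. As a rough sketch of what an equivalent policy document could look like when defined through the API instead, the snippet below builds one as JSON. The field names follow the general shape of Ranger's public REST policy model as best understood here and should be verified against the Ranger version in use; the service name "datalake_hdfs", the policy name, the path, and the user grants are all hypothetical.

```python
import json


def build_hdfs_policy(service, name, path, user_access):
    """Build a policy document; user_access maps a user to a list of access types."""
    return {
        "service": service,          # hypothetical Ranger service name
        "name": name,
        "isEnabled": True,
        "resources": {"path": {"values": [path], "isRecursive": True}},
        "policyItems": [
            {
                "users": [user],
                "accesses": [{"type": t, "isAllowed": True} for t in types],
            }
            for user, types in sorted(user_access.items())
        ],
    }


policy = build_hdfs_policy(
    service="datalake_hdfs",
    name="marketing",
    path="/user/customer",
    user_access={"alice": ["read", "execute"], "bob": ["write", "execute"]},
)
print(json.dumps(policy, indent=2))
```

The sketch only constructs and prints the document; submitting it to Ranger Admin is left out because the endpoint and credentials depend on the deployment.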
Security Considerations in BDaaS
[Diagram: the multi-tenant deployment annotated with three concerns: 1. User Identity - Data Lake; 2. User Identity - Application Level; 3. User Identity propagation to Data Layer]
1. User identity within a Data Lake
2. User identity in application layer
3. Prevent data duplication & maintain user integrity across layers
1. Securing The Data Lake
[Diagram: LDAP and KDC attached to the Data/Storage layer (Data Lake, Staging); callout 1, Authentication & Authorization - Data Lake, is highlighted]
2. Securing The App Layer
[Diagram: the multi-tenant deployment with LDAP and KDC; users Alice, Bob, and Tom are shown at the application level; callout 2, User Identity - Application Level, is highlighted]
App containers are integrated with LDAP
3. Identity Propagation to Data Layer
[Diagram: the multi-tenant deployment with LDAP and KDC; users Alice, Bob, and Tom are shown at the application level; callout 3, User Identity propagation to Data Layer, is highlighted]
User Identity Propagation
Two ways:
– Users connect directly to HDFS
• Simple Authentication
• Kerberos Authentication
– Users connect to HDFS via a super-user (Impersonation)
HDFS Direct Connections
[Diagram: users Alice, Bob, and Tom in the tenant clusters connect directly to the HDFS Data Lake; LDAP and KDC shown]
HDFS Direct Connections..
– hdfs-audit.log
– Ranger policies are enforced for alice and bob as they are the effective users
HDFS Direct Connections..
• Single Hadoop Setup
– Ideal
• Multi-tenant, Multi-application Setup
– Kerberized HDFS needs kerberized compute and services
– May not want to kerberize Dev/QA setups
– Hadoop versions should be compatible all across
– Data duplication
HDFS Super-user Connections
• Super-users perform actions on behalf of other users (Impersonation/Proxying)
• Adding a new super-user is easy
– core-site.xml
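As a sketch of what the core-site.xml change involves: Hadoop's proxy-user mechanism is configured with hadoop.proxyuser.<superuser>.hosts and hadoop.proxyuser.<superuser>.groups properties. The helper below renders those two properties for a given super-user. The super-user name "datatap", the host name, and the group names are hypothetical placeholders, not values from the talk.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(superuser, hosts, groups):
    """Render the two proxy-user properties for one super-user as a <configuration> element."""
    config = ET.Element("configuration")
    for suffix, values in (("hosts", hosts), ("groups", groups)):
        prop = ET.SubElement(config, "property")
        ET.SubElement(prop, "name").text = f"hadoop.proxyuser.{superuser}.{suffix}"
        ET.SubElement(prop, "value").text = ",".join(values)
    return config


config = proxyuser_properties(
    superuser="datatap",                     # hypothetical super-user name
    hosts=["gateway1.example.com"],          # hosts the super-user may connect from
    groups=["marketing", "rnd"],             # groups whose members may be impersonated
)
xml_text = ET.tostring(config, encoding="unicode")
print(xml_text)
```

Restricting hosts and groups to what is actually needed, rather than using a wildcard, keeps the super-user from impersonating arbitrary users from arbitrary machines.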
HDFS Super-user Connections..
[Diagram: users Alice, Bob, and Tom in the tenant clusters reach the HDFS Data Lake via the DataTap Caching Service, which connects as a super-user; LDAP and KDC shown]
HDFS Super-user Connections..
– hdfs-audit.log
– Ranger Authorization policies still enforced, as alice and bob are effective users
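To make the notion of an effective user concrete, the sketch below pulls the effective and real user out of the ugi field of an audit entry. It assumes the tab-separated key=value layout and the "user (auth:METHOD) via realuser (auth:METHOD)" form that HDFS audit entries generally use. The two sample lines are hand-written for illustration, contain only the key=value portion of an entry, and are not taken from the talk; the super-user name "datatap" is hypothetical.

```python
import re

# "alice (auth:PROXY) via datatap (auth:SIMPLE)" or just "bob (auth:SIMPLE)"
UGI_PATTERN = re.compile(
    r"^(?P<effective>\S+) \(auth:(?P<method>\w+)\)"
    r"(?: via (?P<real>\S+) \(auth:\w+\))?$"
)

# Illustrative, hand-written sample entries (not real log output).
PROXY_LINE = (
    "allowed=true\tugi=alice (auth:PROXY) via datatap (auth:SIMPLE)\t"
    "ip=/10.0.0.5\tcmd=open\tsrc=/user/customer/part-0\tdst=null\tperm=null"
)
DIRECT_LINE = (
    "allowed=true\tugi=bob (auth:SIMPLE)\t"
    "ip=/10.0.0.6\tcmd=create\tsrc=/user/customer/new-file\tdst=null\tperm=null"
)


def parse_fields(line):
    """Split a tab-separated audit entry into a dict of key=value fields."""
    parts = line.strip().split("\t")
    return dict(part.split("=", 1) for part in parts if "=" in part)


def effective_and_real_user(line):
    """Return (effective_user, real_user); real_user is None for direct connections."""
    match = UGI_PATTERN.match(parse_fields(line)["ugi"])
    if match is None:
        raise ValueError("unrecognized ugi field")
    return match.group("effective"), match.group("real")


print(effective_and_real_user(PROXY_LINE))
print(effective_and_real_user(DIRECT_LINE))
```

In the impersonated case the first name is the effective user, which is the identity that authorization policies are evaluated against; the name after "via" is the super-user that opened the connection.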
HDFS Super-user Connections..
Multi-tenant, Multi-application Setup
– Works for applications which don’t support Kerberos (yet)
– Dev/Test setups need not be kerberized
– DataTap service can abstract version incompatibilities
– Can help avoid data duplication
– Need tight LDAP/AD integration though!
Ranger in Action
Hue Example
HDFS Permissions on Data Lake
• Set HDFS file access for ‘/user/secret’ to strict mode
• Set umask to ‘077’
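The effect of a ‘077’ umask can be worked through with a little arithmetic. The sketch below assumes the usual convention that new directories start from mode 777 and new files from mode 666 before the umask is applied; in HDFS the umask is typically set through the fs.permissions.umask-mode property. The helper names are illustrative only.

```python
def apply_umask(requested_mode, umask):
    """Clear every permission bit that is set in the umask."""
    return requested_mode & ~umask


def to_symbolic(mode):
    """Render the low nine permission bits as a string such as 'rwx------'."""
    flags = "rwxrwxrwx"
    return "".join(
        flag if mode & (1 << (8 - position)) else "-"
        for position, flag in enumerate(flags)
    )


UMASK = 0o077
dir_mode = apply_umask(0o777, UMASK)    # directories start from 777
file_mode = apply_umask(0o666, UMASK)   # files start from 666

print(oct(dir_mode), to_symbolic(dir_mode))
print(oct(file_mode), to_symbolic(file_mode))
```

With this umask, group and other receive no permission bits on newly created files and directories, so access for anyone but the owner has to come from explicit policies such as those defined in Ranger.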
HDFS Ranger Policies
DataTap Caching Service
Create Table via Hue
Query table via Hue - Success
Query table via Hue - Failure
Ranger Audit Logs
Key Takeaways
• BDaaS is more than Hadoop-as-a-Service
– Includes BI / ETL / Analytics + Data Science tools
• Security is an important consideration in BDaaS
• Data duplication is not an option
• Global user authentication using a centralized DB like LDAP/AD is a must
• Apache Ranger helps in enforcing global policies, provided user identities are propagated correctly
Q & A
www.bluedata.com
Nanda Vijaydev @nandavijaydev
Abhiraj Butala @abhirajbutala