Hadoop Distributed File System for the Grid
1
Roadmap for Applying Hadoop Distributed File System in Scientific Grid Computing
Garhan Attebury1, Andrew Baranovski2, Ken Bloom1, Brian Bockelman1,
Dorian Kcira3, James Letts4, Tanya Levshina2, Carl Lundestedt1, Terrence
Martin4, Will Maier5, Haifeng Pi 4, Abhishek Rana4, Igor Sfiligoi4, Alexander
Sim6, Michael Thomas3, Frank Wuerthwein4
1. University of Nebraska-Lincoln 2. Fermi National Accelerator Laboratory 3. California Institute of Technology 4. University of California, San Diego 5. University of Wisconsin-Madison 6. Lawrence Berkeley National Laboratory
On Behalf of the Open Science Grid (OSG) Storage Hadoop Community
2
Storage, a critical component of Grid
• Grid computing is data-intensive and CPU-intensive, which requires
– Scalable management systems for bookkeeping and discovering data
– Reliable and fast tools for distributing and replicating data
– Efficient procedures for processing and extracting data
– Advanced techniques for analyzing and storing data in parallel
• A scalable, dynamic, efficient, and easy-to-maintain storage system is on the critical path to the success of grid computing
– Meet various data access needs at both the organization and individual level
– Maximize CPU usage and efficiency
– Fit into sophisticated VO policies (e.g., data security, user privileges)
– Survive "unexpected" usage of the storage system
– Minimize the cost of ownership
– Easy to expand, reconfigure, and commission/decommission as requirements change
3
A Case Study, Some Requirements for Storage Element (SE) at Compact Muon Solenoid (CMS)
• Have a credible support model that meets the reliability, availability, and security expectations consistent with the computing infrastructure
• Demonstrate the ability to interface with the existing global data transfer system and the transfer technologies of SRM tools and FTS, as well as the ability to interface with the CMS software locally through ROOT
• Well-defined and reliable behavior for recovery from the failure of any hardware components.
• Well-defined and reliable method of replicating files to protect against the loss of any individual hardware system
• Well-defined and reliable procedure for decommissioning hardware without data loss
• Well-defined and reliable procedure for site operators to regularly check the integrity of all files in the SE
• Well-defined interfaces to monitoring systems
• Capable of delivering at least 1 MB/s per batch slot for CMS applications; capable of writing files from the WAN at a rate of at least 125 MB/s while simultaneously writing data from the local farm at an average rate of 20 MB/s
• Failures of jobs due to failure to open a file or deliver data products from the storage system should occur at a level of less than 1 in 10^5
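As a rough sanity check of these numbers, the aggregate I/O budget implied by the requirements can be computed directly. This is a minimal sketch; the 1000-slot farm size is an assumed example, while the per-slot and WAN/local rates come from the requirements above.

```python
# Rough sizing arithmetic for the SE requirements above.
# BATCH_SLOTS is a hypothetical farm size; the rates are from the slide.

BATCH_SLOTS = 1000        # assumed farm size (not from the slide)
PER_SLOT_MBS = 1.0        # >= 1 MB/s per batch slot (requirement)
WAN_WRITE_MBS = 125.0     # WAN write requirement
LOCAL_WRITE_MBS = 20.0    # simultaneous local-farm write requirement

read_budget = BATCH_SLOTS * PER_SLOT_MBS        # aggregate read rate to jobs
write_budget = WAN_WRITE_MBS + LOCAL_WRITE_MBS  # simultaneous write rate

print(f"aggregate read budget : {read_budget:.0f} MB/s")
print(f"aggregate write budget: {write_budget:.0f} MB/s")
```

Even this simple arithmetic shows why the SE, not the CPUs, is often the bottleneck: a thousand-slot farm already demands on the order of 1 GB/s of aggregate reads.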
4
Hadoop Distributed File System (HDFS)
• Open-source project hosted by Apache (http://hadoop.apache.org) and used by Yahoo! for its search engine, with multiple petabytes of data involved
• Design goals
– Reduce the impact of hardware failure
– Streaming data access
– Handle large datasets
– Simple coherency model
– Portability across heterogeneous platforms
• A scalable distributed cluster file system
– The namespace and image of the whole file system are maintained in a single machine's memory, the NameNode
– Files are split into blocks and stored across the cluster on DataNodes
– File blocks can be replicated; the loss of one DataNode can be recovered from the replica blocks on other DataNodes
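The block-splitting and replication model described above can be sketched in a few lines. This is a toy illustration only: the block size is tiny and the placement is round-robin, whereas real HDFS uses 64 MB blocks (by default) and rack-aware placement.

```python
import itertools

BLOCK_SIZE = 4              # bytes per block (toy value; HDFS default is 64 MB)
REPLICATION = 3             # replicas per block
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, as HDFS does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes=DATANODES, replication=REPLICATION):
    """Toy round-robin placement: each block lands on `replication`
    distinct DataNodes (real HDFS also considers rack topology)."""
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for b in range(num_blocks):
        start = next(ring)
        placement[b] = [nodes[(start + r) % len(nodes)] for r in range(replication)]
    return placement

def survivors_after_loss(placement, lost_node):
    """A block stays readable as long as at least one replica survives."""
    return {b: [n for n in replicas if n != lost_node]
            for b, replicas in placement.items()}

blocks = split_into_blocks(b"hello hdfs world")
placement = place_replicas(len(blocks))
after = survivors_after_loss(placement, "dn2")   # simulate losing one DataNode
assert all(after[b] for b in after)              # every block still has a replica
```

The final assertion is the point of the design: with 3 replicas on distinct nodes, losing any single DataNode leaves every block recoverable, and the NameNode can re-replicate the missing copies in the background.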
5
Important Components of an HDFS-based SE
• FUSE/fuse-dfs
– FUSE is a Linux kernel module that allows file systems to be written in userspace; fuse-dfs uses it to provide a POSIX-like interface to HDFS
– Important for software applications accessing data in the local SE
• Globus GridFTP
– Provides WAN transfer between two SEs, or between an SE and a worker node (WN)
– A special plugin is needed to assemble asynchronously transferred packets for sequential writing to HDFS when multiple streams are used
• BeStMan
– Provides an SRM interface to HDFS
– Plugins can be developed to select GridFTP servers according to the status of the GridFTP servers
A number of software bugs and integration issues have been solved over the last 12 months to bring all the components together and make a production-quality SE
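The reordering job of the GridFTP plugin described above can be sketched as follows. This is a simplified illustration, not the actual plugin: HDFS files are append-only, so chunks arriving out of order over parallel streams must be buffered until a contiguous run can be written. Here `io.BytesIO` stands in for the HDFS output stream.

```python
import io

class SequentialReorderWriter:
    """Buffer asynchronously arriving (offset, data) chunks and flush
    them to an append-only sink in strict offset order -- the role of the
    GridFTP plugin described above (simplified sketch)."""

    def __init__(self, sink):
        self.sink = sink          # append-only file-like object (stand-in for HDFS)
        self.next_offset = 0      # next byte offset we are allowed to write
        self.pending = {}         # out-of-order chunks keyed by their offset

    def receive(self, offset, data):
        self.pending[offset] = data
        # Flush every chunk that is now contiguous with what was written.
        while self.next_offset in self.pending:
            chunk = self.pending.pop(self.next_offset)
            self.sink.write(chunk)
            self.next_offset += len(chunk)

sink = io.BytesIO()
w = SequentialReorderWriter(sink)
# Chunks arrive out of order over multiple parallel streams:
w.receive(6, b"world")      # buffered; offset 0 has not arrived yet
w.receive(0, b"hello ")     # triggers flush of both chunks in order
assert sink.getvalue() == b"hello world"
```

The design trade-off is memory: with many parallel streams, the pending buffer can grow until the earliest missing chunk arrives, which is one reason multi-stream writes into HDFS need care.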
6
HDFS SE Architecture for Scientific Computing
[Architecture diagram] The HDFS-based SE comprises:
• WorkerNodes that double as DataNodes and GridFTP servers, each running FUSE + the Hadoop client
• Dedicated GridFTP nodes (FUSE + Hadoop client)
• Dedicated DataNodes (Hadoop client)
• A NameNode, plus a secondary NameNode
• A BeStMan server (FUSE + Hadoop client)
• GUMS for proxy-to-user mapping
7
HDFS-based SE at CMS Tier-2
• Currently three CMS Tier-2 sites, Nebraska, Caltech, and UCSD, have deployed HDFS-based SEs
– On average 6-12 months of operational experience, with increasing scale in total disk space
– Currently around 100 DataNodes and 300 to 500 TB of disk per site
– Successfully serve the CMS collaboration, with up to thousands of grid users and hundreds of local users accessing datasets in HDFS
– Successfully serve the data operations and Monte Carlo production run by CMS
• Benefits the new SE brings to these sites
– Reliability: no more file loss, thanks to the solid file replication scheme run by HDFS
– Simple deployment: most of the deployment procedure is streamlined, with fewer commands for administrators to run
– Easy operation: a stable system, little effort for system/file recovery, and less than 30 minutes per day for operation and user support
– Proven scalability in supporting a large number of simultaneous read/write operations, with high throughput serving data to grid jobs running at the site
8
Highlight of Operational Performance of HDFS-SE
• Stably delivers ~3 MB/s to applications in the cluster while the cluster is fully loaded with jobs
– Sufficient for CMS applications' I/O requirements, with high CPU efficiency
– CMS applications are IOPS-limited, not bandwidth-limited
• The HDFS NameNode serves 2500 user requests per second
– Sufficient for a cluster with thousands of cores running I/O-intensive jobs
• Sustained WAN transfer rate of 400 MB/s
– Sufficient for CMS Tier-2 data operations (dataset transfers and stage-out of user analysis jobs)
• Simultaneously processes thousands of client requests at BeStMan
– Sustained endpoint processing rate of 50 Hz
– Sufficient for high-rate transfers of gigabyte-sized files and uncontrolled, chaotic user jobs
• Observed extremely low file corruption rate
– A benefit of HDFS's robust and fast file replication
• Decommissioning a DataNode takes < 1 hour; restarting the NameNode takes 1 minute; checking the file system image (from the NameNode's memory) takes 10 seconds
– Fast and efficient for operations
• Survives various stress tests involving HDFS, BeStMan, GridFTP ...
9
Data Transfer to HDFS-SE
10
NameNode Operation Count
11
Processing Rate at SRM endpoint
12
Monitoring and Routine Test
• Integration with general grid monitoring infrastructure
– Nagios, Ganglia, MonALISA
– CPU, memory, and network statistics for the NameNode, DataNodes, and the whole system
• HDFS monitoring
– Hadoop web service, Hadoop Chronicle, JConsole
• Status of the file system and users
– Logs of the NameNode, DataNodes, GridFTP, and BeStMan
• As part of the daily tasks and debugging activities
• Regular low-stress tests performed by the CMS VO
– Test analysis jobs and load tests of file transfer
– Part of the daily commissioning of the site, involving local and remote I/O of the SE
• Intentional failures in various parts of the SE, with demonstrated recovery mechanisms
– Documented recovery procedures
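A routine health probe in the spirit of the Nagios integration above might look like the following sketch. The metric names and thresholds are hypothetical; a real check would pull live numbers from the NameNode or Ganglia rather than take them as arguments.

```python
# Minimal Nagios-style probe sketch for an HDFS-based SE.
# Thresholds (90% live DataNodes, 90% disk usage) are illustrative choices.

OK, WARNING, CRITICAL = 0, 1, 2   # conventional Nagios exit codes

def check_hdfs_health(live_datanodes, total_datanodes, used_pct):
    """Return a (status, message) pair the way a Nagios plugin would."""
    if live_datanodes < total_datanodes * 0.9:
        return CRITICAL, f"only {live_datanodes}/{total_datanodes} DataNodes alive"
    if used_pct > 90.0:
        return WARNING, f"HDFS {used_pct:.0f}% full"
    return OK, f"{live_datanodes}/{total_datanodes} DataNodes alive, {used_pct:.0f}% used"

status, msg = check_hdfs_health(98, 100, 75.0)
assert status == OK
status, msg = check_hdfs_health(80, 100, 75.0)
assert status == CRITICAL
```

Wiring such a function behind a cron job or Nagios service check gives the "less than 30 minutes of daily operation" workflow a concrete shape: the probe pages only when replication can no longer hide further failures.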
13
Load test between two HDFS-SE
14
Data Security and Integrity
• Security concerns
– HDFS
• No encryption or strong authentication between client and server; HDFS must only be exposed to a secure internal network
• In practice, a firewall or NAT is needed to properly isolate HDFS from direct "public" access
• The latest HDFS implements access tokens; a transition to Kerberos-based components is expected in 2010
– Grid components (GridFTP and BeStMan)
• Use standard GSI security with VOMS extensions
• Data integrity and consistency of the file system
– HDFS checksums for blocks of data
– Command-line tools to check blocks, directories, and files
– HDFS keeps multiple journals and file system images
– The NameNode periodically requests the entire block report from all DataNodes
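The per-block checksum idea can be illustrated with a small sketch: a CRC32 per fixed-size block, loosely analogous to HDFS's per-chunk CRC checksums (HDFS actually checksums every 512 bytes by default; the tiny block size here is for illustration only).

```python
import zlib

BLOCK_SIZE = 8  # toy value; HDFS checksums 512-byte chunks by default

def block_checksums(data, block_size=BLOCK_SIZE):
    """Compute a CRC32 per block, analogous to HDFS's per-chunk checksums."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def find_corrupt_blocks(data, expected):
    """Return indices of blocks whose checksum no longer matches."""
    return [i for i, (got, want) in enumerate(zip(block_checksums(data), expected))
            if got != want]

original = b"a file stored in HDFS, checksummed per block"
stored_checksums = block_checksums(original)   # computed at write time

corrupted = bytearray(original)
corrupted[10] ^= 0xFF                          # silently flip bits in block 1
bad = find_corrupt_blocks(bytes(corrupted), stored_checksums)
assert bad == [1]                              # only block 1 is flagged
```

This localization is what makes recovery cheap: when a read detects a checksum mismatch, only the affected block needs to be re-fetched from another replica, not the whole file.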
15
A Combined Release Infrastructure at OSG and CMS
• Various upstream open-source projects provide all the necessary packages
– HDFS, FUSE, BeStMan, GridFTP plugins, BeStMan plugins ...
• All software components needed for deploying the Hadoop-based SE are packaged as RPMs
– With add-on configuration and scripts necessary for a site to install with minimal changes according to site conditions and requirements
• Consistency checks and validation are done at selected sites with HDFS-SE experts before the formal release via OSG
– A testbed for common platforms and scalability tests
• Development in 2010
– Release procedure to be fully integrated into the standard OSG distribution: the Virtual Data Toolkit (VDT)
– Possible intersection with external commercial packagers, e.g., using selected RPMs from Cloudera
16
Site Specific Optimization
Various optimizations can be done for each site based on usage patterns and local hardware conditions
• Block size for files
• Number of file replicas
• Architecture of GridFTP server deployment
– A few high-performance GridFTP servers vs. many GridFTP servers running on the WorkerNodes
• Memory allocation at the WorkerNode (WN) for GridFTP, applications ...
• Selection of GridFTP servers
– Real-time-monitoring-based GridFTP selection, based on CPU and memory usage, vs. randomly picking an alive GridFTP server
• Data access with MapReduce
– A special case for data processing
• Rack awareness
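The trade-off between the two GridFTP selection policies above can be sketched as follows. The server names and load figures are hypothetical, and a real plugin would read them from live monitoring rather than a static dictionary.

```python
import random

# Hypothetical monitoring snapshot: per-server CPU and memory load in [0, 1].
servers = {
    "gridftp01": {"cpu": 0.90, "mem": 0.70, "alive": True},
    "gridftp02": {"cpu": 0.20, "mem": 0.30, "alive": True},
    "gridftp03": {"cpu": 0.50, "mem": 0.40, "alive": False},  # currently down
}

def pick_random_alive(servers, rng=random):
    """Baseline policy: randomly pick any alive GridFTP server."""
    alive = [name for name, s in servers.items() if s["alive"]]
    return rng.choice(alive)

def pick_least_loaded(servers):
    """Monitoring-based policy: pick the alive server with the lowest
    combined CPU + memory load."""
    alive = {name: s for name, s in servers.items() if s["alive"]}
    return min(alive, key=lambda n: alive[n]["cpu"] + alive[n]["mem"])

assert pick_least_loaded(servers) == "gridftp02"        # least loaded alive server
assert pick_random_alive(servers) in ("gridftp01", "gridftp02")
```

Random selection is simpler and stateless, while load-based selection avoids piling new transfers onto an already saturated server; which wins depends on how uniform the hardware and transfer mix at the site are.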
17
Summary of Our Experience
• A Hadoop-based storage solution is established and functioning at the CMS Tier-2 level, as an example of data- and CPU-intensive HPC
– Flexible architecture involving various grid components
– Scalable and stable
– Seamlessly interfaced with various grid middleware
• Lower costs in deployment, maintenance, and required hardware
– Significantly reduced manpower and increased QoS
– Easy to adapt to existing/new hardware and changing requirements
– Standard release for the whole community
– Experts available to help solve technical problems
• VOs and grid sites benefit from HDFS's reliable file replication and distribution scheme
– High data security and integrity
– Excellent I/O performance for CPU- and data-intensive grid applications
– Less administrator intervention
HDFS is shown to be seamlessly integrated into a grid storage solution for a Virtual Organization (VO) or grid site
18
Roadmap for the Near Future
• Deployment in a variety of scientific computing projects, experiments, and institutions
– As an integrated storage element solution
– As a storage file system
• Benchmark performance for HPC with data- and CPU-intensive grid computing
– Scalability, stability, usability
– Integration and efficiency with other tools
• Organization
– Seamless integration between the scientific user community and the HDFS development community
– Consolidation of scientific releases and technical support
• New development and contributions from the scientific community
– Funding proposals based on HDFS infrastructure and technology
– Improvement in I/O capacity and full integration as a critical component of the Storage Element
– Operational optimization with different scales of data and compute infrastructure