Hadoop AWS infrastructure cost evaluation

Post on 04-Dec-2014






How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given some data volume estimates and a rough use case? This presentation compares the different options available on AWS.


1. Hadoop Platform Infrastructure Cost Evaluation

2. Agenda
- High-level requirements
- Cloud architecture
- Major architecture components: Amazon AWS, Hadoop distributions
- Capacity planning: Amazon AWS EMR, Hadoop distributions
- On-premise hardware costs
- Gotchas

3. High-Level Requirements
- Build an analytical & BI platform for web log analytics
- Ingest multiple data sources: log data, internal user data
- Apply complex business rules: manage events, filter crawler-driven logs, apply industry- and domain-specific rules
- Populate/export to a BI tool for visualization

4. Non-Functional Requirements
- Today's baseline: ~42 TB per year (~3.5 TB of raw data per month), stored for 3 years
- SLA: should process data every day (currently done once a month)
- Predefined processing via Hive; no exploratory analysis
- Everything in the cloud: storage (HDFS), compute (M/R), analysis (BI tool)

5. Non-Functional Requirements [2]
- Seed 3 years' worth of data in S3
- Add monthly net-new data only
- Speed not of primary importance

6. Data Estimates for Capacity Planning
- Cleaned-up log data per year: 42 TB (3 years = 126 TB)
- Total disk space required should account for:
  - compression (LZO, ~40%*), which reduces the disk space required to ~25 TB
  - a replication factor of 3: ~75 TB
  - a 75% maximum disk utilization in Hadoop: ~100 TB
- Total disk capacity required for data nodes: ~100 TB/year (17.5 TB/mo)
(*disclaimer: depends on codec and data input)

7. Data Estimates for Capacity Planning: Reduced Logs

Data volume | Log data (TB) | After compression (Gzip, 40%) | Replicated on 3 nodes | At 70% max disk utilization
1 month     | 3.6           | 2.16                          | 6.5                   | 9.2
1 year      | 42            | 25                            | 75                    | 107
3 years     | 126           | 75.6                          | 226                   | 322

Total disk capacity required for data nodes: ~10 TB/month

8. Cloud Solution Architecture
1. Web servers copy client logs to S3 (with metadata extraction)
2. Data is exported from S3 to HDFS on Amazon AWS
3. Processed in M/R into Hive tables
4. Results displayed to the user in the BI tool
5. Results retained in S3
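The capacity-planning arithmetic on the two slides above can be sketched as a small script. The ratios (compression leaving ~60% of the original size, replication factor 3, a 70–75% disk-utilization cap) come from the slides; the function name is mine.

```python
def hdfs_capacity_tb(raw_tb, compression_ratio=0.6, replication=3, max_utilization=0.75):
    """Disk capacity (TB) needed on data nodes for a given raw data volume.

    compression_ratio: fraction of the original size left after compression
                       (the slides assume LZO/Gzip saves ~40%, so 0.6 remains)
    replication:       HDFS replication factor (3 on the slides)
    max_utilization:   fill Hadoop disks at most this full (70-75% on the slides)
    """
    compressed = raw_tb * compression_ratio
    replicated = compressed * replication
    return replicated / max_utilization

# One year of logs, 42 TB raw, lands near the slides' ~100 TB figure:
print(round(hdfs_capacity_tb(42)))
```

Note that each factor multiplies through, which is why the raw 42 TB roughly 2.4×'s into ~100 TB of provisioned disk.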
9. Hadoop on AWS: EC2
- Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud
- Manual set-up of Hadoop on EC2
- Use EBS for storage capacity (HDFS)
- Storage on S3

10. Running Hadoop on AWS: EC2
EC2 instance options:
- Choose the instance type
- Choose the instance availability type
- Choose the instance family
- Choose where the data resides:
  - S3: high latency, but highly available
  - EBS: permanent storage? Snapshots to S3?
- Apache Whirr for set-up

11. Amazon EC2 Instance Features
Other choices:
- EBS-optimized instances: dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1000 Mbps depending on the instance type used
- Inter-region data transfer
- Dedicated instances: run on single-tenant hardware dedicated to a single customer
- Spot instances: name your price

12. Amazon Instance Families
Amazon EC2 instances are grouped into six families:
- General purpose: memory-to-CPU ratios suitable for most general-purpose apps
- Memory optimized: larger memory sizes for high-throughput applications
- Compute optimized: proportionally more CPU resources than memory (RAM); well suited for compute-intensive applications
- Storage optimized: very high random I/O performance, or very high storage density, low storage cost, and high sequential I/O performance (data nodes)
- Micro: a small amount of CPU with the ability to burst to higher amounts for brief periods
- GPU: for dynamic applications

13. Amazon Instance Availability Types
- On-demand instances let you pay for compute capacity by the hour with no long-term commitments. This frees you from the costs and complexities of planning, purchasing, and maintaining hardware.
- Reserved instances give you the option to make a one-time payment for each instance you want to reserve and in turn receive a discount on the hourly charge for that instance. Three types (Light, Medium, and Heavy Utilization) let you balance the amount you pay upfront against your effective hourly price.
- Spot instances let you bid on unused Amazon EC2 capacity and run those instances for as long as your bid exceeds the current spot price, which changes periodically based on supply and demand. If you have flexibility in when your applications can run, spot instances can significantly lower your Amazon EC2 costs.

14. Amazon EC2 Storage (figure)

15. Amazon EC2 Instance Types: data nodes, BI instances, master nodes (figure)

16. Systems Architecture: EC2
- AWS Hadoop cluster (NN, SNN, DNs, edge node) with HDFS on EBS drives; client logs in S3; permanent BI node
- The BI Hadoop cluster is initiated when analytics is run
- Data is streamed from S3 to EBS volumes
- Results from analytics are stored back to S3 once computed

17. Hadoop on AWS: EC2
Probably not the best choice:
- EBS volumes make the solution costly
- If using instance storage instead, the EC2 instance choices are either too small (a few gigs) or too big (48 TB per instance)
- We don't need the flexibility; we just want to use Hive

18. Hadoop on AWS: EMR
Amazon Elastic MapReduce (EMR) is a web service that provides a hosted Hadoop framework running on EC2 and Amazon Simple Storage Service (S3).

19. Running Hadoop on AWS: EMR
- For occasional jobs: ephemeral clusters
- Ease of use, but ~20% costlier
- Data stored in S3; highly tuned for S3 storage
- Hive and Pig available
- Only pay for S3 plus instance time while jobs are running
- Or: leave it always on
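A quick way to compare the purchase options above is the effective price per used hour: the amortized upfront fee plus the discounted hourly rate. The formula is generic; the sample prices below are illustrative placeholders, not real AWS rates.

```python
def effective_hourly(upfront, hourly_rate, hours_used):
    """Effective cost per used hour: amortized upfront fee plus the hourly rate."""
    return upfront / hours_used + hourly_rate

HOURS_PER_YEAR = 365 * 24

# Hypothetical prices: on-demand at $0.48/h versus a 1-year reservation
# costing $1,000 upfront plus $0.16/h while running.
always_on_od = effective_hourly(0, 0.48, HOURS_PER_YEAR)     # $0.48/h
always_on_ri = effective_hourly(1000, 0.16, HOURS_PER_YEAR)  # ~$0.27/h
weekly_8h_ri = effective_hourly(1000, 0.16, 52 * 8)          # ~$2.56/h
```

The comparison shows why the deck's later slides pair "always on" with reserved instances but ephemeral 8-hours-a-week clusters with on-demand or spot: a reservation only pays off if the instance runs enough hours to amortize the upfront fee.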
20. Hadoop on AWS: EMR
- EC2 instances with Amazon's own flavor of Hadoop
- Amazon's Apache Hadoop is version 1.0.3; you can also choose MapR M3 or M5 (0.20.205)
- You can run Hive (0.7.1 or 0.8.1), custom JARs, streaming, Pig, or HBase

21. Systems Architecture: EMR
- AWS Hadoop EMR cluster (NN, SNN, DNs); HDFS seeded from S3; client logs in S3; permanent BI instance
- The BI Hadoop cluster is created elastically
- Data is streamed from S3 to initialize the Hadoop cluster dynamically
- Results from analytics are stored back to S3 once computed

22. Amazon EMR Instance Types: data nodes, BI instances, master nodes (figure)

23. AWS Calculator: EMR Calculation
Calculate and add:
- S3 cost (seeded data)
- incremental S3 cost, per month
- EC2 cost
- EMR cost
- in/out data-transfer cost
- Amazon support cost
- infrastructure support engineer cost

24.-26. AWS Calculator: EMR Calculation
Calculator screenshots for running 24 hrs/day: EMR cost, 3-year S3 cost, and 3-year EC2 cost (figures)

27. Amazon EMR Pricing (Reduced Log Volume)
Instance types: 10 data nodes (m1.xlarge), NN (m2.2xlarge), BI (m2.2xlarge), load balancer (t1.micro); 1-year reserved, 10 EMR instances (subject to change depending on actual load)

Data volume            | 24 hours/day             | 8 hours/day             | 8 hours/week
1 year, 42 TB on S3    | $14.1k/mo × 12 = $169.2k | $8.9k × 12 = $106k      | $6.6k × 12 = $79.2k
3 years, 126 TB on S3  | $19.5k × 36 mos = $684k  | $15.5k × 36 mos = $558k | $13.2k × 36 mos = $475k

28. Hadoop on AWS: Trade-offs

Feature         | EC2                                                         | EMR
Ease of use     | Hard; IT ops costs                                          | Easy; Hadoop clusters can be of any size; can have multiple clusters
Cost            | Cheaper                                                     | Costlier: pay for EC2 + EMR
Flexibility     | Better: access to the full Hadoop ecosystem stack           | On-demand Hadoop cluster: Hadoop installed, but with limited options
Portability     | Easier to move to dedicated hardware                        | —
Speed           | Faster                                                      | Lower performance: all data is streamed from S3 for each job
Maintainability | Can choose any vendor; can be updated to the latest version | Debugging is tricky: the cluster is terminated, so no logs

29. EC2 Pricing Gotchas
- EMR with spot instances seems to be the trend for minimal cost, if SLA timeliness is not of primary importance
- Use reserved instances to bring cost down drastically (~60%)
- Compression on S3?
- Need to account for a secondary NameNode?
- AWS's AMI task configuration helps estimate how many EMR nodes are needed

30. EMR Technical Gotchas
- Transferring data between S3 and EMR clusters is very fast (and free), as long as your S3 bucket and Hadoop cluster are in the same Amazon region
- The EMR S3 file system streams data directly to S3 instead of buffering to intermediate local files
- The EMR S3 file system adds multipart upload, which splits your writes into smaller chunks and uploads them in parallel
- Store fewer, larger files instead of many smaller ones
http://blog.mortardata.com/post/58920122308/s3-hadoop-performance

31. In-House Hadoop Cluster
- Data volume (3 years): 126 TB; storage for data nodes: 6 × (12 × 2 TB)
- Instances: 10 data nodes, 3 masters; plus 4 BI nodes ($43k)
- Price, first year: $10.6k × 6 DN + $7.3k × 3 = $128k, + vendor support ($50k) + a full-time person ($150k) = $328k
- Hardware: Dell PowerEdge R720; E5-2640 2.50 GHz processor (8 cores, 12M cache, Turbo); 64 GB memory (quad-ranked RDIMM for 2 processors, low-volt); 12 × 2 TB 7.2K RPM SATA 3.5" hot-plug hard drives; Intel 82599 dual-port 10GbE mezzanine card

32. Licensing and Support Costs

33. Hadoop Distributions: Cloudera or Hortonworks
Enterprise 24×7 production support, with phone and support-portal access (support datasheet attached): minimum $50k
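The "AWS calculator" checklist above is ultimately just a sum of monthly line items. A sketch of that roll-up is below; every figure is a made-up placeholder, not an actual AWS rate, so only the structure matters.

```python
# Hypothetical monthly line items in USD -- placeholders, not real AWS rates.
monthly_costs = {
    "s3_seeded_data":  3200.0,   # storing the 3-year seed data in S3
    "s3_incremental":   180.0,   # net-new data added each month
    "ec2_instances":   6800.0,   # cluster instance-hours
    "emr_surcharge":   1700.0,   # EMR premium on top of EC2
    "data_transfer":    400.0,   # in/out transfer of data
    "aws_support":      800.0,   # Amazon premium support
    "ops_engineer":   12500.0,   # infrastructure support engineer, prorated
}

monthly_total = sum(monthly_costs.values())
annual_total = monthly_total * 12
print(f"monthly: ${monthly_total:,.0f}  annual: ${annual_total:,.0f}")
```

Keeping the items in a dict makes it easy to re-run the comparison per scenario (24 hrs/day vs. 8 hrs/week) by swapping in different instance-hour figures.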
34. Amazon Support: EC2 & EMR

Business:
- Response time: 1 hour; access: phone, chat, and email, 24/7
- Cost: the greater of $100, or 10% of monthly AWS usage for the first $0–$10K, 7% of usage from $10K–$80K, 5% from $80K–$250K, and 3% above $250K (about $800/yr)

Enterprise:
- Response time: 15 minutes; access: phone, chat, TAM, and email, 24/7
- Cost: the greater of $15,000, or 10% of monthly AWS usage for the first $0–$150K, 7% of usage from $150K–$500K, 5% from $500K–$1M, and 3% above $1M

http://aws.amazon.com/premiumsupport/

35. Thank You
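The Business-tier fee above is a marginal (tiered) percentage: each slice of monthly usage is billed at its own rate, subject to the $100 minimum. A sketch of that computation, using the boundaries from the slide:

```python
# Business-tier boundaries and marginal rates, as listed on the slide.
BUSINESS_TIERS = [(10_000, 0.10), (80_000, 0.07), (250_000, 0.05), (float("inf"), 0.03)]

def business_support_fee(monthly_usage, minimum=100.0):
    """AWS Business support fee: each slice of usage is billed at its
    tier's rate, and the result is floored at the $100 monthly minimum."""
    fee, lower = 0.0, 0.0
    for upper, rate in BUSINESS_TIERS:
        if monthly_usage > lower:
            fee += (min(monthly_usage, upper) - lower) * rate
        lower = upper
    return max(fee, minimum)

# For the ~$14.1k/month 24x7 EMR estimate from the pricing slide:
# 10% of the first $10k plus 7% of the remaining $4.1k, roughly $1,287/month.
fee = business_support_fee(14_100)
```

The Enterprise tier follows the same shape with a $15,000 minimum and the $150K/$500K/$1M boundaries, so the same function works with a different tier table.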

