AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Post on 08-Sep-2014

TRANSCRIPT

  • AWS Summit 2013 Tel Aviv, Oct 16, Tel Aviv, Israel. Data Warehouse on AWS. Guy Ernest, Solutions Architecture, Amazon Web Services
  • ERP CRM ANALYST DATAWAREHOUSE DB
  • OLTP ERP OLAP OLTP CRM ANALYST DATAWAREHOUSE OLTP DB
  • Transactional Processing vs. Analytical Processing: transactional context vs. global context; latency vs. throughput; indexed access vs. full table scans; random IO vs. sequential IO; disk seek times vs. disk transfer rate
  • OLTP OLAP
  • BUSINESS INTELLIGENCE REPORTS, DASHBOARD, PRODUCTION OFFLOAD DIFFERENT DATA STRUCTURE, USING ETLs, ANALYST DATAWAREHOUSE
  • BIG ENTERPRISES: VERY EXPENSIVE (ROI), DIFFICULT TO MAINTAIN, NOT SCALABLE
  • BIG ENTERPRISES: VERY EXPENSIVE (ROI), DIFFICULT TO MAINTAIN, NOT SCALABLE. SME: WAY TOO EXPENSIVE!
  • Jeff Bezos
  • Data Sources Value Queries
  • + ELASTIC CAPACITY + NO CAPEX + PAY FOR WHAT YOU USE + DISPOSE ON DEMAND = NO CONSTRAINTS
  • ACCELERATION COLLECT STORE ANALYZE SHARE AMAZON REDSHIFT
  • AMAZON REDSHIFT
  • AMAZON REDSHIFT: a DWH that scales to petabytes and is WAY FASTER, WAY SIMPLER, WAY LESS EXPENSIVE
  • AMAZON REDSHIFT RUNNING ON OPTIMIZED HARDWARE. HS1.8XL: 128 GB RAM, 16 Cores, 16 TB Compressed Data, 2 GB/sec Disk Scan. HS1.XL: 16 GB RAM, 2 Cores, 2 TB Compressed Data
  • Extra Large Node (HS1.XL): Single Node (2 TB) or Cluster of 2-32 Nodes (4 TB to 64 TB). Eight Extra Large Node (HS1.8XL): Cluster of 2-100 Nodes (32 TB to 1.6 PB)
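The capacity ranges on this slide follow from multiplying the per-node compressed storage by the node count. A minimal sketch of that arithmetic, using only the figures quoted on the slides; the function and dictionary names are illustrative and not part of any AWS API:

```python
# Per-node compressed storage and allowed node counts, as quoted on the slides.
# The names below are hypothetical helpers for illustration only.
NODE_STORAGE_TB = {"HS1.XL": 2, "HS1.8XL": 16}
NODE_COUNT_RANGE = {"HS1.XL": (1, 32), "HS1.8XL": (2, 100)}


def cluster_capacity_tb(node_type: str, node_count: int) -> int:
    """Return total compressed capacity in TB for a cluster of one node type."""
    low, high = NODE_COUNT_RANGE[node_type]
    if not low <= node_count <= high:
        raise ValueError(f"{node_type} supports {low}-{high} nodes, got {node_count}")
    return NODE_STORAGE_TB[node_type] * node_count


if __name__ == "__main__":
    for node_type, count in [("HS1.XL", 2), ("HS1.XL", 32), ("HS1.8XL", 2), ("HS1.8XL", 100)]:
        print(node_type, count, "nodes:", cluster_capacity_tb(node_type, count), "TB")
```

The largest HS1.8XL cluster comes out at 1,600 TB, which is the 1.6 PB quoted on the slide.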
  • JDBC/ODBC 10 GigE (HPC) Ingestion Backup Restoration
  • WAY SIMPLER
  • LOADING DATA: parallel loading; data sorted and distributed automatically; linear growth
  • DATA SNAPSHOTS: automatic and incremental snapshots in Amazon S3; configurable retention period; manual snapshots; streaming restore
  • REPLICATION IN CLUSTER + AUTOMATIC SNAPSHOT IN AMAZON S3 + MONITORING OF CLUSTER NODES
  • AUTOMATIC RESIZING
  • Read-only mode while resizing; parallel node-to-node data copy; new cluster is created in the background; only charged for a single cluster
  • Automatic DNS-based endpoint cut-over; deletion of source cluster
  • CREATE A DATAWAREHOUSE IN MINUTES
  • WAY FASTER
  • MEMORY CAPACITY AND CPU PERFORMANCE DOUBLE EVERY 2 YEARS; DISK PERFORMANCE DOUBLES EVERY 10 YEARS
  • Progress is not evenly distributed (1980 vs. today):
      Cost per TB: $14,000,000/TB vs. $30/TB (450,000x)
      Disk capacity: 100 MB vs. 3 TB (30,000x)
      Transfer rate: 4 MB/s vs. 200 MB/s (50x)
  • I/O IS THE MAIN FACTOR FOR PERFORMANCE
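The disk figures quoted on the previous slide show why I/O dominates: capacity has grown far faster than transfer rate, so reading a full disk takes much longer than it used to. A rough sketch of that arithmetic, assuming decimal units (1 MB = 10^6 bytes) and using hypothetical helper names:

```python
# Figures quoted on the "Progress is not evenly distributed" slide.
# Decimal units are an assumption; the slide does not say which it uses.
MB = 10**6
TB = 10**12

disk_1980 = {"cost_per_tb": 14_000_000, "capacity": 100 * MB, "rate": 4 * MB}
disk_today = {"cost_per_tb": 30, "capacity": 3 * TB, "rate": 200 * MB}


def improvement(old: dict, new: dict, key: str) -> float:
    """Factor by which `key` improved (for cost, lower is better)."""
    if key == "cost_per_tb":
        return old[key] / new[key]
    return new[key] / old[key]


def full_scan_seconds(disk: dict) -> float:
    """Time to read the whole disk sequentially at its transfer rate."""
    return disk["capacity"] / disk["rate"]


if __name__ == "__main__":
    for key in ("cost_per_tb", "capacity", "rate"):
        print(key, "improved by a factor of", round(improvement(disk_1980, disk_today, key)))
    print("full scan, 1980 disk:", full_scan_seconds(disk_1980), "seconds")
    print("full scan, today's disk:", full_scan_seconds(disk_today), "seconds")
```

With these inputs the cost ratio is about 467,000, which the slide rounds to 450,000. A full sequential scan goes from 25 seconds to 15,000 seconds (a little over 4 hours), because capacity grew 600 times faster than transfer rate.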
  • COLUMNAR STORAGE; COMPRESSION PER COLUMN; ZONE MAPS; HARDWARE OPTIMIZED; LARGE DATA BLOCK SIZE. Example table:
      Id   Age  State
      123  20   CA
      345  25   WA
      678  40   FL
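To illustrate the general idea behind zone maps on columnar data, the sketch below stores one column in fixed-size blocks, keeps the minimum and maximum value of each block, and skips blocks whose range cannot contain the value being searched for. This is a toy model under stated assumptions, not a description of Redshift's internals; the class and function names, the block size, and the ages beyond the three shown on the slide are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One block of a single column plus its zone map (min and max value)."""
    values: List[int]
    lo: int
    hi: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into blocks and record each block's min/max."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_equal(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return matching values and how many blocks actually had to be read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        if target < block.lo or target > block.hi:
            continue  # zone map proves the block cannot contain the target
        blocks_read += 1
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read


if __name__ == "__main__":
    # The first three ages come from the slide; the rest are illustrative.
    ages = [20, 25, 40, 41, 45, 52]
    blocks = build_blocks(ages, block_size=3)
    print(scan_equal(blocks, 25))
```

Skipping works best when the column is sorted or nearly sorted, since block ranges then overlap little. A query that touches only the Age column also never reads the Id or State columns at all, which is the basic benefit of columnar storage.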
  • TEST: 2 BILLION RECORDS, 6 REPRESENTATIVE QUERIES
  • AMAZON REDSHIFT 2xHS1.8XL Vs. 32 NODES, 4.2TB RAM, 1.6PB
  • 12x - 150x FASTER
  • 30 MINUTES vs. 12 SECONDS
  • WAY LESS EXPENSIVE
  • 2x HS1.8XL: $3.65 / HOUR, $32,000 / YEAR
  • Instance HS1.XL pricing:
      Pricing model         Per hour   Hourly price per TB   Yearly price per TB
      On-Demand             $0.850     $0.425                $3,723
      1 Year Reservation    $0.500     $0.250                $2,190
      3 Years Reservation   $0.228     $0.114                $999
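The per-TB figures on this slide can be reproduced from the hourly price, the 2 TB of compressed storage per HS1.XL node quoted earlier, and 8,760 hours per year. A small sketch of that calculation; the prices are the 2013 figures from the slides, and the function names are illustrative:

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
HS1_XL_STORAGE_TB = 2       # compressed data per HS1.XL node, per the slides

# Hourly prices in USD as quoted on the slide (2013 figures).
HS1_XL_HOURLY_USD = {
    "on_demand": 0.850,
    "reserved_1_year": 0.500,
    "reserved_3_years": 0.228,
}


def hourly_price_per_tb(hourly_usd: float) -> float:
    """Hourly node price divided by the node's storage in TB."""
    return hourly_usd / HS1_XL_STORAGE_TB


def yearly_price_per_tb(hourly_usd: float) -> float:
    """Price per TB for a node running every hour of the year."""
    return hourly_price_per_tb(hourly_usd) * HOURS_PER_YEAR


if __name__ == "__main__":
    for model, price in HS1_XL_HOURLY_USD.items():
        print(model, round(hourly_price_per_tb(price), 3), round(yearly_price_per_tb(price)))
    # The two-node HS1.8XL cluster from the previous slide, at $3.65 per hour.
    print("2x HS1.8XL per year:", round(3.65 * HOURS_PER_YEAR))
```

The same multiplication gives roughly $31,974 per year for the $3.65 per hour cluster, which the previous slide rounds to $32,000.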
  • October, 2013 Intel Analytics on AWS Assaf Araki Intel Confidential
  • Agenda: Advanced Analytics @ Intel; Enterprise on the Cloud; Use Case
  • Intel AA Team (Advanced Analytics). Vision: Make analytics a competitive advantage for Intel. Mission: Solve strategic, high-value business line problems; leverage analytics to grow Intel revenue. About the team: ~100 employees, corporate ownership of advanced analytics; Big Data and Machine Learning are key focus areas; Skills: Software Engineering / Decision Science / Business Acumen; Value driven: ROI > $10M and/or key corporate problem as defined by VPs; part of the Israel Academy Computational research center
  • AA Overview: Big Data Analytics Platform. Highly scalable, hybrid platform to support a range of business use cases. Prediction Module; MPP; High Speed Data Loader. Heterogeneous data, batch oriented on advanced analytics. Rich advanced analytics and realtime, in-database data mining capabilities
  • Enterprise On the Cloud: Why Cloud? Known reasons: reduce cost; universal access; scale fast. Additional reasons: Flexible & agile platform: no need for the engineering team to certify each tool. Development accelerator: the R&D team can start developing while engineering teams implement the platform on premise
  • Use Case. Characteristics: CPU behavior data. Size: 30TB of data per month. Type: structured data. Processing: create aggregation facts and grant ad hoc analysis; create ML solutions. Current status: data is sampled and processed on an SMP RDBMS; it takes almost 24 hours to process the entire data. Problem statement: limited ability to analyze all data
  • Enterprise On the Cloud: Platforms. On premise: HBase: Hadoop platform exists, no HBase; MPP DB: exists with Machine Learning capabilities; lower cost platform: evaluate and purchase. Cloud: HBase: EMR; MPP DB: AWS Redshift. Go for POC on the Cloud
  • Enterprise On the Cloud: Evaluation Criteria. Capabilities: create statistics calculations. Cost of HW per TB: replication, compression. Performance: load, transformation, querying. Scalability. Ability to execute
  • Use Case: Preliminary Results.
      Dataset example: 34GB compressed data divided into files; ~1,500,000,000 records; 24B compressed, 240B per record (~15 columns)
      Performance & Scalability (8 x 1XL nodes):
        Load time for 32 files: 2 hours (4 files: 5 hours)
        Table size: 202GB (compression rate ~1.5:1)
        SQL aggregation statements:
          38K records: 6 minutes
          14M records: 7 minutes
          66M records: 11 minutes (on 4 x 1XL: 22 minutes)
          939M records: 34 minutes (on 4 x 1XL: 77 minutes)
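The timings on this slide can be turned into speedup figures to see how close the cluster came to linear scaling when going from 4 to 8 nodes, and how much splitting the input into more files helped loading. A minimal sketch using only the numbers quoted above; the helper names are illustrative:

```python
def speedup(slow: float, fast: float) -> float:
    """How many times faster the second run was (same units for both)."""
    return slow / fast


def scaling_efficiency(slow: float, fast: float, resource_ratio: float) -> float:
    """Speedup divided by the increase in resources; 1.0 means linear scaling."""
    return speedup(slow, fast) / resource_ratio


if __name__ == "__main__":
    # Query times in minutes on 4 x 1XL vs. 8 x 1XL nodes, from the slide.
    for label, four_nodes, eight_nodes in [("66M records", 22, 11), ("939M records", 77, 34)]:
        print(label,
              "speedup:", round(speedup(four_nodes, eight_nodes), 2),
              "efficiency:", round(scaling_efficiency(four_nodes, eight_nodes, 2), 2))
    # Load time in hours with 4 files vs. 32 files on the same 8-node cluster.
    print("load speedup from 4 to 32 files:", speedup(5, 2))
```

Doubling the nodes halves the 66M-record query time exactly, and the 939M-record query improves by slightly more than 2x. Splitting the load into 32 files instead of 4 cuts load time by 2.5x, consistent with the parallel loading described earlier in the deck.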
  • Use Case: Capabilities and Cost. No current ability to write code (Java/C++/Python/R): implement statistics and algorithms in SQL. Compression is not straightforward. Cost is sensitive to the actual compression: 2.6:1 is break even (8XL vs. High Storage instance with 16 cores and 48TB, 3 years with 100% utilization)
  • assaf.araki@intel.com
  • Thank You!
  • USE CASE
  • AMAZON EC2 AMAZON DYNAMODB AMAZON RDS AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON S3 DATA CENTER AWS STORAGE GATEWAY
  • UPLOAD TO AMAZON S3 AWS IMPORT/EXPORT AWS DIRECT CONNECT DATA INTEGRATION INTEGRATION SYSTEMS
  • MEMBER REGISTRATION 15 million 2 million 2011 2012 2013
  • 1,5