AWS Summit Tel Aviv - Enterprise Track - Data Warehouse


Posted on 08-Sep-2014









<ul>
<li> AWS Summit 2013 Tel Aviv, Oct 16, Tel Aviv, Israel. Data Warehouse on AWS. Guy Ernest, Solutions Architecture, Amazon Web Services </li>
<li> ERP, CRM and DB systems feed the DATA WAREHOUSE used by the ANALYST </li>
<li> OLTP systems (ERP, CRM, DB) feed the OLAP DATA WAREHOUSE used by the ANALYST </li>
<li> Transactional Processing vs. Analytical Processing: transactional context vs. global context; latency vs. throughput; indexed access vs. full table scans; random IO vs. sequential IO; disk seek times vs. disk transfer rate </li>
<li> OLTP vs. OLAP </li>
<li> BUSINESS INTELLIGENCE: reports and dashboards, production offload, a different data structure, fed via ETLs into the analyst's DATA WAREHOUSE </li>
<li> BIG ENTERPRISES: very expensive (ROI), difficult to maintain, not scalable </li>
<li> BIG ENTERPRISES and SMEs: very expensive (ROI), difficult to maintain, not scalable. Way too expensive! </li>
<li> Jeff Bezos </li>
<li> Data Sources, Queries, Value </li>
<li> + ELASTIC CAPACITY + NO CAPEX + PAY FOR WHAT YOU USE + DISPOSE ON DEMAND = NO CONSTRAINTS </li>
<li> ACCELERATION: COLLECT, STORE, ANALYZE, SHARE with AMAZON REDSHIFT </li>
<li> AMAZON REDSHIFT </li>
<li> AMAZON REDSHIFT: a DWH that scales to petabytes and is WAY SIMPLER, WAY FASTER, WAY LESS EXPENSIVE </li>
<li> AMAZON REDSHIFT RUNNING ON OPTIMIZED HARDWARE. HS1.8XL: 128 GB RAM, 16 cores, 16 TB compressed data, 2 GB/sec disk scan. HS1.XL: 16 GB RAM, 2 cores, 2 TB compressed data </li>
<li> Extra Large Node (HS1.XL): single node (2 TB) or cluster of 2-32 nodes (4 TB to 64 TB). Eight Extra Large Node (HS1.8XL): cluster of 2-100 nodes (32 TB to 1.6 PB) </li>
<li> JDBC/ODBC access; 10 GigE (HPC) network for ingestion, backup and restoration </li>
<li> WAY SIMPLER </li>
<li> LOADING DATA: parallel loading; data sorted and distributed automatically; linear growth </li>
<li> DATA SNAPSHOTS: automatic and incremental snapshots in Amazon S3; configurable retention period; manual snapshots; streaming restore </li>
<li> REPLICATION IN CLUSTER + AUTOMATIC SNAPSHOT IN AMAZON S3 + MONITORING OF CLUSTER NODES </li>
<li> AUTOMATIC RESIZING </li>
<li> Read-only mode while resizing; parallel node-to-node data copy; the new cluster is created in the background; you are only charged for a single cluster </li>
<li> Automatic DNS-based endpoint cut-over; deletion of the source cluster </li>
<li> CREATE A DATA WAREHOUSE IN MINUTES </li>
<li> WAY FASTER </li>
<li> MEMORY CAPACITY AND CPU PERFORMANCE DOUBLE EVERY 2 YEARS; DISK PERFORMANCE DOUBLES EVERY 10 YEARS </li>
<li> Progress is not evenly distributed. 1980: $14,000,000/TB, 100 MB capacity, 4 MB/s transfer. Today: $30/TB (450,000x cheaper), 3 TB (30,000x capacity), 200 MB/s (50x faster) </li>
<li> I/O IS THE MAIN FACTOR FOR PERFORMANCE </li>
<li> COLUMNAR STORAGE, COMPRESSION PER COLUMN, ZONE MAPS, OPTIMIZED HARDWARE, LARGE DATA BLOCK SIZE. Example table (Id, Age, State): (123, 20, CA), (345, 25, WA), (678, 40, FL) </li>
<li> TEST: 2 BILLION RECORDS, 6 REPRESENTATIVE QUERIES </li>
<li> AMAZON REDSHIFT (2x HS1.8XL) vs. 32 NODES, 4.2 TB RAM, 1.6 PB </li>
<li> 12x - 150x FASTER </li>
<li> FROM 30 MINUTES DOWN TO 12 SECONDS </li>
<li> WAY LESS EXPENSIVE </li>
<li> 2x HS1.8XL: $3.65/hour, $32,000/year </li>
<li> HS1.XL pricing. On-Demand: $0.850/hour per instance, $0.425/hour per TB, $3,723/year per TB. 1 Year Reservation: $0.500/hour, $0.250/hour per TB, $2,190/year per TB. 3 Years Reservation: $0.228/hour, $0.114/hour per TB, $999/year per TB </li>
<li> October 2013. Intel Analytics on AWS, Assaf Araki. Intel Confidential </li>
<li> Agenda: Advanced Analytics @ Intel; Enterprise on the Cloud; Use Case </li>
<li> Intel AA Team, Advanced Analytics. Vision: make analytics a competitive advantage for Intel. Mission: solve strategic, high-value business-line problems; leverage analytics to grow Intel revenue. About the team: ~100 employees, corporate ownership of advanced analytics; big data and machine learning are key focus areas; skills: software engineering / decision science / business acumen. Value driven: ROI &gt; $10M and/or a key corporate problem as defined by VPs. Part of the Israel Academy computational research center </li>
<li> AA Overview: Big Data Analytics Platform. A highly scalable, hybrid platform to support a range of business use cases: prediction module, MPP, high-speed data loader. Heterogeneous data and batch-oriented advanced analytics; rich advanced analytics and real-time, in-database data mining capabilities </li>
<li> Enterprise on the Cloud: Why Cloud? Known reasons: reduce cost, universal access, scale fast. Additional reasons: a flexible &amp; agile platform (no need to certify each tool by the engineering team); a development accelerator (the R&amp;D team can start developing while engineering teams implement the platform on premise) </li>
<li> Use Case. Characteristics: CPU behavior data; size: 30 TB of data per month; type: structured data; processing: create aggregation facts, support ad hoc analysis, create ML solutions. Current status: data is sampled and processed on an SMP RDBMS, and it takes almost 24 hours to process the entire data. Problem statement: limited ability to analyze all the data </li>
<li> Enterprise on the Cloud: Platforms. On premise: a Hadoop platform exists, but no HBase; an MPP DB exists with machine-learning capabilities; a lower-cost platform to evaluate and purchase. Cloud: HBase on EMR; MPP DB on AWS Redshift. Decision: go for a POC on the Cloud </li>
<li> Enterprise on the Cloud: Evaluation Criteria. Capabilities: create statistics calculations. Cost of HW per TB: replication, compression. Performance: load, transformation, querying. Scalability. Ability to execute </li>
<li> Use Case: Preliminary Results. Dataset example: 34 GB of compressed data divided into files; ~1,500,000,000 records; 24 bytes per record compressed, 240 bytes uncompressed (~15 columns). Performance &amp; scalability on 8x 1XL nodes: load time for 32 files 2 hours (4 files: 5 hours); table size 202 GB (compression rate ~1.5:1). SQL aggregation statements: 38K records in 6 minutes; 14M records in 7 minutes; 66M records in 11 minutes (22 minutes on 4x 1XL); 939M records in 34 minutes (77 minutes on 4x 1XL) </li>
<li> Use Case: Capabilities and Cost. No current ability to write code (Java/C++/Python/R); statistics and algorithms are implemented in SQL. Compression is not straightforward, and cost is sensitive to the actual compression: 2.6:1 is the break-even point for an 8XL vs. a High Storage instance (16 cores, 48 TB), over 3 years with 100% utilization </li>
<li> Thank You! </li>
<li> USE CASE </li>
<li> AMAZON EC2, AMAZON DYNAMODB, AMAZON RDS, AMAZON REDSHIFT, AMAZON ELASTIC MAPREDUCE, AMAZON S3, DATA CENTER, AWS STORAGE GATEWAY </li>
<li> UPLOAD TO AMAZON S3: AWS IMPORT/EXPORT, AWS DIRECT CONNECT, DATA INTEGRATION, INTEGRATION SYSTEMS </li>
<li> MEMBER REGISTRATIONS: from 2 million (2011) to 15 million (2013) </li>
<li> 1,500,000+ NEW MEMBERS EACH MONTH </li>
<li> 1,200,000,000+ SOCIAL CONNECTIONS IMPORTED </li>
<li> Pipeline: user actions (join via Facebook, add a skill page, invite friends) generate trace events on the web servers; raw events land in Amazon S3; EMR Hive scripts process the content: they parse log files with regular expressions, process cookies into useful searchable data such as Session, UserId and API security token, and filter surplus info like internal varnish logging; aggregated data goes to Amazon Redshift; data analysts query it with Tableau, Excel and an internal web tool </li>
<li> ELASTIC DATA WAREHOUSE </li>
<li> Monthly reports on a new cluster </li>
<li> S3 → EMR → Redshift → Reporting and BI </li>
<li> OLTP Web Apps → DynamoDB → Redshift → Reporting and BI </li>
<li> OLTP ERP → RDBMS → Redshift → Reporting &amp; BI </li>
<li> OLTP ERP → RDBMS + Redshift → Reporting &amp; BI </li>
<li> JDBC/ODBC → Amazon Redshift </li>
<li> DATA WAREHOUSE BY AWS: simple to use and scalable; pay per use, no CAPEX; low cost for high performance; open, and integrates with existing BI tools </li>
<li> Speed and Agility. On Premise: fewer experiments, high cost of failure, less innovation. Cloud: frequent experiments, low cost of failure, more innovation </li>
</ul>
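The deck argues that I/O is the main performance factor and that columnar storage lets a query scan only the bytes of the columns it touches. A minimal sketch (hypothetical data and layouts, not Redshift's actual on-disk format) of why that matters for an aggregation over a single column:

```python
# Sketch: row-oriented vs column-oriented layout for a table like the
# deck's (Id, Age, State) example. A query such as SELECT avg(age) must
# scan whole records in a row store, but only the age column in a
# column store. Data and sizes here are illustrative, not Redshift's.
import struct

N = 100_000  # rows: (id: int32, age: int32, state: 2-char code)

# Row-oriented layout: each 10-byte record stored contiguously.
row_store = b"".join(
    struct.pack("<ii2s", i, 20 + i % 60, b"CA") for i in range(N)
)

# Column-oriented layout: the age column stored contiguously on its own.
age_column = b"".join(struct.pack("<i", 20 + i % 60) for i in range(N))

bytes_scanned_row = len(row_store)   # 10 bytes * N rows
bytes_scanned_col = len(age_column)  # 4 bytes * N rows
print(bytes_scanned_row, bytes_scanned_col)  # 1000000 400000
```

Per-column compression widens the gap further, since values within one column are far more uniform than whole records and compress better.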
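The per-TB figures on the HS1.XL pricing slide follow directly from the node price: each HS1.XL holds 2 TB of compressed data, so price per TB is half the hourly node price, and a year is 24 × 365 = 8,760 hours. The same arithmetic gives the $32,000/year figure for 2x HS1.8XL at $3.65/hour. A quick check:

```python
# Reproduce the HS1.XL pricing table from the hourly node prices.
# Each node stores 2 TB of compressed data; a year is 8,760 hours.
HOURS_PER_YEAR = 24 * 365
TB_PER_NODE = 2

for tier, node_price in [("On-Demand", 0.850),
                         ("1 Year Reservation", 0.500),
                         ("3 Years Reservation", 0.228)]:
    per_tb_hour = node_price / TB_PER_NODE
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    print(f"{tier}: ${per_tb_hour:.3f}/TB/hour, ${per_tb_year:,.0f}/TB/year")

# On-Demand: $0.425/TB/hour, $3,723/TB/year
# 1 Year Reservation: $0.250/TB/hour, $2,190/TB/year
# 3 Years Reservation: $0.114/TB/hour, $999/TB/year
```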

