AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
AWS Summit 2013 Tel Aviv Oct 16 – Tel Aviv, Israel
Jan Borch | AWS Solutions Architect
Data Analytics on Big Data
GENERATE STORE ANALYZE SHARE
THE COST OF DATA
GENERATION IS FALLING
Progress is not evenly distributed
              1980              Today
Cost          $14,000,000/TB    $30/TB      (450,000 ÷)
Capacity      100 MB            3 TB        (30,000 X)
Throughput    4 MB/s            200 MB/s    (50 X)
THE MORE DATA YOU COLLECT
THE MORE VALUE YOU CAN
DERIVE FROM IT
GENERATE STORE ANALYZE SHARE
Lower cost,
higher throughput
[Chart: as data volume grows, generated data far outpaces the data available for analysis - analysis remains highly constrained]
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012-2016 Forecast and 2011 Vendor Shares
GENERATE STORE ANALYZE SHARE
ACCELERATE
+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ ONLY PAY FOR WHAT YOU USE
+ AVAILABLE ON-DEMAND
= REMOVE CONSTRAINTS
GENERATE STORE ANALYZE SHARE
Amazon EC2
Amazon CloudFront
• Fluentd
• Flume
• Scribe
• Chukwa
• LogStash
output {
  s3 {
    bucket => "myBucket"
    aws_credentials_file => "~/cred.json"
    size_file => 125829120   # the plugin takes bytes; ~120 MB
  }
}
“Poor man’s Analytics”
Embed a tracking pixel:
http://www.poor-man-analytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-7019765-1&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analytics-architecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
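The tracking pixel above carries its payload in query-string parameters (utmhn, utmp, utmsr, utmul, utmcs and friends). A minimal sketch of how a log processor might decode such a request, assuming only the parameter names visible in the example URL:

```python
from urllib.parse import urlparse, parse_qs

def parse_pixel_request(url):
    """Decode the tracking fields of a 1x1-GIF request URL.

    Parameter names follow the example request above:
    utmhn = hostname, utmp = page path, utmsr = screen resolution,
    utmul = user language, utmcs = character encoding.
    """
    qs = parse_qs(urlparse(url).query)   # parse_qs also percent-decodes
    field = lambda key: qs.get(key, [None])[0]
    return {
        "host": field("utmhn"),
        "page": field("utmp"),
        "screen": field("utmsr"),
        "language": field("utmul"),
        "encoding": field("utmcs"),
    }

hit = parse_pixel_request(
    "http://www.poor-man-analytics.com/__track.gif"
    "?utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmul=en-us&utmp=%2F"
)
```

Aggregating these per-page records over the access logs stored in S3 is exactly the kind of job the analyze stage below hands to EMR.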
GENERATE STORE ANALYZE SHARE
AWS Import / Export
AWS Direct Connect
Amazon Elastic MapReduce
Generated and stored in AWS
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
Regional replication of AMIs and snapshots
Aggregation with S3Distcp
S3distcp on EMR job sample
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args \
'--src,s3://myawsbucket/cf,\
--dest,s3://myoutputbucket/aggregate,\
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
--targetSize,128,\
--outputCodec,lzo,\
--deleteOnSuccess'
GENERATE STORE ANALYZE SHARE
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2
AMAZON S3 SIMPLE STORAGE SERVICE
AMAZON DYNAMODB HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
• Durable & available: consistent, disk-only writes (SSD)
• Low latency: average reads < 5 ms, writes < 10 ms
• No administration
Ads table - not many rows, frequent (near-realtime) updates:

ad-id  advertiser  max-price  imps-to-deliver  imps-delivered
1      AAA         100        50000            1200
2      BBB         150        30000            2500

Profiles table - very many rows, updated in a batch manner:

user-id  attribute1  attribute2  attribute3  attribute4
A        XXX         XXX         XXX         XXX
B        YYY         YYY         YYY         YYY

Very general table structure
500,000 WRITES PER SECOND
DURING SUPER BOWL
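At write rates like that, counters such as imps-delivered are bumped with DynamoDB's atomic ADD update rather than read-modify-write. A sketch that builds the UpdateItem request for one impression batch - the table and attribute names come from the example table above, and the boto3 call in the comment is illustrative, not from the slides:

```python
def build_imps_update(ad_id, delivered):
    """Parameters for a boto3 DynamoDB UpdateItem call that atomically
    increments the imps-delivered counter for one ad (table and
    attribute names taken from the example Ads table above)."""
    return {
        "TableName": "Ads",
        "Key": {"ad-id": {"N": str(ad_id)}},
        # ADD increments in place, so concurrent writers never clobber
        # each other's counts
        "UpdateExpression": "ADD #d :n",
        "ExpressionAttributeNames": {"#d": "imps-delivered"},
        "ExpressionAttributeValues": {":n": {"N": str(delivered)}},
    }

params = build_imps_update(1, 25)
# with boto3: boto3.client("dynamodb").update_item(**params)
```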
AMAZON GLACIER RELIABLE LONG-TERM ARCHIVING
S3 Lifecycle policies:
• If an object is older than 5 months: archive it from Amazon S3 to Amazon Glacier
• If an object is older than 1 year: delete it from S3 (/dev/null)
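The two rules above map directly onto a bucket lifecycle configuration. A sketch in the dict shape that boto3's put_bucket_lifecycle_configuration expects, assuming "5 months" is approximated as 150 days (the rule ID and prefix are made up for the example):

```python
def lifecycle_config(prefix=""):
    """One lifecycle rule: transition to Glacier after ~5 months,
    expire (delete) after 1 year."""
    return {
        "Rules": [{
            "ID": "archive-then-delete",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            # ~5 months after creation, move the object to Glacier
            "Transitions": [{"Days": 150, "StorageClass": "GLACIER"}],
            # 1 year after creation, delete the object entirely
            "Expiration": {"Days": 365},
        }]
    }

cfg = lifecycle_config("logs/")
# with boto3: s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=cfg)
```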
AMAZON REDSHIFT FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was…
AMAZON REDSHIFT
A Whole Lot Simpler
A Lot Cheaper
A Lot Faster
AMAZON REDSHIFT
RUNS ON OPTIMIZED HARDWARE
HS1.8XL: 128 GB RAM, 16 cores, 16 TB compressed customer storage, 2 GB/sec scan rate
HS1.XL: 16 GB RAM, 2 cores, 2 TB compressed customer storage
30 MINUTES
DOWN TO
12 SECONDS
AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG
• Single Extra Large node (HS1.XL): 2 TB
• Cluster of 2-32 HS1.XL nodes: 4 TB - 64 TB
• Cluster of 2-100 Eight Extra Large (HS1.8XL) nodes: 32 TB - 1.6 PB
• Accessed via JDBC/ODBC
                    Price Per Hour for   Effective Hourly   Effective Annual
                    HS1.XL Single Node   Price Per TB       Price Per TB
On-Demand           $0.850               $0.425             $3,723
1-Year Reservation  $0.500               $0.250             $2,190
3-Year Reservation  $0.228               $0.114             $999
DATA WAREHOUSING DONE THE AWS WAY
No upfront costs, pay as you go
Really fast performance at a really low price
Open and flexible with support for popular tools
Easy to provision and scale up massively
USAGE SCENARIOS
Reporting Warehouse
Accelerated operational reporting
Support for short-time use cases
Data compression, index redundancy
[RDBMS (OLTP, ERP) -> Redshift -> Reporting and BI]
Data Integration Partners*
On-Premises Integration
[On-premises RDBMS (OLTP, ERP) -> Redshift -> Reporting and BI]
Live Archive for (Structured) Big Data
Direct integration with copy command
High velocity data
Data ages into Redshift
Low cost, high scale option for new apps
[DynamoDB (OLTP, web apps) -> Redshift -> Reporting and BI]
Cloud ETL for Big Data
Maintain online SQL access to historical logs
Transformation and enrichment with EMR
Longer history ensures better insight
[S3 -> Elastic MapReduce -> Redshift -> Reporting and BI]
create table cf_logs (
  d date,
  t char(8),
  edge char(4),
  bytes int,
  cip varchar(15),
  verb char(3),
  distro varchar(MAX),
  object varchar(MAX),
  status int,
  Referer varchar(MAX),
  agent varchar(MAX),
  qs varchar(MAX)
);
COPY into Amazon Redshift
copy cf_logs from 's3://cfri/cflogs-sm/E123ABCDEF/'
credentials
'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD'
GENERATE STORE ANALYZE SHARE
Amazon EC2
Amazon Elastic
MapReduce
AMAZON EC2 ELASTIC COMPUTE CLOUD
EC2 instance families:
• General purpose - m1.small: 1 virtual core, 1.7 GiB memory, moderate I/O performance
• Compute optimized - cc2.8xlarge: 32 virtual cores (2 x Intel Xeon), 60.5 GiB memory, 10 Gbit I/O performance
• Memory optimized - cr1.8xlarge: 32 virtual cores (2 x Intel Xeon), 240 GiB memory, 10 Gbit I/O performance, 240 GB SSD instance store
• Storage optimized - hi1.4xlarge: 16 virtual cores, 60.5 GiB memory, 10 Gbit I/O performance, 2 x 1 TB SSD instance store
• Storage optimized - hs1.8xlarge: 16 virtual cores, 117 GiB memory, 10 Gbit I/O performance, 24 x 2 TB instance store
ON A SINGLE INSTANCE
Compute time: 4 h; cost: 4 h x $2.10 = $8.40
ON MULTIPLE INSTANCES (4)
Compute time: 1 h; cost: 1 h x 4 x $2.10 = $8.40
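The arithmetic above is the core of the elasticity argument: the same instance-hours cost the same whether they run serially or in parallel, so parallelizing buys time for free. A one-liner to make it explicit (the $2.10 rate is the one from the slide):

```python
def cluster_cost(hours, instances, hourly_rate):
    # total cost = wall-clock hours x instance count x per-instance rate
    return hours * instances * hourly_rate

single = cluster_cost(4, 1, 2.1)    # one instance, 4 hours
parallel = cluster_cost(1, 4, 2.1)  # four instances, 1 hour
# same ~$8.40 either way, but the parallel run finishes 4x sooner
```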
3 HOURS AT $4,828.85/hr
instead of $20+ million in infrastructure
• A FRAMEWORK
• SPLITS DATA INTO PIECES
• LETS PROCESSING OCCUR
• GATHERS THE RESULTS
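The steps above (split the data, process each piece, gather the results) are easiest to see on word-count, the canonical MapReduce example. A toy, in-process sketch - a real Hadoop cluster does the same thing across many machines:

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # process one piece independently: emit a (word, 1) pair per word
    return [(word, 1) for word in record.split()]

def reduce_phase(pairs):
    # gather the results: sum the emitted counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

records = ["the quick brown fox",          # the data, split into pieces
           "jumps over the lazy dog",
           "the end"]
mapped = [map_phase(r) for r in records]   # processing occurs per piece
result = reduce_phase(chain(*mapped))      # results are gathered
```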
AMAZON ELASTIC
MAPREDUCE HADOOP AS A SERVICE
How Amazon Elastic MapReduce extends your corporate data center with an elastic one:
1. Application data and logs for analysis are pushed to S3
2. Elastic MapReduce starts a master node to control the analysis
3. Elastic MapReduce starts the Hadoop cluster
4. Many hundreds or thousands of nodes can be added
5. The cluster is disposed of when the job completes
6. Results of the analysis are pulled back into your systems
Your spreadsheet does not scale…
PIG
A real Pig script (used at Twitter)
Run on a sample dataset on your laptop:
$ pig -f myPigFile.q
Run the same script on a 50-node Hadoop cluster:
$ ./elastic-mapreduce --create
--name "$USER's Pig JobFlow"
--pig-script
--args s3://myawsbucket/mypigquery.q
--instance-type m1.xlarge --instance-count 50
$ elastic-mapreduce -j j-21IMWIA28LRK1
--add-instance-group task
--instance-count 10
--instance-type m1.xlarge
GENERATE STORE ANALYZE SHARE
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2
PUBLIC DATA SETS http://aws.amazon.com/publicdatasets
GENERATE STORE ANALYZE SHARE
AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution and retry logic
Map data dependencies
Create and manage compute resources
GENERATE STORE ANALYZE SHARE
Recap of the services across the pipeline:
• Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, data on Amazon EC2
• AWS Import/Export, AWS Direct Connect
• Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, data on Amazon EC2
• Amazon EC2, Amazon Elastic MapReduce
• AWS Data Pipeline
FROM DATA TO ACTIONABLE INFORMATION
Shlomi Vaknin
Amazon AWS generates big data core component for Ginger Software
Oct 16, 2013
English writing assistant
An open platform for personal assistants
• Users talk naturally with any mobile application; Ginger understands and executes their commands
• An end-to-end Speech-to-Action solution
• First open platform for creating personal assistants
Natural language speech interface for mobile apps
[Architecture diagram: Proofreader, Rephrase, Writing Assistant, and Personal Coach applications on top of the PA Platform - Speech Engine, Query Understanding, Semantic Model, NLP/NLU algorithms, and a DB backed by a Web Corpus language model, a Domain Corpus, and a User Corpus]
• A collection of all the language we found on the internet, accessible and pre-processed
• Has to contain lots and lots of sentences
• Needs to represent “common written language”
• Accessible both for offline (research) and online (service) uses
Our platform depends on scanning and indexing all the language we can find on the internet
1. Crawling [own cluster, EMR+S3]
   • Generated about 50 TB of raw data
   • Reduced to about 5 TB of text data
2. Post-processing [EMR+S3]
   • Tokenize • Normalize • Split to n-grams
   • Generalize • Count • Filter
3. Indexing/Serving [EMR+S3]
   • Key/Value - has to be super fast
   • Full-text search
4. Archiving [S3+Glacier]
   • Keeping data available for later research while minimizing cost
• Mainly an NLP task
• So we picked up … (it's a Lisp, and integrates very well with EMR, S3, etc.)
• n-gram counting: "How are you" → How are you, How are, are you, How, are, you
  • Lots of grams are repeated
  • Generalize contextually similar tokens
• Fits the map-reduce paradigm very well
  • Most parts can be trivially parallelized
  • One part is sequential by grams
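The n-gram step above ("How are you" → How are you, How are, are you, How, are, you) takes only a few lines; a Counter then handles the "lots of grams are repeated" part. A minimal sketch:

```python
from collections import Counter

def ngrams(tokens, max_n=3):
    # every contiguous n-gram of length 1..max_n, joined back into text
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

# matches the slide's example: 3 unigrams + 2 bigrams + 1 trigram
grams = ngrams("How are you".split())

# counting collapses the repeats - the map-reduce-friendly part
counts = Counter(ngrams("how are you doing how are you".split()))
```

In the map-reduce framing, each mapper emits grams for its slice of sentences and the reducers sum the counts per gram, exactly like word-count one level up.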
• EMR cluster node types: Master, Task, Core
• Ratio between Core and Task nodes - we expected a very large output (100 TB)
• m2.4xlarge core node capacity (1690 GB) determines the number of core nodes
• Estimate the total number of map tasks
• Final specs:

  Node Type  Instance     Count
  MASTER     cc2.8xlarge  1
  CORE       m2.4xlarge   200
  TASK       m2.2xlarge   500
• Job took about 30 hours to complete
• We generated nearly 100TB of output data
• During map phase, the cluster achieved nearly 100% utilization
• After initial filtration, 20TB remained
• Stay up to date with AMI releases - don't stick to an old AMI just because it previously worked
• Use the Job-Tracker; use custom progress notification; increase mapred.task.timeout
• Limit the number of concurrent map tasks - use the minimum number that gets you close to 100% CPU
• Beware of spot nodes - if you ask for too many you might compete against your own price
• Stash the data for later use, to reduce cost
• Glacier offers very cheap storage
• Important things to know about Glacier:
  • Restoring the data could be VERY expensive
  • The key to reducing restore costs: restore SLOWLY
  • There is no built-in mechanism to restore slowly - use a 3rd-party application or do it manually
• Glacier is very useful if your use case matches its design
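Since there is no built-in throttle, "restore slowly" has to be implemented client-side. A sketch that spreads restore requests over time in small batches; restore_fn stands in for the real per-object call (e.g. S3's RestoreObject), and the batch size and pause are illustrative knobs, not values from the talk:

```python
import time

def slow_restore(keys, restore_fn, batch_size=100, pause_sec=3600.0):
    """Issue restore requests in small batches, pausing between
    batches, so the retrieval rate (and therefore cost) stays low."""
    batches = 0
    for i in range(0, len(keys), batch_size):
        for key in keys[i:i + batch_size]:
            restore_fn(key)           # kick off one archive retrieval
        batches += 1
        if i + batch_size < len(keys):
            time.sleep(pause_sec)     # spread retrievals over time
    return batches

# dry run with a no-cost stand-in for the real restore call
restored = []
n = slow_restore([f"log-{i}" for i in range(250)],
                 restored.append, batch_size=100, pause_sec=0.0)
# → 3 batches, all 250 keys requested
```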
• EMR/S3 provides great power and elasticity, to grow and shrink as required
• Do your homework before running large jobs!
• Our platform depends on scanning and indexing all the language we can find on the internet
• To achieve this Ginger Software makes heavy use of Amazon EMR
• With Amazon EMR, Ginger Software can scale up vast amounts of computing power and scale back down when it is not needed
• This gives Ginger Software the ability to create the world’s most accurate language enhancement technology without the need to have expensive hardware lying idle during quiet periods
We are hiring! shlomiv@gingersoftware.com
Thank You!