Build Your Web Analytics with node.js, Amazon DynamoDB and Amazon EMR (BDT203) | AWS re:Invent 2013


DESCRIPTION

Want to learn how to build your own Google Analytics? Learn how to build a scalable architecture using node.js, Amazon DynamoDB, and Amazon EMR. This architecture is used by ScribbleLive to track billions of engagement minutes per month. In this session, we go over the code in node.js, how to store the data in Amazon DynamoDB, and how to roll up the data using Hadoop and Hive. Attend this session to learn how to move data quickly at any scale.

TRANSCRIPT

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Building Your Own Web Analytics Service with node.js, Amazon DynamoDB, and Amazon Elastic MapReduce

Jonathan Keebler - Founder, CTO - ScribbleLive

November 13, 2013

Who Am I?

•Jonathan Keebler @keebler

•Built video player for all CTV properties

–Worked on news sites like CTV, TSN, CP24

•CTO, Founder of ScribbleLive

•Bootstrapped a high scalability startup

–Credit card limit wasn’t that high, had to find cheap ways to handle the load of top-tier news sites

What is ScribbleLive?

•Leading provider of real-time engagement management solutions

•We enable real-time publication and syndication of digital content

•Our platform is transforming the way the world’s largest brands and media approach communication and content creation, creating true real-time engagement

Some of our customers

Today

•Learn to build your own analytics service

– Seriously, we’re going to do it

•node.js on Amazon EC2: web servers

•Amazon DynamoDB: database

•Hadoop/Hive on Amazon Elastic MapReduce (EMR): roll-up data

Why would we do this?

•ScribbleLive tracks “engagement minutes” (EMs) across all customer sites

– e.g., ESPN.com, CNN.com, Reuters.com

– EM = 1 minute of a user watching a webpage

– 2.5B per month, 120M+ per hour

•Big analytics providers couldn’t do it

– Didn’t have the features

– Too inaccurate

How are we going to do this?

[Architecture diagram: Visitors → Elastic Load Balancing → node.js instances (Amazon EC2) → DynamoDB]

DynamoDB: data structure

•Separate tables by timeframe

– Minute (written by node.js directly)

– Hour (EMR from minute data)

– Day (EMR from hour data)

– Month (EMR from day data)

•Structure

– Hash: Item (page id)

– Range: Time (rounded to min, hour, day)

– { Hits: 1 }
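As a sketch, the per-minute table from this slide could be created like so with the AWS SDK for JavaScript; the table name, region, and throughput numbers are assumptions, not values from the talk.

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

// Per-minute table: hash key = page id, range key = time rounded to the minute
dynamodb.createTable({
  TableName: 'MetricsMinute',   // assumed name; Hour/Day/Month tables follow the same shape
  AttributeDefinitions: [
    { AttributeName: 'Item', AttributeType: 'S' },
    { AttributeName: 'Time', AttributeType: 'N' }
  ],
  KeySchema: [
    { AttributeName: 'Item', KeyType: 'HASH' },
    { AttributeName: 'Time', KeyType: 'RANGE' }
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 10, WriteCapacityUnits: 50 }  // assumed values
}, function (err) {
  if (err) console.error(err);
});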

Elastic Load Balancing: AMI setup

•Custom AMI

– Loads source from SVN

– Launches node.js

Elastic Load Balancing: Load balancing

•1 load balancer

•Cookies keep unique user on same instance

•Auto-scaling

– CPU > 50% or network-in > 50 MB triggers new servers to come online and be added to Elastic Load Balancing
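A rough sketch of the scale-up trigger described above, using a CloudWatch alarm via the AWS SDK for JavaScript; the alarm name and scaling-policy ARN are placeholders, not values from the talk.

var AWS = require('aws-sdk');
var cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

// Fire the scale-up policy when average CPU across the group exceeds 50%
cloudwatch.putMetricAlarm({
  AlarmName: 'nodejs-scale-up-on-cpu',          // placeholder name
  Namespace: 'AWS/EC2',
  MetricName: 'CPUUtilization',
  Statistic: 'Average',
  Period: 300,
  EvaluationPeriods: 1,
  Threshold: 50,
  ComparisonOperator: 'GreaterThanThreshold',
  AlarmActions: ['arn:aws:autoscaling:REGION:ACCOUNT:scalingPolicy:PLACEHOLDER']  // placeholder ARN
}, function (err) {
  if (err) console.error(err);
});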

node.js: Overview of code

•Accepts GET /?item={ID}&uid={UserID}

•Dictionary/Array of how many GETs per item in this minute

– Hits[Minute][“{ID}”]++

– Example: Hits[“1/1/2014 1:23:00”][“abcd”]++

•Dictionary/Array of Users already counted in Item:Minute (prevent double-counting)

•At end of minute, write data back to DynamoDB
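A minimal sketch of that request handler in node.js; everything beyond the Hits/user bookkeeping named on the slide (the port, helper names, and the Seen dictionary) is an assumption.

var http = require('http');
var url = require('url');

var Hits = {};  // Hits[minute][itemId] -> GETs counted this minute
var Seen = {};  // Seen[minute]['itemId:userId'] -> true, prevents double-counting

function currentMinute() {
  // Unix time rounded down to the minute, matching the Time range key
  return Math.floor(Date.now() / 60000) * 60;
}

http.createServer(function (req, res) {
  var q = url.parse(req.url, true).query;
  var minute = currentMinute();
  Hits[minute] = Hits[minute] || {};
  Seen[minute] = Seen[minute] || {};
  var key = q.item + ':' + q.uid;
  if (q.item && q.uid && !Seen[minute][key]) {
    Seen[minute][key] = true;
    Hits[minute][q.item] = (Hits[minute][q.item] || 0) + 1;
  }
  res.writeHead(204);  // a tracking beacon needs no response body
  res.end();
}).listen(8080);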

node.js: Bulk writing to DynamoDB

•Writing all data back immediately in a loop = BAD!

– Throughput would spike in that ~second

– Would have to use higher throughput limit

– More $$$$

•Instead, figure out how many writes need to happen / 60 seconds = how many writes per second you should do
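A sketch of that pacing, run once at the end of each minute. The function signature is an assumption; writeHits is the hypothetical helper shown in the sketch after the next slide.

// Spread one minute's pending writes evenly across the next 60 seconds.
// hitsForMinute: { itemId: count }
function flushMinute(minute, hitsForMinute, writeHits) {
  var items = Object.keys(hitsForMinute);
  if (items.length === 0) return;
  var interval = 60000 / items.length;  // milliseconds between writes
  items.forEach(function (itemId, i) {
    setTimeout(function () {
      writeHits(itemId, minute, hitsForMinute[itemId]);
    }, i * interval);
  });
}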

node.js: Bulk writing to DynamoDB

•Call to DynamoDB per item:

– update: (atomic) add X to {ID}:{Minute}
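That atomic add might look like the following with the AWS SDK for JavaScript; the table name is the same assumed one as in the earlier sketch, and the ADD action creates the Hits attribute if it does not exist yet.

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

function writeHits(itemId, minute, count) {
  dynamodb.updateItem({
    TableName: 'MetricsMinute',  // assumed name
    Key: {
      Item: { S: itemId },
      Time: { N: String(minute) }
    },
    AttributeUpdates: {
      Hits: { Action: 'ADD', Value: { N: String(count) } }  // atomic increment
    }
  }, function (err) {
    if (err) console.error(err);
  });
}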

Hadoop: What we map and reduce

•To go from minute to hourly data

– Round every minute down to the nearest hour (floor( Minute / 3600 ) * 3600)

– e.g., a minute timestamp of 1384348860 (13:21 UTC) rounds down to the hour boundary 1384347600 (13:00 UTC)

– Sum the # of “Hits” from each data point

•Just look at the past 24 hours to save time

•Do the same for hourly to daily, daily to monthly

Hadoop: Hive scripts

INSERT OVERWRITE TABLE MetricsHourly
SELECT
  Item,
  (floor( Time / 3600 ) * 3600) AS Time,
  SUM(Hits) AS Hits,
  from_unixtime( floor( Time / 3600 ) * 3600 ) AS TimeFriendly
FROM Metrics
WHERE Time >= floor( unix_timestamp() / 86400 ) * 86400 - ( 86400 * 1 )
GROUP BY Item, floor( Time / 3600 ) * 3600;

Hadoop: Setting Up EMR

• “Start an Interactive Hive Session”

• Run a cron job every 15 minutes to check if the Hive job is complete

• If complete, download the newest Hive script and restart the job

• Amazon CloudWatch alarms if jobs take longer than 12 hours

Hadoop: Cron Job

#!/bin/sh
# Check whether a Hadoop job is already running
JOBID=$(hadoop job -list | grep job_ | cut -f1)
if [ -n "$JOBID" ]; then
  echo "Another job already running"
else
  echo "Starting Hive job..."
  echo `date` starting >> /var/log/metricsdaily_starting
  # Download the newest roll-up script and run it
  wget -qO- http://DEPLOY/metrics/rollups.sql > /tmp/rollups.sql && hive -f /tmp/rollups.sql
fi

Application API

•RESTful API in the language of your choice

•Calls to DynamoDB:

–query: Hash:{ID} w/ Range:{Time A}-{Time B}

•Since M-R could take a day to run, need to reconstruct hourly data from minutes for the most recent 24 hours

–e.g. if you want hourly data for the last 2 days, take 24 hourly data points from yesterday and 24*60 minute data points from today (convert to hourly data points in code)
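A sketch of that range query plus the in-code hourly reconstruction, again with the AWS SDK for JavaScript; the table name and the helper function are assumptions.

var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

// Fetch minute rows for one item between timeA and timeB (unix seconds)
// and fold them into hourly buckets, mirroring the Hive roll-up
function hourlyFromMinutes(itemId, timeA, timeB, callback) {
  dynamodb.query({
    TableName: 'MetricsMinute',  // assumed name
    KeyConditionExpression: '#i = :id AND #t BETWEEN :a AND :b',
    ExpressionAttributeNames: { '#i': 'Item', '#t': 'Time' },
    ExpressionAttributeValues: {
      ':id': { S: itemId },
      ':a': { N: String(timeA) },
      ':b': { N: String(timeB) }
    }
  }, function (err, data) {
    if (err) return callback(err);
    var hourly = {};
    data.Items.forEach(function (row) {
      var hour = Math.floor(Number(row.Time.N) / 3600) * 3600;  // same floor as the Hive script
      hourly[hour] = (hourly[hour] || 0) + Number(row.Hits.N);
    });
    callback(null, hourly);
  });
}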

Performance

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

BDT203
