grokking engineering - data analytics infrastructure at viki - huy nguyen
TRANSCRIPT
Analytics Infrastructure @ Viki
Grokking Engineering
Dec 2014
Talk Outline
• Introduction
• Background + Problems
• Data Architecture
– Data Collection & Storage
– Data Processing & Aggregation
– Data Presentation & Visualization
– Real-time dashboard and alerts
• Other Comments
What’s Viki?
Link
YouTube - A Typical Web Application
• Daily/weekly registered users by platform and country?
• How many video uploads do we get every day?
Behavioral Data? (vs Transactional Data)
• Transactional Data
Mission-critical data (e.g. user accounts, bookings, payments)
Often fixed schema
Lower volume
Transaction control
• Behavioral Data
Logging data (e.g. page view, video start, ad impression)
Often semi-structured (JSON)
Huge volume
No transaction control
Data Infrastructure
1. Collect and Store Data
2. Centralize and Process Data
3. Present and Visualize Data
1. Collect & Store Data
{
"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04",
"user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920",
"app_id":"100004a", "event":"video_play",
"timed_comment":"off", "stream_quality":"variable",
"bottom_subtitle":"en", "device_size":"tablet",
"feature":"auto_play", "video_id":"1008912v",
"subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846",
"ip":"99.232.169.246", "country":"ca",
"city_name":"Toronto", "region_name":"ON"
}
• Samples: page view, video start, ad impression, etc.
• Behavioral Data
Semi-structured (JSON)
Massive volume (100M+/day)
Does not fit traditional relational databases
• fluentd
Scalable
Extensible
Forwards data to Hadoop, MongoDB, PostgreSQL, etc.
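To make the bullets above concrete, here is a minimal Python sketch of what the client side of this pipeline emits: one JSON object per line with all-string values, the same shape as the sample event. The field names follow the sample payload; the helper names are illustrative, not Viki's actual code.

```python
import json
import time

def make_event(event_name, props):
    """Build a semi-structured behavioral event as a flat dict.

    All values are strings, matching the sample payloads in the talk;
    't' is the Unix timestamp attached at emit time.
    """
    event = {"event": event_name, "t": str(int(time.time()))}
    event.update(props)
    return event

def serialize(event):
    """One JSON object per line -- the framing log collectors expect."""
    return json.dumps(event, sort_keys=True) + "\n"

line = serialize(make_event("video_play", {"video_id": "1008912v", "country": "ca"}))
```

Keeping every value a string sidesteps type drift across client versions; typing is deferred to the aggregation layer.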
1. Collect & Store Data
Hydration System
• Inject time-sensitive information into events
Hydration System
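The slides do not show the hydration code itself, but the idea can be sketched in Python: at collection time, resolve information that is only valid at that moment (here, a hypothetical IP-to-location lookup, since IP geolocation mappings drift over time) and inject it into the event.

```python
def hydrate(event, geoip):
    """Inject time-sensitive fields into a raw event at collection time.

    `geoip` is a hypothetical lookup (ip -> location dict). Resolving
    geography when the event arrives, rather than months later at query
    time, keeps the stored event accurate.
    """
    loc = geoip.get(event.get("ip"), {})
    hydrated = dict(event)
    hydrated.setdefault("country", loc.get("country"))
    hydrated.setdefault("city_name", loc.get("city"))
    return hydrated

geoip = {"99.232.169.246": {"country": "ca", "city": "Toronto"}}
ev = hydrate({"event": "video_play", "ip": "99.232.169.246"}, geoip)
```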
2. Centralizing & Processing Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
2. Centralizing & Processing Data
Getting All Data To 1 Place
thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
thor db:cp --source A --destination B -t reporting.video_plays --increment
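The exact semantics of `thor db:cp` are not shown in the talk. As an illustration only, here is a Python sketch of a high-water-mark copy, which is what a flag like `--increment` plausibly does: copy only rows whose key exceeds the largest key already present at the destination.

```python
def incremental_copy(source_rows, dest_rows, key="id"):
    """Append only rows above the destination's high-water mark.

    Assumes `key` is monotonically increasing; a full copy happens
    when the destination is empty. Names here are hypothetical.
    """
    watermark = max((r[key] for r in dest_rows), default=None)
    new = [r for r in source_rows
           if watermark is None or r[key] > watermark]
    dest_rows.extend(new)
    return len(new)

dest = [{"id": 1}, {"id": 2}]
copied = incremental_copy([{"id": 1}, {"id": 2}, {"id": 3}], dest)
```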
{"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a", "event":"video_play", "timed_comment":"off", "stream_quality":"variable", "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", "subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246", "country":"ca", "city_name":"Toronto", "region_name":"ON"}…
date source partner event video_id country cnt
2013-09-29 ios viki video_play 1008912v ca 2
2013-09-29 android viki video_play 1008912v us 18
…
b) Clickstream Data: Hadoop → Analytics DB
Hadoop → Aggregation (Hive) → Export output / Sqoop → PostgreSQL
SELECT
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'],
COUNT(1) as cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'];
Simple Aggregation SQL
The Data Is Not Clean!
Event properties and names change as we develop:
Old version: {"user_id": "152u", "country": "sg"}
New version: {"user_id": "152", "country_code": "sg"}
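A hedged Python sketch of how such schema drift can be normalized, mirroring the logic in the cleanup SQL later in the talk (COALESCE over `country`/`country_code`, and the `'u'` suffix appended to numeric user ids). The function name and exact rules are illustrative:

```python
def normalize(event):
    """Map both event-schema generations onto one canonical shape.

    Mirrors the cleanup SQL: prefer 'country', fall back to
    'country_code', lowercase it, and make sure user ids carry
    the 'u' suffix.
    """
    out = dict(event)
    country = out.pop("country", None)
    code = out.pop("country_code", None)
    value = country or code
    if value:
        out["country"] = value.strip().lower()
    uid = out.get("user_id")
    if uid and uid[-1].isdigit():  # numeric tail means the suffix is missing
        out["user_id"] = uid + "u"
    return out
```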
SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['app_id'] AS `app_id`,
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END AS `partner`,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] )
END AS `source` ,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ) AS `country` ,
COALESCE ( v['device_size'] ,v['device'] ) AS `device`,
COUNT( 1 ) AS `cnt`
FROM events
WHERE time >= 1380326400 AND time <= 1380412799
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), v['app_id'],
CASE WHEN v['app_ver'] LIKE '%_ax'
THEN 'axis' WHEN v['app_ver'] LIKE '%_kd'
THEN 'amazon' WHEN v['app_ver'] LIKE '%_kf'
THEN 'amazon' WHEN v['app_ver'] LIKE '%_lv'
THEN 'lenovo' WHEN v['app_ver'] LIKE '%_nx'
THEN 'nexian' WHEN v['app_ver'] LIKE '%_sf'
THEN 'smartfren' WHEN v['app_ver'] LIKE '%_vp'
THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END ,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) )
THEN 'embed' WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] ) END,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ),
COALESCE ( v['device_size'] ,v['device'] );
(Not so) simple Aggregation SQL
Hadoop
UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');

UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');

UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
AND source IS NULL
AND partner IS NULL
AND app_id IS NULL;
…post-import cleanup
PostgreSQL
Cleaning Up Data Takes Lots of Time
Transforming Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Transforming Data
…
Table A
Table B
…
Analytics DB (PostgreSQL)

a) Reducing Table Size By Dropping a Dimension (Aggregation)

video_plays_with_video_id (20M records):
date source partner event video_id country cnt
2013-09-29 ios viki video_play 1v ca 2
2013-09-29 ios viki video_play 2v ca 18
…

video_plays (4M records):
date source partner event country cnt
2013-09-29 ios viki video_play ca 20
…

PostgreSQL
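The same aggregation can be sketched in Python for intuition: summing the pre-computed counts over the remaining dimensions once the dropped one is removed is exactly what shrinks the table. Data and field names here are illustrative:

```python
from collections import Counter

def drop_dimension(rows, drop):
    """Re-aggregate pre-counted rows after removing one dimension.

    Each row is (dims, cnt); rows that differ only in the dropped
    dimension collapse into one, and their counts are summed.
    """
    totals = Counter()
    for dims, cnt in rows:
        key = tuple(sorted((k, v) for k, v in dims.items() if k != drop))
        totals[key] += cnt
    return totals

rows = [
    ({"date": "2013-09-29", "source": "ios", "video_id": "1v", "country": "ca"}, 2),
    ({"date": "2013-09-29", "source": "ios", "video_id": "2v", "country": "ca"}, 18),
]
totals = drop_dimension(rows, "video_id")
```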
b) Injecting Extra Fields For Analysis

shows (1 → n) videos

shows:
id title
1c Game of Thrones
2c How I Met Your Mother
…

shows, with derived num_videos:
id title num_videos
1c Game of Thrones 30
2c How I Met Your Mother 16
…

PostgreSQL

Injecting Extra Fields For Analysis

containers (1 → n) videos

containers:
id title
1c Game of Thrones
2c My Girlfriend Is A Gumiho
…

containers, with derived video_count:
id title video_count
1c Game of Thrones 30
2c My Girlfriend Is A Gumiho 16
…

PostgreSQL
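A small Python sketch of this denormalization: count the rows on the n-side of the relationship and attach the result as a derived column on the 1-side. The function name and sample data are illustrative:

```python
from collections import Counter

def inject_video_count(containers, video_container_ids):
    """Denormalize a 1-n relationship into a derived count column.

    `containers` maps id -> title; `video_container_ids` holds one
    container id per video. Containers with no videos get count 0.
    """
    counts = Counter(video_container_ids)
    return [
        {"id": cid, "title": title, "video_count": counts.get(cid, 0)}
        for cid, title in containers.items()
    ]

containers = {"1c": "Game of Thrones", "2c": "My Girlfriend Is A Gumiho"}
videos = ["1c"] * 30 + ["2c"] * 16
rows = inject_video_count(containers, videos)
```

Materializing the count trades storage for query speed: analysts no longer pay for the join on every report.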
Chunk Tables By Month
video_plays_2013_06
video_plays_2013_07
video_plays_2013_08
video_plays_2013_09
…
ALTER TABLE video_plays_2013_09 INHERIT video_plays;

ALTER TABLE video_plays_2013_09
ADD CONSTRAINT video_plays_2013_09_date_check
CHECK (date >= '2013-09-01' AND date < '2013-10-01');
video_plays (parent table)
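The monthly chunking convention can be sketched in Python: derive the chunk table name from a row's date, and generate the half-open date range its CHECK constraint should cover. Helper names are illustrative:

```python
from datetime import date

def chunk_table(base, d):
    """Name of the monthly chunk table a row dated `d` belongs to."""
    return f"{base}_{d.year}_{d.month:02d}"

def chunk_check(year, month):
    """Half-open date range for one chunk's CHECK constraint.

    Half-open ranges ([start, next_month)) tile the calendar with no
    gaps or overlaps, which lets the planner skip irrelevant chunks.
    """
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return f"date >= '{start}' AND date < '{end}'"
```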
Managing Job Dependency
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Managing Job Dependency
…
Job A
Job B
…
Analytics DB (PostgreSQL)
Managing Job Dependency
…
tableA
tableB
…
Analytics DB (PostgreSQL)
Azkaban: cron dependency management
(Viki Cron Dependency Graph)
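At its core, dependency management means running jobs in topological order. A minimal Python sketch using the standard library's `graphlib`, with hypothetical job names; a scheduler like Azkaban layers scheduling, retries, and monitoring on top of this ordering:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each job maps to the set of jobs
# that must finish before it can start.
deps = {
    "hive_aggregation": {"import_events"},
    "export_to_postgres": {"hive_aggregation"},
    "build_reports": {"export_to_postgres", "copy_prod_tables"},
}

# static_order() yields every job (including ones named only as
# dependencies) in an order that respects all the edges.
order = list(TopologicalSorter(deps).static_order())
```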
3. Data Presentation and Visualization
Query Reports
Summary report
• Higher level view of metrics
• See changes over time
• (screen shot)
Data Explorer: "The world is your oyster"
4. Real Time Infrastructure
Real Time Infrastructure (Apache Storm)
Real Time Dashboard
Alerts
Know when the house is burning down!
Then: Global Content Source and Consumption
Our Technology Stack
• Languages/Frameworks
– Ruby, Rails, Python, Go, JavaScript, Node.js
– Fluentd (Log collector)
– Java, Apache Storm, Kestrel
• Databases
– PostgreSQL, MongoDB, Redis
– Hadoop/Hive, Amazon Redshift
– Amazon Elastic MapReduce
Hadoop vs. Amazon Redshift
• Hadoop is a big-data storage and processing platform
– HDFS: data-storage layer
– YARN: resource management
– MapReduce/Pig/Hive/Spark: processing layer
• Amazon Redshift is an MPP (massively parallel processing) database
– Columnar storage, built for analytics workloads
– OLAP: Online Analytical Processing
– Examples of this class: Vertica, Amazon Redshift, ParAccel
Recap
Thank You!
http://engineering.viki.com/blog/2014/data-warehouse-and-analytics-infrastructure-at-viki/
http://bit.ly/viki-datawarehouse
engineering.viki.com