grokking engineering - data analytics infrastructure at viki - huy nguyen
TRANSCRIPT
Analytics Infrastructure @ Viki
Grokking Engineering
Dec 2014
Talk Outline
• Introduction
• Background + Problems
• Data Architecture
– Data Collection & Storage
– Data Processing & Aggregation
– Data Presentation & Visualization
– Real-time dashboard and alerts
• Other Comments
What’s Viki?
Link
YouTube - A Typical Web Application
• Daily/weekly registered users by platform and country?
• How many video uploads do we get every day?
Behavioral Data? (vs Transactional Data)
• Transactional Data
Mission-critical data (e.g. user accounts, bookings, payments)
Often fixed schema
Lower volume
Transaction control
• Behavioral Data
Logging data (e.g. page view, video start, ad impression)
Often semi-structured (JSON)
Huge volume
No transaction control
Data Infrastructure
1. Collect and Store Data
2. Centralize and Process Data
3. Present and Visualize Data
1. Collect & Store Data
{
"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04",
"user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920",
"app_id":"100004a", "event":"video_play",
"timed_comment":"off", "stream_quality":"variable",
"bottom_subtitle":"en", "device_size":"tablet",
"feature":"auto_play", "video_id":"1008912v",
"subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846",
"ip":"99.232.169.246", "country":"ca",
"city_name":"Toronto", "region_name":"ON"
}
• Samples: page view, video start, ad impression, etc.
• Behavioral Data
Semi-structured (JSON)
Massive volume (100M+/day)
Does not fit traditional relational databases
• fluentd
Scalable
Extensible
Forwards data to Hadoop, MongoDB, PostgreSQL, etc.
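To make the bullets above concrete, here is a minimal Python sketch of what the client side of this pipeline emits: one JSON object per line with all-string values, the same shape as the sample event. The field names follow the sample payload; the helper names are illustrative, not Viki's actual code.

```python
import json
import time

def make_event(event_name, props):
    """Build a semi-structured behavioral event as a flat dict.

    All values are strings, matching the sample payloads in the talk;
    't' is the Unix timestamp attached at emit time.
    """
    event = {"event": event_name, "t": str(int(time.time()))}
    event.update(props)
    return event

def serialize(event):
    """One JSON object per line -- the framing log collectors expect."""
    return json.dumps(event, sort_keys=True) + "\n"

line = serialize(make_event("video_play", {"video_id": "1008912v", "country": "ca"}))
```

Keeping every value a string sidesteps type drift across client versions; typing is deferred to the aggregation layer.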
1. Collect & Store Data
Hydration System
• Inject time-sensitive information into events
Hydration System
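The slides do not show the hydration code itself, but the idea can be sketched in Python: at collection time, resolve information that is only valid at that moment (here, a hypothetical IP-to-location lookup, since IP geolocation mappings drift over time) and inject it into the event.

```python
def hydrate(event, geoip):
    """Inject time-sensitive fields into a raw event at collection time.

    `geoip` is a hypothetical lookup (ip -> location dict). Resolving
    geography when the event arrives, rather than months later at query
    time, keeps the stored event accurate.
    """
    loc = geoip.get(event.get("ip"), {})
    hydrated = dict(event)
    hydrated.setdefault("country", loc.get("country"))
    hydrated.setdefault("city_name", loc.get("city"))
    return hydrated

geoip = {"99.232.169.246": {"country": "ca", "city": "Toronto"}}
ev = hydrate({"event": "video_play", "ip": "99.232.169.246"}, geoip)
```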
2. Centralizing & Processing Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
2. Centralizing & Processing Data
Getting All Data To 1 Place
thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
thor db:cp --source A --destination B -t reporting.video_plays --increment
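The exact semantics of `thor db:cp` are not shown in the talk. As an illustration only, here is a Python sketch of a high-water-mark copy, which is what a flag like `--increment` plausibly does: copy only rows whose key exceeds the largest key already present at the destination.

```python
def incremental_copy(source_rows, dest_rows, key="id"):
    """Append only rows above the destination's high-water mark.

    Assumes `key` is monotonically increasing; a full copy happens
    when the destination is empty. Names here are hypothetical.
    """
    watermark = max((r[key] for r in dest_rows), default=None)
    new = [r for r in source_rows
           if watermark is None or r[key] > watermark]
    dest_rows.extend(new)
    return len(new)

dest = [{"id": 1}, {"id": 2}]
copied = incremental_copy([{"id": 1}, {"id": 2}, {"id": 3}], dest)
```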
{"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a", "event":"video_play", "timed_comment":"off", "stream_quality":"variable", "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", "subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246", "country":"ca", "city_name":"Toronto", "region_name":"ON"}…
date source partner event video_id country cnt
2013-09-29 ios viki video_play 1008912v ca 2
2013-09-29 android viki video_play 1008912v us 18
…
b) Clickstream Data: Hadoop → Analytics DB
Hadoop → Aggregation (Hive) → Export output / Sqoop → PostgreSQL
SELECT
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'],
COUNT(1) as cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'];
Simple Aggregation SQL
The Data Is Not Clean!
Event properties and names change as we develop:
Old version: {"user_id": "152u", "country": "sg"}
New version: {"user_id": "152", "country_code": "sg"}
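A hedged Python sketch of how such schema drift can be normalized, mirroring the logic in the cleanup SQL later in the talk (COALESCE over `country`/`country_code`, and the `'u'` suffix appended to numeric user ids). The function name and exact rules are illustrative:

```python
def normalize(event):
    """Map both event-schema generations onto one canonical shape.

    Mirrors the cleanup SQL: prefer 'country', fall back to
    'country_code', lowercase it, and make sure user ids carry
    the 'u' suffix.
    """
    out = dict(event)
    country = out.pop("country", None)
    code = out.pop("country_code", None)
    value = country or code
    if value:
        out["country"] = value.strip().lower()
    uid = out.get("user_id")
    if uid and uid[-1].isdigit():  # numeric tail means the suffix is missing
        out["user_id"] = uid + "u"
    return out
```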
SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['app_id'] AS `app_id`,
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END AS `partner`,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] )
END AS `source` ,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ) AS `country` ,
COALESCE ( v['device_size'] ,v['device'] ) AS `device`,
COUNT( 1 ) AS `cnt`
FROM events
WHERE time >= 1380326400 AND time <= 1380412799
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), v['app_id'],
CASE WHEN v['app_ver'] LIKE '%_ax'
THEN 'axis' WHEN v['app_ver'] LIKE '%_kd'
THEN 'amazon' WHEN v['app_ver'] LIKE '%_kf'
THEN 'amazon' WHEN v['app_ver'] LIKE '%_lv'
THEN 'lenovo' WHEN v['app_ver'] LIKE '%_nx'
THEN 'nexian' WHEN v['app_ver'] LIKE '%_sf'
THEN 'smartfren' WHEN v['app_ver'] LIKE '%_vp'
THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END ,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) )
THEN 'embed' WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] ) END,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ),
COALESCE ( v['device_size'] ,v['device'] );
(Not so) simple Aggregation SQL
Hadoop
UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');

UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';

UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');

UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
AND source IS NULL
AND partner IS NULL
AND app_id IS NULL;
…post-import cleanup
PostgreSQL
Cleaning Up Data Takes Lots of Time
Transforming Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Transforming Data
…
Table A
Table B
…
Analytics DB (PostgreSQL)

a) Reducing Table Size By Dropping a Dimension (Aggregation)

video_plays_with_video_id (20M records):
date source partner event video_id country cnt
2013-09-29 ios viki video_play 1v ca 2
2013-09-29 ios viki video_play 2v ca 18
…

video_plays (4M records):
date source partner event country cnt
2013-09-29 ios viki video_play ca 20
…

PostgreSQL
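The same aggregation can be sketched in Python for intuition: summing the pre-computed counts over the remaining dimensions once the dropped one is removed is exactly what shrinks the table. Data and field names here are illustrative:

```python
from collections import Counter

def drop_dimension(rows, drop):
    """Re-aggregate pre-counted rows after removing one dimension.

    Each row is (dims, cnt); rows that differ only in the dropped
    dimension collapse into one, and their counts are summed.
    """
    totals = Counter()
    for dims, cnt in rows:
        key = tuple(sorted((k, v) for k, v in dims.items() if k != drop))
        totals[key] += cnt
    return totals

rows = [
    ({"date": "2013-09-29", "source": "ios", "video_id": "1v", "country": "ca"}, 2),
    ({"date": "2013-09-29", "source": "ios", "video_id": "2v", "country": "ca"}, 18),
]
totals = drop_dimension(rows, "video_id")
```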
b) Injecting Extra Fields For Analysis

shows (1 → n) videos

shows:
id title
1c Game of Thrones
2c How I Met Your Mother
…

shows, with derived num_videos:
id title num_videos
1c Game of Thrones 30
2c How I Met Your Mother 16
…

PostgreSQL

Injecting Extra Fields For Analysis

containers (1 → n) videos

containers:
id title
1c Game of Thrones
2c My Girlfriend Is A Gumiho
…

containers, with derived video_count:
id title video_count
1c Game of Thrones 30
2c My Girlfriend Is A Gumiho 16
…

PostgreSQL
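A small Python sketch of this denormalization: count the rows on the n-side of the relationship and attach the result as a derived column on the 1-side. The function name and sample data are illustrative:

```python
from collections import Counter

def inject_video_count(containers, video_container_ids):
    """Denormalize a 1-n relationship into a derived count column.

    `containers` maps id -> title; `video_container_ids` holds one
    container id per video. Containers with no videos get count 0.
    """
    counts = Counter(video_container_ids)
    return [
        {"id": cid, "title": title, "video_count": counts.get(cid, 0)}
        for cid, title in containers.items()
    ]

containers = {"1c": "Game of Thrones", "2c": "My Girlfriend Is A Gumiho"}
videos = ["1c"] * 30 + ["2c"] * 16
rows = inject_video_count(containers, videos)
```

Materializing the count trades storage for query speed: analysts no longer pay for the join on every report.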
Chunk Tables By Month
video_plays_2013_06
video_plays_2013_07
video_plays_2013_08
video_plays_2013_09
…
ALTER TABLE video_plays_2013_09 INHERIT video_plays;

ALTER TABLE video_plays_2013_09
ADD CONSTRAINT video_plays_2013_09_date_check
CHECK (date >= '2013-09-01' AND date < '2013-10-01');
video_plays (parent table)
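The monthly chunking convention can be sketched in Python: derive the chunk table name from a row's date, and generate the half-open date range its CHECK constraint should cover. Helper names are illustrative:

```python
from datetime import date

def chunk_table(base, d):
    """Name of the monthly chunk table a row dated `d` belongs to."""
    return f"{base}_{d.year}_{d.month:02d}"

def chunk_check(year, month):
    """Half-open date range for one chunk's CHECK constraint.

    Half-open ranges ([start, next_month)) tile the calendar with no
    gaps or overlaps, which lets the planner skip irrelevant chunks.
    """
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return f"date >= '{start}' AND date < '{end}'"
```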
Managing Job Dependency
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Managing Job Dependency
…
Job A
Job B
…
Analytics DB (PostgreSQL)
Managing Job Dependency
…
tableA
tableB
…
Analytics DB (PostgreSQL)
Azkaban: cron dependency management
(Viki Cron Dependency Graph)
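At its core, dependency management means running jobs in topological order. A minimal Python sketch using the standard library's `graphlib`, with hypothetical job names; a scheduler like Azkaban layers scheduling, retries, and monitoring on top of this ordering:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each job maps to the set of jobs
# that must finish before it can start.
deps = {
    "hive_aggregation": {"import_events"},
    "export_to_postgres": {"hive_aggregation"},
    "build_reports": {"export_to_postgres", "copy_prod_tables"},
}

# static_order() yields every job (including ones named only as
# dependencies) in an order that respects all the edges.
order = list(TopologicalSorter(deps).static_order())
```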
3. Data Presentation and Visualization
Query Reports
Summary report
• Higher level view of metrics
• See changes over time
• (screen shot)
Data Explorer: "The world is your oyster"
4. Real Time Infrastructure
Real Time Infrastructure (Apache Storm)
Real Time Dashboard
Alerts
Know when the house is burning down!
Then: Global Content Source and Consumption
Our Technology Stack
• Languages/Frameworks
– Ruby, Rails, Python, Go, JavaScript, Node.js
– Fluentd (Log collector)
– Java, Apache Storm, Kestrel
• Databases
– PostgreSQL, MongoDB, Redis
– Hadoop/Hive, Amazon Redshift
– Amazon Elastic MapReduce
Hadoop vs. Amazon Redshift
• Hadoop is a big-data storage and processing platform
– HDFS: data-storage layer
– YARN: resource management
– MapReduce/Pig/Hive/Spark: processing layer
• Amazon Redshift is an MPP (massively parallel processing) database
– Columnar storage, built for analytics workloads
– OLAP: Online Analytical Processing
– Examples of this class: Vertica, Amazon Redshift, ParAccel
Recap
Thank You!
http://engineering.viki.com/blog/2014/data-warehouse-and-analytics-infrastructure-at-viki/
http://bit.ly/viki-datawarehouse
engineering.viki.com