building web analytics on hadoop at cbs interactive


Building Web Analytics on Hadoop at CBS Interactive

Michael Sun

[email protected]

Big Data Workshop, Boston 03/10/2012

Brands and Websites of CBS Interactive (Samples)

GAMES & MOVIES | TECH, BIZ & NEWS | SPORTS | ENTERTAINMENT | MUSIC

CBSi Scale

• Top 20 global web property
• 235M worldwide monthly unique users
• Hadoop cluster size:
  – Current workers: 40 nodes (260 TB)
  – This month: adding 24 nodes, 800 TB total
  – Next quarter: ~80 nodes, ~1 PB
• DW peak processing: > 500M events/day globally, doubling next quarter (ad logs)

1 - Source: comScore, March 2011

Web Analytics Processing

• Collect web logs for web metrics analysis
  – Web logs track clicks, page views, downloads, streaming video events, ad events, etc.
• Provide internal metrics for web site monitoring
• A/B testing
• Billers apps, external reporting
• Ad event tracking to support sales
• Provide data services
  – Support marketing by providing data for data mining
  – User-centric datastore (stay tuned)
  – Optimize user experience


2105595680218152960 125.83.8.253 - - [07/Mar/2012:16:00:00 +0000] GET /clear/c.gif?ts=1331136009989&sid=115&ld=www.xcar.com.cn&ldc=a4887174-530c-40df-8002-e06c199ba81a&xrq=fid%3D523%26page%3D10&brflv=10.3.183&brwinsz=1680x840&brscrsz=1680x1050&brlang=zh-CN&tcset=utf8&im=dwjs&xref=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php&srcurl=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php%3Ffid%3D523%26page%3D11&title=%E5%B8%95%E6%9D%B0%E7%BD%97%E8%AE%BA%E5%9D%9B_%E5%B8%95%E6%9D%B0%E7%BD%97%E7%A4%BE%E5%8C%BA_%E5%B8%95%E6%9D%B0%E7%BD%97%E8%BD%A6%E5%8F%8B%E4%BC%9A_PAJERO%E8%AE%BA%E5%9D%9B_XCAR%20%E7%88%B1%E5%8D%A1%E6%B1%BD%E8%BD%A6%E4%BF%B1%E4%B9%90%E9%83%A8 HTTP/1.1 200 42 clgf=Cg+5E02cT/eWAAAAo0Y http://www.xcar.com.cn/bbs/forumdisplay.php?fid=523&page=11 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.802.30 Safari/535.1 SE 2.X MetaSr 1.0 - 1

schemas.append(Schema((  # schemas[0]
    SchemaField('web_event_id', 'int', nullable=False, signed=True, bits=64),
    SchemaField('ip_address', 'string', nullable=False, maxlen=15, io_encoding='ascii'),
    SchemaField('empty1', 'string', nullable=False, maxlen=5, io_encoding='ascii'),
    SchemaField('empty2', 'string', nullable=True, maxlen=5, io_encoding='ascii'),
    SchemaField('req_date', 'string', nullable=True, maxlen=30, io_encoding='ascii'),
    SchemaField('request', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='ascii'),
    SchemaField('http_status', 'int', nullable=True, signed=True),
    SchemaField('bytes_sent', 'int', nullable=True, signed=True),
    SchemaField('cookie', 'string', nullable=True, maxlen=100, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('referrer', 'string', nullable=True, maxlen=1000, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('user_agent', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('is_clear_gif_mask', 'int', nullable=False, on_null='default', on_type_error='default', signed=True, bits=2),
)))
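To make the schema concrete, here is a minimal sketch of how one log record might be turned into typed fields. It does not use the Schema/SchemaField classes from the slide (they are not public); the FIELDS table and parse_record function are hypothetical stand-ins, and the tab delimiter is an assumption, since the slide does not show how fields are separated. The sample values are taken from the log line above, with the request shortened.

```python
# Hypothetical stand-in for the schema: (field name, Python type, nullable).
FIELDS = [
    ("web_event_id", int, False),
    ("ip_address", str, False),
    ("empty1", str, False),
    ("empty2", str, True),
    ("req_date", str, True),
    ("request", str, True),
    ("http_status", int, True),
    ("bytes_sent", int, True),
    ("cookie", str, True),
    ("referrer", str, True),
    ("user_agent", str, True),
    ("is_clear_gif_mask", int, False),
]


def parse_record(line, delimiter="\t"):
    """Split one delimited line and convert each value to its declared type.

    The delimiter is an assumption; empty values become None when the
    field is nullable and raise ValueError otherwise.
    """
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = {}
    for (name, field_type, nullable), raw in zip(FIELDS, values):
        if raw == "":
            if not nullable:
                raise ValueError("field %r may not be empty" % name)
            record[name] = None
        else:
            record[name] = field_type(raw)
    return record


# Shortened version of the sample log line, joined with the assumed delimiter.
sample = "\t".join([
    "2105595680218152960", "125.83.8.253", "-", "-",
    "07/Mar/2012:16:00:00 +0000",
    "GET /clear/c.gif?ts=1331136009989&sid=115 HTTP/1.1",
    "200", "42", "clgf=Cg+5E02cT/eWAAAAo0Y",
    "http://www.xcar.com.cn/bbs/forumdisplay.php?fid=523&page=11",
    "Mozilla/5.0 (Windows NT 5.1)", "1",
])
record = parse_record(sample)
print(record["web_event_id"], record["http_status"], record["is_clear_gif_mask"])
```

The real framework also handles range errors (for example truncation at maxlen) and encodings; those are left out here to keep the sketch short.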

Modernize the platform

• The proprietary platform used for web log processing had reached its limits
  – Code base was 10 years old
  – The vendor no longer supported the version we used
  – Not fault-tolerant
  – Upgrading to the newer version was not cost-effective
• Data volume is increasing all the time
  – 300+ web sites
  – Video tracking is increasing the fastest
  – New business initiatives need support
• Use open-source systems as much as possible

Hadoop to the Rescue / Research

• Open source: scalable data processing framework based on MapReduce
• Processes petabytes of data using the Hadoop Distributed File System (HDFS)
  – High throughput
  – Fault-tolerant
• Distributed computing model
  – Based on a functional programming model
    − MapReduce (M|S|R)
• Execution engine
  – Used as a cluster for ETL
  – Collect data (distributed harvester)
  – Analyze data (M/R, streaming + scripting + R, Pig/Hive)
  – Archive data (distributed archive)
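The MapReduce model mentioned above can be illustrated on a single machine. The sketch below reads "M|S|R" as map, sort, reduce, which is an interpretation of the slide's shorthand. The input format ("site_id,event_type") and the function names are made up for the example; on a real cluster the sort step is done by the framework between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Map: emit (site_id, 1) for every event line of the form 'site_id,event'."""
    for line in lines:
        site_id, _event = line.split(",")
        yield site_id, 1


def reduce_phase(sorted_pairs):
    """Reduce: sum the counts of each run of identical keys."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_job(lines):
    """Chain map, sort, and reduce the way a MapReduce job does."""
    mapped = map_phase(lines)
    shuffled = sorted(mapped, key=itemgetter(0))  # the "S" step
    return list(reduce_phase(shuffled))


events = ["115,pageview", "115,click", "7,pageview"]
print(run_job(events))  # → [('115', 2), ('7', 1)]
```

Because the reducer only ever sees sorted input, it can aggregate with constant memory per key, which is what lets the same logic scale out across many nodes.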

The Plan

• Build web log collection (codename Fido)
  – Apache web logs piped to cronolog
  – Hourly M/R collector job to
    − Gzip hourly log files and checksum them
    − Scp from web servers to Hadoop datanodes
    − Put on HDFS
• Build Python ETL framework (codename Lumberjack)
  – Based on stdin/stdout streaming, one process/one thread
  – Can run stand-alone or on Hadoop
  – Pipeline
  – Filter
  – Schema
• Build web log processing with Lumberjack
  – Parse
  – Sessionize
  – Lookup
  – Format data/Load to DB
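Lumberjack itself is not public, so the following is only a sketch of the general idea behind a stdin/stdout pipeline of filters, written with plain Python generators. Every name here (parse, drop_bots, format_out, pipeline, run) is hypothetical, as is the three-column tab-delimited record layout. Each stage consumes a stream of records and yields a stream of records, so stages can be chained in a single process and thread.

```python
import io
import sys


def parse(lines):
    """Stage 1: split each tab-delimited line into a list of fields."""
    for line in lines:
        yield line.rstrip("\n").split("\t")


def drop_bots(records, ua_index=2):
    """Stage 2 (hypothetical filter): skip records whose user agent mentions 'bot'."""
    for record in records:
        if "bot" not in record[ua_index].lower():
            yield record


def format_out(records):
    """Stage 3: join the fields back into one tab-delimited line."""
    for record in records:
        yield "\t".join(record)


def pipeline(source, *stages):
    """Chain the stages lazily so that records flow through one at a time."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read lines from stdin, push them through the pipeline, write to stdout."""
    for line in pipeline(stdin, parse, drop_bots, format_out):
        stdout.write(line + "\n")


# Stand-alone demonstration using in-memory streams instead of real stdin/stdout.
demo_in = io.StringIO("1\t10.0.0.1\tMozilla/5.0\n2\t10.0.0.2\tGooglebot/2.1\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Because run() only touches stdin and stdout, a script built this way could be tested on a laptop with a shell pipe and then handed to Hadoop Streaming as the `-mapper` or `-reducer` program without changes.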

[Architecture diagram. Labels: Sites; Apache Logs; Distribute log by Fido; Hadoop (HDFS; Python-ETL, MapReduce, Hive); External data sources; CMS Systems; DW Database; Web Analytics; Web metrics; Billers; Data mining; Clickmap]

Web Log Processing by Hadoop Streaming and Python-ETL

• Parsing web logs
  – IAB filtering and checking
  – Parsing user agents by regex
  – IP range lookup
  – Look up product key, etc.
• Sessionization
  – Prepare Sessionize
  – Sessionize
  – Filter-unpack
• Process huge dimensions, URL/Page Title
• Load facts
  – Format load data/Load data to DB
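As an illustration of the sessionization step, here is a minimal sketch of timeout-based sessionization in a reducer-style function. It is not the production logic: the 30-minute inactivity timeout is an assumed convention, the (visitor_id, timestamp) record shape is invented for the example, and the input is assumed to arrive sorted by visitor and then by time, which is what the sort between map and reduce would provide.

```python
from itertools import groupby
from operator import itemgetter

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed value


def sessionize(events, timeout=SESSION_TIMEOUT):
    """Assign a per-visitor session number to each event.

    `events` must be (visitor_id, timestamp_seconds) pairs sorted by
    visitor_id and then timestamp. Yields (visitor_id, session_number,
    timestamp_seconds); a gap longer than `timeout` starts a new session.
    """
    for visitor, group in groupby(events, key=itemgetter(0)):
        session = 0
        last_ts = None
        for _, ts in group:
            if last_ts is not None and ts - last_ts > timeout:
                session += 1
            last_ts = ts
            yield visitor, session, ts


events = sorted([("a", 0), ("a", 600), ("a", 4000), ("b", 100)])
for row in sessionize(events):
    print(row)
```

In this example visitor "a" gets two sessions, because the gap between the events at 600 and 4000 seconds exceeds the timeout, while visitor "b" gets one.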

Benefits to Ops

• Processing time to reach SLA cut by 6 hours
• Running 2 years in production without any big issues
• Withstood a 50% per year increase in data volume
• Architecture designed to make adding new processing logic easy
• Robust and fault-tolerant
  – Five dead datanodes, jobs still ran OK
  – Upgraded JVM on a few datanodes while jobs were running
  – Reprocessed old data while processing the current day's data

Conclusions I – Create a Tool Appropriate to the Job If Existing Tools Lack What You Need

• Python ETL framework and Hadoop Streaming together can do complex, high-volume ETL work
• Python ETL framework
  – Home-grown, under review for open-source release
  – Rich functionality from Python
  – Extensible
  – NLS support
• Put on top of another platform, e.g. Hadoop
  – Distributed/parallel
  – Sorting
  – Aggregation

Conclusions II – Power and Flexibility for Processing Big Data

• Hadoop: scale and computing horsepower
  – Robustness
  – Fault tolerance
  – Scalability
  – Significant reduction of processing time to reach SLA
  – Cost-effective
    − Commodity HW
    − Free SW
• Currently:
  – Building multi-tenant Hadoop clusters using Fair Scheduler

Questions?

[email protected]

Follow up [email protected]
