building web analytics on hadoop at cbs interactive
DESCRIPTION
Building Web Analytics on Hadoop at CBS Interactive. Michael Sun [email protected] Big Data Workshop, Boston 03/10/2012. Brands and Websites of CBS interactive, Samples. ENTERTAINMENT . GAMES & MOVIES . SPORTS. TECH, BIZ & NEWS . MUSIC. CBSi Scale. - PowerPoint PPT PresentationTRANSCRIPT
Thursdays9:00 ET/PT
Building Web Analytics on Hadoop at CBS Interactive
Michael Sun
Big Data Workshop, Boston 03/10/2012
Brands and Websites of CBS interactive, Samples
GAMES & MOVIES TECH, BIZ & NEWS SPORTSENTERTAINMENT
MUSIC
CBSi Scale
• Top 20 global web property• 235M worldwide monthly unique users• Hadoop Cluster size:
– Currently workers: 40 nodes (260 TB)– This month: add 24 nodes, 800 TB total– Next quarter: ~ 80 nodes, ~ 1 PB
• DW peak processing: > 500M events/day globally, doubling next quarter (ad logs)
1 - Source: comScore, March 2011
Web Analytics Processing
• Collect web logs for web metrics analysis– Web logs by tracking clicks, page views, downloads,
streaming video events, ad events, etc
• Provide internal metrics for web sites monitoring• A/B testing• Billers apps, external reporting• Ad event tracking to support sales• Provide data service
– Support marketing by providing data for data mining– User-centric datastore (stay tuned)– Optimize user experience
1 - Source: comScore, March 2011
2105595680218152960 125.83.8.253 - - [07/Mar/2012:16:00:00 +0000] GET /clear/c.gif?ts=1331136009989&sid=115&ld=www.xcar.com.cn&ldc=a4887174-530c-40df-8002-e06c199ba81a&xrq=fid%3D523%26page%3D10&brflv=10.3.183&brwinsz=1680x840&brscrsz=1680x1050&brlang=zh-CN&tcset=utf8&im=dwjs&xref=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php&srcurl=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php%3Ffid%3D523%26page%3D11&title=%E5%B8%95%E6%9D%B0%E7%BD%97%E8%AE%BA%E5%9D%9B_%E5%B8%95%E6%9D%B0%E7%BD%97%E7%A4%BE%E5%8C%BA_%E5%B8%95%E6%9D%B0%E7%BD%97%E8%BD%A6%E5%8F%8B%E4%BC%9A_PAJERO%E8%AE%BA%E5%9D%9B_XCAR%20%E7%88%B1%E5%8D%A1%E6%B1%BD%E8%BD%A6%E4%BF%B1%E4%B9%90%E9%83%A8 HTTP/1.1 200 42 clgf=Cg+5E02cT/eWAAAAo0Y http://www.xcar.com.cn/bbs/forumdisplay.php?fid=523&page=11 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.802.30 Safari/535.1 SE 2.X MetaSr 1.0 - 1
schemas.append(Schema(( # schemas[0] SchemaField('web_event_id', 'int', nullable=False, signed=True, bits=64), SchemaField('ip_address', 'string', nullable=False, maxlen=15, io_encoding='ascii'), SchemaField('empty1', 'string', nullable=False, maxlen=5, io_encoding='ascii'), SchemaField('empty2', 'string', nullable=True, maxlen=5, io_encoding='ascii'), SchemaField('req_date', 'string', nullable=True, maxlen=30, io_encoding='ascii'), SchemaField('request', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='ascii'), SchemaField('http_status', 'int', nullable=True, signed=True), SchemaField('bytes_sent', 'int', nullable=True, signed=True), SchemaField('cookie', 'string', nullable=True, maxlen=100, on_range_error='truncate', io_encoding='utf-8'), SchemaField('referrer', 'string', nullable=True, maxlen=1000, on_range_error='truncate', io_encoding='utf-8'), SchemaField('user_agent', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='utf-8'), SchemaField('is_clear_gif_mask', 'int', nullable=False, on_null='default', on_type_error='default', signed=True, bits=2))))
Modernize the platform
• The web log processing using a proprietary platform ran into the limit– Code base was 10 years old– The version we used vendor is no longer supporting– Not fault-tolerant– Upgrade to the newer version not cost-effective
• Data volume is increasing all the time– 300+ web sites– Video tracking increasing the fastest– To support new initiatives of business
• Use open source systems as much as possible
Hadoop to the Rescue / Research
• Open-source: scalable data processing framework based on MapReduce
• Processing PB of data using Hadoop Distributed files system (HDFS)– high throughput– Fault-Tolerant
• Distributed computing model – Functional programming model based
− MapReduce (M|S|R)
• Execution engine – Used as a cluster for ETL– Collect data (distributed harvester)– Analyze data (M/R, streaming + scripting + R, Pig/Hive)– Archive data (distributed archive)
The Plan• Build web logs collection (codename Fido)
– Apache web log piped to cronolog– Hourly M/R collector job to
− Gzip hourly log files & checksum− Scp from web servers to Hadoop datanodes− Put on HDFS
• Build Python ETL framework (codename Lumberjack)– Based stdin/stdout streaming, one process/one thread– Can run stand-alone or on Hadoop– Pipeline– Filter– Schema
• Build web log processing with Lumberjack– Parse
– Sessionize
– Lookup
– Format data/Load to DB
Hadoop
External data sources
Web Analytics
HDFS
Python-ETLMapReduceHive
DW Database
Sites
Apache Logs
Distribute log by Fido
Web metrics
Billers
Data mining
CMS Systems
Clickmap
Web log Processing by Hadoop Streaming and Python-ETL
• Parsing web logs– IAB filtering and checking– Parsing user agents by regex– IP range lookup– Look up product key etc
• Sessionization– Prepare Sessionize– Sessionize– Filter-unpack
• Process huge dimensions, URL/Page Title• Load Facts
– Format Load data/Load data to DB
Benefits to Ops
• Processing time to reaching SLA, saving 6 hours• Running 2 years in production without any big issues• Withstood the test of 50% / year data volume increase• Architecture by design made easy to add new
processing logic• Robust and Fault-Tolerant
– Five dead datanodes, jobs still ran OK– Upgraded JVM on a few datanodes while jobs running– Reprocessing old data while processing data of current day
Conclusions I – Create Tool Appropriate to the Job if it doesn’t have what you want
• Python ETL Framework and Hadoop Streaming together can do complex, big volume ETL work
• Python ETL Framework– Home grown, under review for open-source release– Rich functionalities by Python– Extensible– NLS support
• Put on top of another platform, eg Hadoop– Distributed/Parallel– Sorting– Aggregation
Conclusions II – Power and Flexibility for Processing Big Data
• Hadoop - scale and computing horsepower– Robustness– Fault-tolerance– Scalability– Significant reduction of processing time to reach SLA– Cost-effective
− Commodity HW− Free SW
• Currently:– Build Multi-tenant Hadoop clusters using Fair Scheduler