Geetu Ambwani, Principal Data Scientist, Huffington Post, at MLconf NYC - 4/15/16


Upload: mlconf

Post on 13-Apr-2017


TRANSCRIPT

Data Science in the Newsroom

Geetu Ambwani, Principal Data Scientist

[email protected]

What is the Huffington Post?

Founded: May 2005
Ranking among digital-only news websites: 1
Cross-platform monthly unique visitors: over 187 million
Number of articles per day: over 500
Number of international editions: 15
Bloggers: over 100,000

News Industry - Trends

HuffPost has consistently been an innovator in the digital publishing space.

● Massive blogging network: more than 100K bloggers across the globe
● Google Site Rank
● Biggest social publisher

News Industry - Challenges

How Can Data Help?

Ad campaigns

International editions

Social media promotion

Editors

User-experience

Blog moderators

Reporters

HuffPost Studio

Content Lifecycle

Distribution, Creation, Consumption

Content Creation: How Can Data Help?

● Tools to help surface and discover trends in different parts of the web
● Content enhancement with multimedia based on semantic matching (images, slideshows, videos)
● Optimizing headlines/images (RobinHood Platform)

Content Gap: Production Versus Consumption

HuffPost data (April 9-10, 2016)

Content Consumption: How Can Data Help?

Know Your Audience

● User cohorts:
  ○ Visitors from social traffic and front-page clickers consume different content
  ○ Desktop versus mobile consumption
● Recommendations/personalization
● Can we use data to inform product design and interface?
  ○ Rearrange share buttons based on traffic origin (Facebook vs Pinterest)
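The share-button idea above can be sketched in a few lines. This is only an illustration: the referrer-to-button mapping, the default order, and the function name `order_share_buttons` are assumptions, not details from the talk.

```python
# Hypothetical mapping from referrer domain to the share button to show first.
PREFERRED_BUTTON = {
    "facebook.com": "facebook",
    "pinterest.com": "pinterest",
    "twitter.com": "twitter",
}

# Hypothetical default order used when the referrer is unknown.
DEFAULT_ORDER = ["facebook", "twitter", "pinterest", "email"]


def order_share_buttons(referrer_domain, default_order=DEFAULT_ORDER):
    """Return the share buttons with the referrer's own network moved first."""
    preferred = PREFERRED_BUTTON.get(referrer_domain)
    if preferred is None or preferred not in default_order:
        # Unknown referrer: fall back to the default layout.
        return list(default_order)
    return [preferred] + [b for b in default_order if b != preferred]


print(order_share_buttons("pinterest.com"))
# ['pinterest', 'facebook', 'twitter', 'email']
```

A visitor arriving from Pinterest sees the Pinterest button first, while visitors from unrecognized referrers get the default layout.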

Content Distribution: Can Data Help?

● People's attention is increasingly concentrated on social streams
  ○ More traffic reaches publishers from social than from any other source
● Are distributed platforms the new home page?
  ○ Facebook Instant Articles, Apple News, Snapchat Discover, Google AMP
  ○ Messenger bots
● You need to be where your audience is:
  ○ Identify the content mix that is maximally engaging on an external platform
  ○ Can we use data to seed these distribution networks? (Facebook HuffPost Pages, Snapchat Discover)

Content Distribution: Can Data Help?

● HuffPost produces 1000 articles a day - which of these do we promote?
● Article page views (PVs) follow a very skewed distribution of success
  ○ Only 1% of our articles get more than 100k PVs
● Content performs differently on different networks.
● Can we predict in advance which articles will get traction, so that we can:
  ■ Optimally seed multiple distribution channels (Facebook HP Pages, Snapchat Discover)
  ■ Target them for premium/high-value ads to maximize revenue
  ■ Populate recommendation widgets
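To make the skew concrete, here is a minimal sketch that computes the share of articles above a page-view threshold and compares mean with median. The page-view counts are made up for illustration and are not HuffPost data; the function name `share_above` is hypothetical.

```python
import statistics


def share_above(page_views, threshold=100_000):
    """Fraction of articles whose page views exceed the threshold."""
    if not page_views:
        return 0.0
    return sum(pv > threshold for pv in page_views) / len(page_views)


# Made-up page-view counts for ten articles (illustration only).
sample_pvs = [1_200, 850, 40_000, 3_100, 250_000, 900, 15_000, 600, 2_300, 7_800]

print(share_above(sample_pvs))        # share of articles above 100k PVs
print(statistics.mean(sample_pvs))    # pulled up by the single large article
print(statistics.median(sample_pvs))  # closer to a typical article
```

When the mean sits far above the median, as in this toy sample, a handful of articles account for most of the traffic, which is the pattern the slide describes.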

Content Distribution: Can Data Help?

Challenges

● The histogram of traffic is highly skewed.
● The very act of promoting something causes a bump in traffic.
● Data normalization - how long do we want to wait before predicting?
● Very imbalanced data set
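One common response to an imbalanced data set is to downsample the majority class before training. The sketch below assumes each example is a `(features, label)` pair with label 1 for a successful article and 0 otherwise. The function name, the `ratio` parameter, and the data layout are illustrative assumptions, not the team's actual pipeline.

```python
import random


def downsample_negatives(examples, ratio=1.0, seed=0):
    """Keep every positive and a random subset of the negatives.

    examples: list of (features, label) pairs, label is 1 or 0.
    ratio: number of negatives to keep per positive (assumed parameter).
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    keep = min(len(negatives), int(round(ratio * len(positives))))
    balanced = positives + rng.sample(negatives, keep)
    rng.shuffle(balanced)  # avoid all positives appearing first
    return balanced


# Toy data: 5 positives among 500 examples, roughly the 1% rate on the slides.
examples = [({"article_id": i}, 1 if i < 5 else 0) for i in range(500)]
balanced = downsample_negatives(examples)
print(len(balanced), sum(label for _, label in balanced))
```

Downsampling discards data, so the kept-negative ratio is worth tuning. Evaluation should still use the original, imbalanced distribution.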

Our Approach

● Random Forest classifier
● Multiple success criteria
● Historical examples of positive (+) and negative (-) articles, with downsampling
● Different normalization thresholds
● Feature engineering: traffic growth ratios; initial organic social traffic per minute; distinct referrers
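The three features named above could be derived from early page-view events roughly as follows. This is a sketch under stated assumptions: the event format `(minutes_since_publish, referrer_domain)`, the 60-minute window, the set of social domains, and the exact feature definitions are all hypothetical. It also does not separate organic from promoted social traffic, which the real feature would need to do.

```python
# Assumed set of referrer domains counted as social traffic.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "pinterest.com"}


def early_traffic_features(events, window_minutes=60):
    """Compute early-traffic features from page-view events.

    events: list of (minutes_since_publish, referrer_domain) tuples.
    Only events inside the observation window are used.
    """
    in_window = [(m, r) for m, r in events if 0 <= m < window_minutes]
    half = window_minutes / 2
    first_half = sum(1 for m, _ in in_window if m < half)
    second_half = len(in_window) - first_half
    social_views = sum(1 for _, r in in_window if r in SOCIAL_DOMAINS)
    return {
        # Views in the second half of the window relative to the first half.
        "growth_ratio": second_half / max(first_half, 1),
        "social_views_per_minute": social_views / window_minutes,
        "distinct_referrers": len({r for _, r in in_window}),
    }


events = [
    (2, "facebook.com"), (10, "google.com"), (35, "facebook.com"),
    (40, "twitter.com"), (50, "facebook.com"), (70, "facebook.com"),
]
features = early_traffic_features(events)
print(features["growth_ratio"], features["distinct_referrers"])  # 1.5 3
```

Feature dictionaries like this one, paired with success labels, would form the rows fed to a classifier. The window length corresponds to the "how long do we wait before predicting" trade-off raised under Challenges.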

Slackbot for the social promotion team

● 20% lift in PVs per predicted article
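The slide does not say how the lift was measured. One plausible reading is the relative difference in mean page views between promoted predicted articles and a comparable baseline, sketched below with made-up numbers. The function name `relative_lift` and both groups are assumptions.

```python
def relative_lift(treated_pvs, baseline_pvs):
    """Relative difference in mean page views: (treated - baseline) / baseline."""
    treated_mean = sum(treated_pvs) / len(treated_pvs)
    baseline_mean = sum(baseline_pvs) / len(baseline_pvs)
    return (treated_mean - baseline_mean) / baseline_mean


# Made-up numbers: promoted predicted articles versus a baseline group.
print(relative_lift([12_000, 12_000], [10_000, 10_000]))  # 0.2
```

Because promotion itself causes a bump in traffic, a meaningful comparison needs a baseline group that received similar promotion.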


Conclusion

A data-driven newsroom today means

● More than just keeping track of clicks and shares
● Using predictive analytics to drive product and content placement

Machine Learning will be a key driver for success with the advent of distributed content

Thanks!

MachineLearning@HuffPost