Data Science in the Newsroom
Geetu Ambwani, Principal Data Scientist
MLconf NYC, April 2016
What is the Huffington Post?
Founded: May 2005
Ranking among digital-only news websites: 1
Cross-platform monthly unique visitors: over 187 million
Number of articles per day: over 500
Number of international editions: 15
Bloggers: over 100,000
News Industry - Trends
HuffPost has consistently been an innovator in the digital publishing space.
Massive Blogging Network:
More than 100K bloggers across the globe
Google Site Rank
Biggest Social publisher
Ad campaigns
International editions
Social media promotion
Editors
User-experience
Blog moderators
Reporters
HuffPost Studio
Content Creation: How Can Data Help?
● Tools to help surface and discover trends in different parts of the web
● Content enhancement with multimedia based on semantic matching (images, slideshows, videos)
● Optimizing headlines and images (RobinHood platform)
Content Consumption: How Can Data Help?
Know Your Audience
● User cohorts:
○ Social-traffic visitors and front-page clickers consume different content
○ Desktop vs. mobile consumption
● Recommendations/personalization
● Can we use data to inform product design and interface?
○ Rearrange share buttons based on traffic origin (Facebook vs Pinterest)
Content Distribution: Can Data Help?
● People’s attention is increasingly concentrated on social streams
○ More traffic reaches publishers from social than from any other source
● Are distributed platforms the new home page?
○ Facebook Instant Articles, Apple News, Snapchat Discover, Google AMP
○ Messenger bots
● You need to be where your audience is:
○ Identify the content mix that is maximally engaging on an external platform
○ Can we use data to seed these distribution networks? (Facebook HuffPost Pages, Snapchat Discover)
Content Distribution: Can Data Help?
● HuffPost produces 1000 articles a day - which of these do we promote?
● Article page views (PVs) follow a very skewed distribution of success
○ Only 1% of our articles get more than 100k PVs
● Content performs differently on different networks.
● Can we predict in advance which articles will get traction, so that we can:
■ Optimally seed multiple distribution channels (Facebook HP Pages, Snapchat Discover)
■ Target premium/high-value ads to maximize revenue
■ Populate recommendation widgets
Content Distribution: Can Data Help?
Challenges
● The histogram of traffic distribution is highly skewed.
● The very act of promoting something causes a bump in traffic.
● Data normalization: how long do we want to wait before predicting?
● Very imbalanced data set
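The imbalance noted above can be illustrated with a minimal Python sketch. It is not the talk's actual pipeline. The function names (`label_articles`, `downsample_negatives`), the single 100k-PV success threshold, and the negatives-per-positive `ratio` are all assumptions made for illustration.

```python
import random


def label_articles(pageviews, threshold=100_000):
    """Label an article 1 if its page views exceed the threshold, else 0.

    The 100k default is only an assumed success criterion for this sketch.
    """
    return [1 if pv > threshold else 0 for pv in pageviews]


def downsample_negatives(rows, labels, ratio=1.0, seed=0):
    """Keep every positive example and a random sample of the negatives.

    `ratio` is the number of negatives kept per positive (hypothetical knob).
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    keep_n = min(len(negatives), int(round(len(positives) * ratio)))
    kept = sorted(positives + rng.sample(negatives, keep_n))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 1 successful article out of 100, mirroring a 1% success rate.
pageviews = [150_000] + [1_000] * 99
labels = label_articles(pageviews)
rows, balanced_labels = downsample_negatives(list(range(100)), labels, ratio=3)
```

Downsampling only the training set, while evaluating on data with the original class ratio, keeps the reported precision honest about how rare successful articles are.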
Our Approach
● Random Forest classifier
● Multiple success criteria
● Historical examples of (+) and (-) articles, with downsampling
● Different normalization thresholds
● Feature engineering: traffic growth ratios; initial organic social traffic per minute; distinct referrers
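The three engineered features named above could be computed along the following lines. This is a sketch under stated assumptions, not the talk's implementation: the `Hit` record, the `early_features` function, the 60-minute observation window, and the definition of growth ratio (second half of the window over the first half) are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    minute: float    # minutes since the article was published
    referrer: str    # referring domain, e.g. "facebook.com"
    social: bool     # True if the referrer is a social network
    promoted: bool   # True if the hit came from a publisher-run promotion


def early_features(hits, window=60.0):
    """Compute features from hits seen in the first `window` minutes."""
    early = [h for h in hits if 0 <= h.minute < window]
    half = window / 2
    first_half = sum(1 for h in early if h.minute < half)
    second_half = len(early) - first_half
    # "Organic" here means social traffic that was not driven by promotion.
    organic_social = sum(1 for h in early if h.social and not h.promoted)
    return {
        # Assumed definition: later traffic relative to earlier traffic.
        "growth_ratio": second_half / max(first_half, 1),
        "organic_social_per_min": organic_social / window,
        "distinct_referrers": len({h.referrer for h in early}),
    }


hits = [
    Hit(5, "facebook.com", True, False),
    Hit(10, "google.com", False, False),
    Hit(40, "facebook.com", True, False),
    Hit(45, "twitter.com", True, True),
    Hit(50, "facebook.com", True, False),
    Hit(50, "pinterest.com", True, False),
    Hit(90, "reddit.com", True, False),  # outside the window, ignored
]
features = early_features(hits)
```

Excluding promoted hits from the social count is one way to limit the effect of the promotion bump listed under Challenges. The resulting dictionaries, one per article, would be the input rows for a classifier such as a random forest.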
Conclusion
A data-driven newsroom today means:
● More than just keeping track of clicks and shares
● Using predictive analytics to drive product and content placement
Machine Learning will be a key driver for success with the advent of distributed content