Path to 400M Members: LinkedIn’s Data Powered Journey

Xin Fu, Carl Steinbach. Hadoop Summit Tokyo, October 26, 2016. Path to 400M* Members: LinkedIn’s Data Powered Journey. * As of Q2 2016, LinkedIn had 450M members worldwide.

Upload: hadoop-summit

Post on 07-Jan-2017

TRANSCRIPT

Page 1: Path to 400M Members: LinkedIn’s Data Powered Journey

Xin Fu, Carl Steinbach

Hadoop Summit Tokyo, October 26, 2016

Path to 400M* Members: LinkedIn’s Data Powered Journey

* As of Q2 2016, LinkedIn had 450M members worldwide

Page 2: Path to 400M Members: LinkedIn’s Data Powered Journey

[Figure: timeline with the years 2004, 2009, 2011, 2012, and 2015 marked]

Page 3: Path to 400M Members: LinkedIn’s Data Powered Journey

Real Time Visualization of New Sign-ups

Page 4: Path to 400M Members: LinkedIn’s Data Powered Journey

What Does “Data-Driven” Mean at LinkedIn?

Page 5: Path to 400M Members: LinkedIn’s Data Powered Journey

What Does “Data-Driven” Mean at LinkedIn?

Page 6: Path to 400M Members: LinkedIn’s Data Powered Journey

Monitoring & Learning

Page 7: Path to 400M Members: LinkedIn’s Data Powered Journey

What is This Phase Comprised of?

● Dashboards
● Reports
● Trend explanation
  ○ Short term fluctuation: investigation
  ○ Long term trend: strategic analysis

Page 8: Path to 400M Members: LinkedIn’s Data Powered Journey

Past Challenges

Reliability
● Easily broken without operational support, huge time spent in maintenance

Diverse technology
● Self maintained pipelines
● Various UIs with different visualization capabilities
● Redundant computation

Page 9: Path to 400M Members: LinkedIn’s Data Powered Journey

Standardized Reporting Tool

● Reduces dependency on 3rd party BI tools
● Closer integration with LinkedIn’s ecosystem of experimentation and anomaly detection solutions

Page 10: Path to 400M Members: LinkedIn’s Data Powered Journey

Towards Real Time Monitoring

[Figure: sign-ups broken down by Country, Platform, Language, Browser, Signup Type, and OS]
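A minimal sketch of the kind of per-dimension breakdown this slide points at, assuming each sign-up arrives as a dictionary with a `time` field and one field per dimension. The field names, the `signup_breakdown` function, and the fixed time window are hypothetical and are not taken from LinkedIn's monitoring stack.

```python
from collections import Counter

# Dimensions named on the slide; the snake_case field names are assumptions.
DIMENSIONS = ("country", "platform", "language", "browser", "signup_type", "os")


def signup_breakdown(events, window_start, window_end):
    """Count sign-ups per value of each dimension inside [window_start, window_end)."""
    counts = {dimension: Counter() for dimension in DIMENSIONS}
    for event in events:
        if window_start <= event["time"] < window_end:
            for dimension in DIMENSIONS:
                # Missing fields are grouped under "unknown" instead of being dropped.
                counts[dimension][event.get(dimension, "unknown")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"time": 10, "country": "JP", "platform": "mobile", "os": "iOS"},
        {"time": 20, "country": "JP", "platform": "desktop", "os": "Windows"},
        {"time": 99, "country": "US", "platform": "mobile", "os": "Android"},
    ]
    print(signup_breakdown(sample, 0, 60)["country"])
```

Re-running the function over a short, sliding window gives one counter per dimension, which is the shape a dashboard needs to show which country, platform, or browser is driving a change in sign-ups.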

Page 11: Path to 400M Members: LinkedIn’s Data Powered Journey

Experimentation & Analysis

Page 12: Path to 400M Members: LinkedIn’s Data Powered Journey

What is This Phase Comprised of?

● Experiment design
● Experiment analysis to inform ramp decisions
● Learning from multiple experiments to identify what works and what doesn’t work
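As one illustration of how experiment analysis can inform a ramp decision, the sketch below applies a standard two-proportion z-test to conversion counts from a control and a treatment group. The function names, the 0.05 threshold, and the three decision labels are assumptions made for this example; the talk does not describe the statistical method used at LinkedIn.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (lift, z, two-sided p-value) for treatment B against control A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (every unit converted, or none did): no evidence.
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


def ramp_decision(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Map the test result to a hypothetical ramp recommendation."""
    lift, _, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    if p_value < alpha and lift > 0:
        return "ramp up"
    if p_value < alpha and lift < 0:
        return "roll back"
    return "keep collecting data"


if __name__ == "__main__":
    print(ramp_decision(1000, 10000, 1100, 10000))
```

A single metric and a single test are a simplification: the later slides on metric tiers and localized metrics suggest that real ramp decisions weigh many metrics at once.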

Page 13: Path to 400M Members: LinkedIn’s Data Powered Journey

Past Challenges

Experiment design
● Interaction between experiments

Experiment analysis and ramp decision
● Manual analysis, extended time-to-decision
● Ramp decisions based on localized metrics
● Reruns needed sometimes due to undetected errors in setup

Worst of all, some ramps happened without A/B testing
● e.g. infrastructural changes

Page 14: Path to 400M Members: LinkedIn’s Data Powered Journey

Experimentation Platform @ LinkedIn

● Company-wide platform for A/B testing, ramping, and advanced targeting needs

● Automated reporting and analysis capabilities
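A common way to implement A/B assignment and ramping is deterministic hash-based bucketing, sketched below. This is a generic technique, not a description of LinkedIn's platform; the `bucket` and `variant` functions, the 1000-bucket granularity, and the experiment names are hypothetical.

```python
import hashlib

BUCKETS = 1000  # assumed granularity: one bucket is 0.1% of members


def bucket(member_id, experiment):
    """Deterministically map a member to a bucket, salted by experiment name."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def variant(member_id, experiment, ramp_percent):
    """Return "treatment" for roughly ramp_percent% of members, else "control"."""
    threshold = ramp_percent * BUCKETS / 100
    return "treatment" if bucket(member_id, experiment) < threshold else "control"


if __name__ == "__main__":
    for percent in (0, 10, 50, 100):
        treated = sum(
            variant(member_id, "new_signup_flow", percent) == "treatment"
            for member_id in range(10000)
        )
        print(percent, treated)
```

Two properties follow from this design. Salting the hash with the experiment name makes assignments in different experiments effectively independent, which limits one kind of interaction between experiments. Because the bucket is fixed per member, raising `ramp_percent` only adds members to the treatment group and never moves an existing treatment member back to control.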

Page 15: Path to 400M Members: LinkedIn’s Data Powered Journey

Tiering of Metrics

Metrics at different tiers have:
● Different review processes
● Different levels of visibility in dashboards and experiment scorecards
● Different computation priorities and SLAs in data pipelines
● Different life cycles
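The tiering idea can be pictured as a small policy table keyed by tier, sketched below. Every value here (three tiers, the review descriptions, the SLA hours, the metric names) is invented for illustration; the slide only states that these attributes differ by tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    review: str          # review process required to add or change the metric
    on_scorecard: bool   # shown by default on experiment scorecards
    priority: int        # lower value is computed first in the pipeline
    sla_hours: int       # hypothetical freshness target


# Hypothetical policies; the real tiers and their values are not given in the talk.
TIERS = {
    1: TierPolicy("company-wide review", True, 0, 6),
    2: TierPolicy("team-level review", True, 1, 24),
    3: TierPolicy("self-serve", False, 2, 72),
}


def schedule(metric_tiers):
    """Order metric names so that higher-priority tiers are computed first."""
    return sorted(metric_tiers, key=lambda name: (TIERS[metric_tiers[name]].priority, name))


if __name__ == "__main__":
    print(schedule({"signups": 1, "profile_views": 2, "beta_widget_clicks": 3}))
```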

Page 16: Path to 400M Members: LinkedIn’s Data Powered Journey

Backend Infrastructure for Tracking & Instrumentation

Page 17: Path to 400M Members: LinkedIn’s Data Powered Journey

Tracking Data Records User Activity

InvitationClickEvent()

Scale fact: ~1000 tracking event types, ~20TB per day, hundreds of metrics & data products
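To put the scale fact in perspective, the arithmetic below converts ~20TB per day into an average rate. It assumes decimal units (1 TB = 10^12 bytes) and an even spread over the day and across event types, which real traffic will not follow; peaks are higher than the average.

```python
TB = 10**12                      # decimal terabyte, an assumption
MB = 10**6
GB = 10**9
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def average_rate_mb_per_s(tb_per_day):
    """Average ingest rate in MB/s for a given daily volume in TB."""
    return tb_per_day * TB / SECONDS_PER_DAY / MB


def average_gb_per_event_type(tb_per_day, event_types):
    """Average daily volume per event type in GB, assuming an even split."""
    return tb_per_day * TB / event_types / GB


if __name__ == "__main__":
    print(round(average_rate_mb_per_s(20)))      # about 231 MB/s on average
    print(average_gb_per_event_type(20, 1000))   # 20.0 GB per event type per day
```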

Page 18: Path to 400M Members: LinkedIn’s Data Powered Journey

Tracking Data Lifecycle and Teams

Product teams: PMs, Developers, TestEng

Infra teams: Hadoop, Kafka, DWH, ...

Data teams: Analytics, Relevance Engineers, ...

Page 19: Path to 400M Members: LinkedIn’s Data Powered Journey

Example: How Do We Track a Profile View?

PageViewEvent

Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {"string" : "LinkedIn"},
    "pageKey" : "profile_page"
  },
  "trackingInfo" : {"vieweeID" : "23456", ...}
}

-- Load all page view events, then keep only views of the profile page
pageViews = LOAD '/data/tracking/PageViewEvent';

profileViews = FILTER pageViews BY header.pageKey == 'profile_page';

Page 20: Path to 400M Members: LinkedIn’s Data Powered Journey

Example: How Do We Track a Profile View?

PageViewEvent

Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {"string" : "LinkedIn"},
    "pageKey" : "new_profile_page"
  },
  "trackingInfo" : {"vieweeID" : "23456", ...}
}

-- The filter now has to list both the old and the new page key
pageViews = LOAD '/data/tracking/PageViewEvent';

profileViews = FILTER pageViews BY header.pageKey == 'profile_page' OR header.pageKey == 'new_profile_page';

Page 21: Path to 400M Members: LinkedIn’s Data Powered Journey

At Some Point It Becomes Unmaintainable ...

Page 22: Path to 400M Members: LinkedIn’s Data Powered Journey

How Do We Handle Old and New?

[Figure: diagram of producers and consumers]

Page 23: Path to 400M Members: LinkedIn’s Data Powered Journey

We had been working on something that could help...

DALI: A Data Access Layer for LinkedIn

Abstract away underlying physical details to allow users to focus solely on the logical concerns

Logical Tables + Views

Logical FileSystem

Page 24: Path to 400M Members: LinkedIn’s Data Powered Journey

DALI: Implementation Details in Context

[Figure: architecture diagram with the following labeled components]
● Data Catalog + Discovery (DALI)
● DaliFileSystem Client
● Data Source (HDFS)
● Data Sink (HDFS)
● Processing Engine (MapReduce, Spark, Presto)
● DALI Datasets (Tables + Views)
● Query Layers (Hive, Pig, Spark)
● View Defs + UDFs (Artifactory, Git)
● Dataflow APIs (MR, Spark, Scalding)
● DALI CLI

Page 25: Path to 400M Members: LinkedIn’s Data Powered Journey

Solving with DALI Views

[Figure: diagram of producers and consumers connected through DALI views]
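The idea of a view that shields consumers from producer-side changes can be sketched in a few lines. This is a conceptual illustration only and does not use the DALI API: the `profile_views` function and the output field names `viewerId` and `vieweeId` are hypothetical, while the input fields follow the PageViewEvent example earlier in the talk.

```python
# Page keys that currently mean "a profile was viewed". Producers update this
# set when a page is renamed or redesigned; consumers never see it.
PROFILE_PAGE_KEYS = {"profile_page", "new_profile_page"}


def profile_views(page_view_events):
    """Logical view: yield one normalized record per profile view."""
    for event in page_view_events:
        header = event["header"]
        if header["pageKey"] in PROFILE_PAGE_KEYS:
            yield {
                "viewerId": header["memberId"],
                "vieweeId": event["trackingInfo"]["vieweeID"],
                "time": header["time"],
            }


if __name__ == "__main__":
    events = [
        {"header": {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
         "trackingInfo": {"vieweeID": "23456"}},
        {"header": {"memberId": 12345, "time": 1454745293000, "pageKey": "new_profile_page"},
         "trackingInfo": {"vieweeID": "34567"}},
        {"header": {"memberId": 12345, "time": 1454745294000, "pageKey": "feed_page"},
         "trackingInfo": {}},
    ]
    # Consumer code depends only on the view, not on individual page keys.
    print(sum(1 for _ in profile_views(events)))
```

With this split, adding another page key is a one-line change owned by the producer side, and every consumer query that reads from the view picks it up without edits, in contrast to the growing filter condition in the earlier example.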

Page 26: Path to 400M Members: LinkedIn’s Data Powered Journey

State of the World Today with DALI

● ~100 producer views
● ~200 consumer views
● ~80 unique tracking event data sources

What’s next?
● Views on streaming data
● Selective materialization and caching
● Open source

Page 27: Path to 400M Members: LinkedIn’s Data Powered Journey

At the Core of “Data-Driven” is ....

Page 28: Path to 400M Members: LinkedIn’s Data Powered Journey

Used to be Tug of War Between Speed and Quality

Page 29: Path to 400M Members: LinkedIn’s Data Powered Journey

Before We Learned that Technology Could Break the Dichotomy Between Speed and Quality

Page 30: Path to 400M Members: LinkedIn’s Data Powered Journey

Cultural Aspects: Partnership Between Data Scientists and Engineers

Page 31: Path to 400M Members: LinkedIn’s Data Powered Journey

Interesting Challenges

- Metric trade-off, e.g. engagement vs. monetization
- Real-time everything?
- A/B test in a social network
- Human judge for personalized search
- Value of an action

Page 32: Path to 400M Members: LinkedIn’s Data Powered Journey

It Took a Village

Thanks to all the Data Scientists, Engineers and Product partners at LinkedIn for being part of this great journey!

https://engineering.linkedin.com/data