the "big data" ecosystem at linkedin

Post on 11-May-2015


DESCRIPTION

[This work was presented at SIGMOD'13.] The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.

TRANSCRIPT

The "Big Data" Ecosystem at LinkedInSIGMOD 2013Roshan Sumbaly, Jay Kreps, & Sam ShahJune 2013

©2012 LinkedIn Corporation. All Rights Reserved.


LinkedIn: the professional profile of record

225M members, 225M member profiles


Applications


Application examples

People You May Know (2 people)
Year In Review Email (1 person, 1 month)
Skills and Endorsements (2 people)
Network Updates Digest (1 person, 3 months)
Who’s Viewed My Profile (2 people)
Collaborative Filtering (1 person)
Related Searches (1 person, 3 months)
and more…


Skill sets


Rich Hadoop-based ecosystem


“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


Application examples

People You May Know (2 people)
Year In Review Email (1 person, 1 month)
Skills and Endorsements (2 people)
Network Updates Digest (1 person, 3 months)
Who’s Viewed My Profile (2 people)
Collaborative Filtering (1 person)
Related Searches (1 person, 3 months)
and more…


People You May Know


People You May Know – Workflow

Perform triangle closing for all members

Triangle closing

Rank by discounting previously shown recommendations

Push recommendations to online service

Connection stream

Impression stream

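The two core steps of this workflow (triangle closing over the connection stream, then discounting candidates already seen in the impression stream) can be sketched in plain Python. This is a toy in-memory model, not LinkedIn's actual Hadoop implementation, and the 0.4 impression-discount factor is an arbitrary illustrative choice:

```python
from collections import defaultdict

def triangle_closing(connections, impressions=None):
    """Recommend friends-of-friends, scored by shared-connection count,
    discounting candidates the member has already been shown.

    connections: iterable of (member_a, member_b) undirected edges.
    impressions: set of (member, candidate) pairs already shown.
    """
    impressions = impressions or set()

    # Build the adjacency sets (the "connection stream" materialized).
    neighbors = defaultdict(set)
    for a, b in connections:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Triangle closing: every friend-of-friend who is not already a friend
    # gets one point per shared connection.
    scores = defaultdict(lambda: defaultdict(int))
    for member, friends in neighbors.items():
        for friend in friends:
            for fof in neighbors[friend]:
                if fof != member and fof not in friends:
                    scores[member][fof] += 1

    # Rank, discounting previously shown recommendations (factor is arbitrary).
    recs = {}
    for member, candidates in scores.items():
        ranked = sorted(
            candidates.items(),
            key=lambda kv: kv[1] * (0.4 if (member, kv[0]) in impressions else 1.0),
            reverse=True)
        recs[member] = [m for m, _ in ranked]
    return recs
```

In production these steps run as separate Hadoop jobs over the full member graph; the point of the sketch is just the shape of the computation, not its scale.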

“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


Ingress – O(n²) data integration complexity

Point to point
Fragile, delayed, and potentially lossy
Non-standardized


Ingress - O(n) data integration

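The arithmetic behind these two slides: with n systems, point-to-point wiring needs a pipeline per ordered pair of systems, while routing everything through a central log needs only one producer hookup and one consumer hookup per system. A small sketch (the function name is mine):

```python
def pipeline_counts(n_systems):
    """Compare integration costs for n systems.

    Point-to-point: every system may ship data to every other system,
    i.e. one pipeline per ordered pair -> n * (n - 1), which is O(n^2).
    Central log: each system writes to the log once and reads from it
    once -> 2 * n, which is O(n).
    """
    point_to_point = n_systems * (n_systems - 1)
    via_central_log = 2 * n_systems
    return point_to_point, via_central_log
```

At ten systems that is already 90 pipelines versus 20, and the gap widens quadratically as systems are added.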

Ingress – Kafka

Distributed and elastic – multi-broker system

Categorized topics – “PeopleYouMayKnowTopic”, “ConnectionUpdateTopic”


Ingress

Standardized schemas
– Avro
– Central repository
– Programmatic compatibility

Audited ETL to Hadoop

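The "programmatic compatibility" point can be illustrated with a much-simplified schema-evolution check. Real Avro schema resolution is considerably richer (type promotion, unions, aliases); this toy treats a schema as a flat dict of field specs and only checks the default-value rule, and the function name is mine:

```python
def compatible(old_fields, new_fields):
    """Simplified full-compatibility check in the spirit of Avro evolution.

    A field spec is a dict like {"type": "string", "default": ""}.
    Dropping an old field is only safe if it had a default (old readers
    of new data can fill it in); adding a new field is only safe if it
    has a default (new readers of old data can fill it in).
    """
    for name, spec in old_fields.items():
        if name not in new_fields and "default" not in spec:
            return False
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True
```

Running a check like this against a central schema repository at commit time is what lets producers evolve event formats without silently breaking the audited ETL into Hadoop.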

“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


People You May Know – Workflow

Perform triangle closing for all members

Rank by discounting previously shown recommendations

Push recommendations to online service

Connection stream

Impression stream


People You May Know – Workflow (in reality)


Workflow Management - Azkaban

Dependency management– Historical logs

Diverse job types– Pig, Hive, Java

Scheduling
Monitoring
Visualization
Configuration
Retry/restart on failure
Resource locking

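A minimal sketch of what a dependency-managing scheduler like Azkaban does: order jobs topologically by their declared dependencies, run each one, and retry on failure. This is an illustrative toy using Python's standard `graphlib` (Python 3.9+), not Azkaban's actual API; the function and parameter names are mine:

```python
from graphlib import TopologicalSorter

def run_workflow(deps, jobs, max_retries=2):
    """Run jobs in dependency order with retry-on-failure.

    deps: job name -> set of prerequisite job names (must run first).
    jobs: job name -> zero-argument callable that raises on failure.
    Returns a log of (job, attempt_index, status) tuples.
    """
    # graphlib expects node -> predecessors, which matches deps directly.
    order = list(TopologicalSorter(deps).static_order())
    log = []
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                jobs[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception:
                if attempt == max_retries:
                    log.append((name, attempt, "failed"))
                    raise  # a failed prerequisite stops the workflow
    return log
```

The real system adds the rest of the slide's feature list (scheduling, monitoring, visualization, resource locking) around this same core: a DAG of heterogeneous jobs executed in dependency order.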

People You May Know – Workflow

Perform triangle closing for all members

Rank by discounting previously shown recommendations

Push recommendations to online service

Connection stream

Impression stream

Member id 1213 => [ Recommended member id 1734, Recommended member id 1523, … Recommended member id 6332 ]


“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


Egress – Key/Value

Voldemort– Based on Amazon’s Dynamo

Distributed and elastic
Horizontally scalable
Bulk load pipeline from Hadoop
Simple to use:

store results into 'url' using KeyValue('member_id')

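The bulk-load pipeline can be modeled as a hash-partitioned read-only store whose partitions are built offline in full and then swapped in as one unit; that atomic swap is what makes the Hadoop-to-Voldemort handoff safe. A toy in-memory sketch (class and method names are mine, and real Voldemort adds replication, versioned store directories, and rollback):

```python
import hashlib

class ReadOnlyStore:
    """Toy model of a Voldemort-style read-only key/value store."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.partitions = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Deterministic hash partitioning: the same key always maps
        # to the same node, so the offline builder and the online
        # reader agree on data placement.
        h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return h % self.num_nodes

    def bulk_load(self, records):
        """Build all partitions from a full batch (the Hadoop job's role),
        then swap them in atomically: readers never see a half-built store."""
        new = [dict() for _ in range(self.num_nodes)]
        for key, value in records:
            new[self._node_for(key)][key] = value
        self.partitions = new

    def get(self, key):
        return self.partitions[self._node_for(key)].get(key)
```

Because the store is read-only between swaps, the Hadoop side can precompute index files per node and the online side only ever does lookups, which is why the whole egress step reduces to the one-line Pig statement on the slide.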

People You May Know - Summary


Application examples

People You May Know (2 people)
Year In Review Email (1 person, 1 month)
Skills and Endorsements (2 people)
Network Updates Digest (1 person, 3 months)
Who’s Viewed My Profile (2 people)
Collaborative Filtering (1 person)
Related Searches (1 person, 3 months)
and more…


Year In Review Email


memberPosition = LOAD '$latest_positions' USING BinaryJSON;

memberWithPositionsChangedLastYear = FOREACH (
        FILTER memberPosition BY
            (start_date >= $start_date_low) AND (start_date <= $start_date_high))
    GENERATE member_id, start_date, end_date;

allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;

allConnectionsWithChange_nondistinct = FOREACH (
        JOIN memberWithPositionsChangedLastYear BY member_id, allConnections BY dest)
    GENERATE allConnections::source AS source, allConnections::dest AS dest;
allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;
pictures = FOREACH (
        FILTER memberinfowpics BY
            (cropped_picture_id IS NOT NULL) AND
            ((member_picture_privacy == 'N') OR (member_picture_privacy == 'E')))
    GENERATE member_id, cropped_picture_id,
             first_name AS dest_first_name, last_name AS dest_last_name;

resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;
connectionsWithChangeWithPic = FOREACH resultPic GENERATE
    allConnectionsWithChange::source AS source_id,
    allConnectionsWithChange::dest AS member_id,
    pictures::cropped_picture_id AS pic_id,
    pictures::dest_first_name AS dest_first_name,
    pictures::dest_last_name AS dest_last_name;

joinResult = JOIN connectionsWithChangeWithPic BY source_id, memberinfowpics BY member_id;
withName = FOREACH joinResult GENERATE
    connectionsWithChangeWithPic::source_id AS source_id,
    connectionsWithChangeWithPic::member_id AS member_id,
    connectionsWithChangeWithPic::dest_first_name AS first_name,
    connectionsWithChangeWithPic::dest_last_name AS last_name,
    connectionsWithChangeWithPic::pic_id AS pic_id,
    memberinfowpics::first_name AS firstName, memberinfowpics::last_name AS lastName,
    memberinfowpics::gmt_offset AS gmt_offset, memberinfowpics::email_locale AS email_locale,
    memberinfowpics::email_address AS email_address;

resultGroup = GROUP withName BY
    (source_id, firstName, lastName, email_address, email_locale, gmt_offset);

-- Get the count of results per recipient
resultGroupCount = FOREACH resultGroup GENERATE
    group, withName AS toomany, COUNT_STAR(withName) AS num_results;
resultGroupPre = FILTER resultGroupCount BY num_results > 2;
resultGroup = FOREACH resultGroupPre {
    withName = LIMIT toomany 64;
    GENERATE group, withName, num_results;
}

x_in_review_pre_out = FOREACH resultGroup GENERATE
    FLATTEN(group) AS (source_id, firstName, lastName, email_address, email_locale, gmt_offset),
    withName.(member_id, pic_id, first_name, last_name) AS jobChanger,
    '2013' AS changeYear:chararray,
    num_results AS num_results;

x_in_review = FOREACH x_in_review_pre_out GENERATE
    source_id AS recipientID, gmt_offset AS gmtOffset,
    firstName AS first_name, lastName AS last_name,
    email_address, email_locale,
    TOTUPLE(changeYear, source_id, firstName, lastName, num_results, jobChanger) AS body;

rmf $xir;
STORE x_in_review INTO '$url' USING Kafka();


Year In Review Email – Workflow

Find users that have changed jobs

Join with connections and metadata

(pictures)

Group by connections of these users

Push content to email service


“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


Egress - Streams

Service acts as consumer – “EmailContentTopic”

store emails into 'url' using Stream("topic=x")

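The pattern above can be sketched as a toy topic bus: the Hadoop job publishes result rows to a named topic, and the online email service drains that topic as a consumer. Real Kafka adds persistence, partitioning, and consumer offsets, none of which this in-memory toy models; the class name is mine:

```python
from collections import defaultdict, deque

class TopicBus:
    """Toy topic-based stream: offline jobs publish to a named topic
    and online services consume from it independently."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic):
        # Yield messages in publication order until the topic is drained.
        queue = self.topics[topic]
        while queue:
            yield queue.popleft()

# The egress job publishes email payloads; the email service consumes them.
bus = TopicBus()
bus.publish("EmailContentTopic", {"recipient": 1213, "body": "Year in review"})
emails = list(bus.consume("EmailContentTopic"))
```

The decoupling is the point: the Hadoop job knows only the topic name, so the same one-line STORE statement works regardless of which services subscribe downstream.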

Conclusion

Hadoop: simple programmatic model, rich developer ecosystem

Primitives for
– Ingress: structured, complete data available; automatically handles data evolution
– Workflow management: run and operate production processes
– Egress: 1-line command for exporting data; horizontally scalable, little need for capacity planning

Empowers data scientists to focus on new product ideas, not infrastructure

Future work: models of computation

• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)
• Distributed L-BFGS
• Bayesian Distributed Learning (BDL)

Graphs

Distributed learning

Near-line processing


data.linkedin.com
